TechCrunch
- OpenAI takes aim at Anthropic with beefed-up Codex that gives it more power over your desktop
InsightFinder raises $15M to help companies figure out where AI agents go wrong
NetApp Expands Google Cloud Collaboration for Sovereign, Air-Gapped Deployments
NetApp announced an expanded collaboration with Google Cloud, formalized through a four-year enterprise agreement to accelerate the deployment of NetApp storage within Google Distributed Cloud (GDC) Air-Gapped environments. Delivered with World Wide Technology (WWT), the offering targets sovereign cloud use cases that require strict data residency, security, and operational isolation.
The joint solution integrates NetApp’s data platform with Google Distributed Cloud’s full-stack private cloud architecture. The result is an air-gapped environment that supports sensitive and classified workloads while maintaining compliance with national sovereignty requirements. NetApp positions its storage systems as secure-by-design, enabling organizations to deploy controlled infrastructure that supports modern applications and AI workflows without external connectivity.
NetApp integrates its AFF all-flash systems, StorageGRID object storage, and Trident Kubernetes storage orchestration into the GDC stack. Together, these components form what the company calls an intelligent data infrastructure. Within GDC, this architecture supports zero-trust security models, local data storage, customer-managed encryption keys, and full operational control. The platform enables organizations to extend cloud capabilities to on-premises or edge environments while maintaining isolation, or to operate in fully disconnected, air-gapped configurations.
The collaboration is primarily aimed at government and regulated industries, where data-handling requirements limit the use of traditional public cloud. NetApp leadership highlighted that these environments require infrastructure capable of handling classified data while supporting modernization initiatives. By integrating with GDC, NetApp enables enterprise-grade AI and analytics capabilities within accredited environments, allowing agencies to derive insights and automate processes without compromising compliance or sovereignty.
Google Distributed Cloud is designed to extend Google Cloud services to customer-controlled locations, including on-premises data centers and edge sites. Google noted that public-sector organizations face growing pressure to extract value from data while complying with strict regulatory frameworks. GDC addresses this by enabling the deployment of cloud-native services and advanced AI in sovereign and disconnected environments.
As part of this effort, Google has expanded the availability of its AI capabilities for regulated use cases. Gemini models are now supported in GDC environments, enabling generative AI functions such as automation, content generation, discovery, and summarization directly on-premises. These capabilities can run in fully disconnected deployments, allowing organizations to leverage advanced AI while maintaining strict security and compliance boundaries.
The NetApp and Google Cloud partnership reflects a broader trend of bringing cloud and AI capabilities into controlled environments. By combining enterprise storage with sovereign cloud infrastructure, the companies are targeting organizations that require both advanced data services and strict operational isolation.
The post NetApp Expands Google Cloud Collaboration for Sovereign, Air-Gapped Deployments appeared first on StorageReview.com.
Comino Grando RTX PRO 6000 Review: 768GB of VRAM in a Liquid-Cooled 4U Chassis
Comino recently sent us the latest version of the Comino Grando for review, configured with eight NVIDIA RTX PRO 6000 Blackwell cards, each with 96GB of VRAM, for a total of 768GB of GPU memory. We reviewed the Comino back in 2024, configured with 6x RTX 4090s offering 144GB of total GPU memory, as well as a version with NVIDIA H100s. This latest update marks a substantial generational leap in both raw memory capacity and the range of workloads the platform can address.
The Grando is a purpose-built 4U platform designed to resolve the critical conflict between high-density GPU compute and thermal management. While standard air-cooled chassis crumble under the sustained 600W+ TDP demands of modern professional cards, the Grando takes a fundamentally different approach, built from the ground up around a liquid-cooled architecture capable of dissipating a massive 6.5kW of continuous heat. This is not a retrofit or an afterthought; the entire chassis, from its inverted motherboard layout to its color-coded quick-disconnect manifold system, has been engineered around the cooling loop.
The result is a platform that can sustain eight full-TDP professional GPUs in a single 4U chassis, running 24/7 in ambient environments of 3-38°C, without thermal throttling, without the acoustic assault of high-RPM air cooling, and without compromising serviceability. For organizations deploying AI inference, machine learning training, or high-performance simulation workloads at scale, the Grando offers something genuinely rare: a server that does not ask you to choose between density, thermals, and reliability.
Comino Grando Specifications
The table below shows the physical specifications and supported hardware configurations for the Comino Grando platform.
| Specification / Feature | Comino Grando |
|---|---|
| Comino Grando Server & Rackable Workstation | |
| Cooling Capacity | 6.5kW (maximum 6,500W @ 20°C intake air temperature) |
| Motherboards | Up to EATX & EEB |
| GPUs (Server) | Up to 8; NVIDIA: RTX A6000, RTX 6000 ADA, RTX PRO 6000, A40, L40, L40S, A100, H100, H200 |
| GPUs (Rackable Workstation) | Up to 6; NVIDIA: 3090, 4090, 5080, 5090, RTX A6000, RTX 6000 ADA, RTX PRO 6000, A40, L40, L40S, A100, H100, H200; AMD: W7800, W7900 |
| CPUs | Up to 2; Single Socket: Intel Xeon W-2400/2500 & 3400/3500, Intel Xeon Scalable 4th & 5th Gen, Xeon 6, AMD Threadripper PRO 5000WX/7000WX/9000WX, AMD EPYC 9004/9005; Dual Socket: Intel Xeon Scalable 4th & 5th Gen, Xeon 6, AMD EPYC 9004/9005 |
| RAM | Up to 2TB |
| M.2 drives | Up to 8x NVMe |
| Storage | Back panel hot swap cages: up to 4x hot swap SSDs (4x 7mm or 2x 15mm) and up to 4 more (4x 7mm or 2x 15mm) instead of 4th PSU; Internal 3.5″ cage up to 4x 3.5″ or 4x 2.5″ 15mm or 12x 2.5″ 7mm; Internal 2.5″ slots: up to 4x 2.5″ SSD 7mm |
| Power Supply & Operating Voltage | Up to 4x 2000W Hot Swap CRPS @ 180-264V; Up to 4x 1000W Hot Swap CRPS @ 90-140V; Redundancy modes: 4+0, 3+1, 2+2 |
| Noise level | 39dB-70dB |
| LAN | Up to 2x 10Gbit/s onboard and up to 400Gbit/s via PCIe |
| OS | Ubuntu / Windows 11 (Pro/Home) / Windows Server |
| Physical & Cooling Specifications | |
| Liquid cooling | CPU with VRM and GPU with GDDR and VRM |
| Reservoir | Comino custom 450ml with integrated pumps |
| Fans | 3x Ultra High Flow 6200RPM (high noise level) or 3x High Flow 3000RPM (low noise level) |
| Installation | 19″ rack-mountable or standalone as a Workstation |
| Required rack space | 4U |
| Size | 439 x 681 x 177mm (without handles and protruding parts) |
| Weight | 4 GPUs: 49kg (net), 67kg (gross); 6 GPUs: 52kg (net), 70kg (gross); 8 GPUs: 55kg (net), 72kg (gross) |
| Operating & storage temperature range | Storage: -5 to 50°C / 23 to 122°F; Operating: 3 to 38°C / 38 to 100°F |
| Comino Monitoring System (CMS) | |
| Overview | Controller Board with Sensors & Software for Real-Time Monitoring |
| Key Advantages | Cooling System & CPU/GPU Monitoring, Web Interface, Cooling System Log, Centralized Monitoring for Workgroups |
| Sensors & Connected Devices | Temperature (air and coolant), % Humidity, Voltage, Coolant flow, Reservoir coolant level, Fans, Pumps, Motherboard, Display, and buttons |
| Integration Possibilities | Establish monitoring via a REST API and push sensor data to monitoring software (e.g., Zabbix, Grafana) or databases (e.g., InfluxDB). |
| CMS Technical Requirements | |
| OS | Windows 11/10; Ubuntu 22.04/20.04 (dependency for Ubuntu: the target system must have the nvidia-smi and sensors utilities installed) |
| Web Browsers | Mozilla Firefox, Google Chrome, Chromium, Apple Safari, Microsoft Edge (Attention: Internet Explorer 11 is not supported) |
| Hard disk drive | 300MB |
| Controller firmware version | 1.0.6 or newer |
| Controller PCB version | 2.xx.xx |
Design, Build, and GPU Density
Chassis Layout and Deployment
The Grando Server is a masterclass in space optimization, measuring 17.3 x 26.8 x 6.97 inches (4U). Unlike traditional servers, it places the motherboard’s rear at the front of the chassis, inverting the conventional internal layout. This ensures that air-cooled components, such as RAM modules and VRMs, receive the coldest possible intake air before it reaches the liquid-cooling radiator at the rear.
The chassis itself is built to the same exacting standard, featuring solid steel construction with a matte black powder-coat finish applied inside and out. This deliberate choice extends to the tubing, cables, radiator, and PCB solder mask, reflecting a clear intention for a clean, professional aesthetic throughout. Furthermore, the system supports versatile deployment, functioning seamlessly as either a 19-inch rack-mountable unit or a standalone desktop unit. Depending on the configuration, it weighs between 148 and 159 lbs.
GPU Cold Plates and Water Blocks
The proprietary copper water blocks form the core of the Grando’s density, cooling not only the GPU die but also the other components like memory and voltage regulators. Each GPU ships as an off-the-shelf card, on which Comino mounts a custom cold-plate assembly. In practice, this thin-profile design reduces each card to a single-slot footprint, allowing six or even eight professional GPUs to sit side by side within a single 4U chassis. Our review unit shipped with eight NVIDIA RTX PRO 6000 Blackwell cards, each with a TDP of 600W, resulting in a total cooling requirement of 4,800W under full load.
Achieving the Comino’s eight-GPU, single-slot density would be nearly impossible with air cooling, since stock NVIDIA RTX PRO 6000 cards each occupy two slots and require substantial airflow. In contrast, these custom-cooled cards occupy just one slot each. The cold plates are built solidly, adding noticeable weight to each card, but that weight reflects the quality and cooling performance required at this level.
Each pair of GPUs is plumbed through a dedicated sub-manifold that consolidates both cards into a single inlet and outlet connection to the main coolant manifold. This paired approach simplifies the overall loop architecture, reduces the number of connections at the main manifold, and allows a technician to disconnect a single pair of quick-disconnect couplings to remove two cards at once, further streamlining maintenance.
Water Distribution and Manifold
At the center of the system sits a large water distribution manifold that supplies cool liquid to each GPU and CPU cold plate and provides the return path to the radiator. All connections between the manifold and the GPUs and CPU use Comino’s “TheQ” Quick Disconnect Couplings. These stainless steel dripless fittings are color-coded with red and blue rings to clearly identify the hot and cold sides of the loop, removing any ambiguity during installation or servicing.
They leave minimal residue on the mating surface when disconnected, allowing technicians to remove or replace individual GPUs or the CPU without draining the 450ml reservoir or the rest of the loop. In this way, the Grando brings the maintenance simplicity of air-cooled systems to a high-performance liquid-cooled platform.
CPU Cooling and Memory
The CPU and its voltage regulators also benefit from a dedicated cold plate connected directly to the coolant loop, preventing the processor from becoming a bottleneck during intense multi-GPU workloads. Our review unit shipped with an AMD Turin/Genoa board featuring a single AMD EPYC 9474F 48-core processor. The cold plate mirrors the quality of the card cold-plates, machined from solid copper and secured with stainless-steel hardware.
Flanking the CPU on both sides are eight fully populated DRAM slots that support configurations up to 2TB of RAM. Our review unit came equipped with 512GB of DDR5 RAM. A support bar spans the GPU and CPU area of the chassis, perpendicular to the cards, securing sensitive components like the GPUs and maintaining chassis rigidity during transport.
Radiator and Fans
Cooling is handled by a large triple 140mm radiator mounted at the rear of the chassis, paired with three high-speed 140mm fans capable of reaching 6,200 RPM and moving up to 1,000 m³/h of airflow. The dense fin stack of the thick radiator underscores the thermal headroom designed into the platform, which is rated to dissipate up to 6.5kW of sustained heat.
What is perhaps most surprising is that despite that workload and those fan speeds, the unit manages to stay within a tolerable noise envelope, with sound levels sitting at 70+ dB at full tilt. That is loud by workstation standards but notably restrained for a system dissipating the thermal output of a small electric furnace, which speaks to how effectively the Comino’s liquid loop transfers heat away from the components.
Front Panel and Telemetry Display
On the front panel, an LED display provides a live readout of key telemetry data, including pump status, ambient air temperature, coolant temperature, and fan speed. Users navigate the menu using illuminated buttons on the cooling module, with short presses to scroll through available data. A long press on the PB2 button opens additional menu branches, including Commands, Service settings, and an Event Log. In addition, the front I/O panel includes a VGA port for display output, alongside a serial port, multiple USB ports, and network connections for peripheral and device connectivity.
Power and Storage Architecture
Power Delivery and Redundancy
Supporting this level of compute requires equally robust power delivery. The Grando supports up to four hot-swap 1000W or 2000W CRPS modules in a redundant configuration, delivering up to 8.0kW at 180–264V. With support for 4+0, 3+1, and 2+2 redundancy modes, the system can tolerate PSU failures while maintaining continuous operation for 24/7 AI and HPC workloads.
Our review unit shipped with four Great Wall 2000W 80 Plus Platinum hot-swap power supplies, forming the full 8.0kW configuration.
Power delivery to each GPU runs through a centralized 12-pin power distribution board mounted between the GPU array and the main cable run. The Grando uses this distribution board to consolidate incoming power feeds and then branch them to each GPU in an organized, space-efficient manner.
PCIe, Storage, and Networking
The Grando comfortably supports six GPUs without compromising slot bandwidth, and the chassis scales to a full eight-card configuration for maximum density. The Comino’s ASRock Rack GENOAD8X-2T/BCM motherboard provides seven x16 and one x8 PCIe Gen 5 slots, meaning seven of the eight GPUs run at full x16 bandwidth while the eighth card operates at x8. This is a trade-off between the number of PCIe lanes a single-socket CPU can support and Comino’s reluctance to add the size, cost, and complexity of a PCIe switch. Moving to a dual-socket motherboard would provide more PCIe lanes but offer even fewer slots, since the second socket would occupy space otherwise used by PCIe slots in this space-constrained form factor.
Running eight GPUs in a single-socket system consumes the lion’s share of available PCIe lanes, and that comes with trade-offs. Our review unit, based on AMD Genoa, has 128 PCIe Gen 5 lanes in total. With the eight GPUs consuming 120 of those lanes, the remaining eight lanes are split x4 to each M.2 SSD slot, so it is not possible to run eight GPUs and a full complement of rear NVMe drives (connected via the two MCIO connectors) at the same time. In our full 8-GPU configuration, only two M.2 slots were available for storage. Administrators who need additional NVMe capacity alongside maximum GPU density should be aware that adding rear hot-swap NVMe storage via the back-panel cages consumes additional PCIe lanes and reduces the number of GPUs the system can host.
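To make that lane budget concrete, here is a minimal arithmetic sketch in Python. The slot widths follow the configuration described above; this is an illustration, not a vendor sizing tool.

```python
# Back-of-the-envelope PCIe Gen 5 lane budget for the single-socket Genoa
# configuration described above. Illustrative arithmetic only.
TOTAL_LANES = 128                    # AMD EPYC Genoa PCIe Gen 5 lanes
gpu_slots = [16] * 7 + [8]           # seven x16 slots plus one x8 slot
gpu_lanes = sum(gpu_slots)           # 120 lanes consumed by the eight GPUs

remaining = TOTAL_LANES - gpu_lanes  # 8 lanes left for storage
m2_slots = remaining // 4            # each M.2 slot needs x4 -> 2 usable slots

print(f"GPU lanes: {gpu_lanes}, remaining: {remaining}, usable M.2 slots: {m2_slots}")
```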
ASRock Rack GENOAD8X-2T/BCM motherboard block diagram showing CPU, PCIe Gen 5 slots, DIMM channels, M.2 slots, BMC, USB, SATA, and networking connections.
With that said, storage is equally modular and expansive, though the configuration does affect the PCIe lane budget for GPUs, which is worth planning around for the intended use case. The rear panel of our review unit features a 2.5″ drive cage that supports up to four 2.5-inch SSDs in either 4x 7mm or 2x 15mm configurations, with an optional second set of up to four available in place of the fourth PSU slot. Because our review unit required all four power-supply bays to support the full 8-GPU configuration, we had access only to the first of the two hot-swap bays. Internally, the chassis can support a 3.5-inch cage that accommodates up to four 3.5-inch drives, four 2.5-inch 15mm drives, or up to twelve 2.5-inch 7mm drives, plus four additional internal 2.5-inch 7mm SSD slots if configured.
For networking, two onboard RJ45 10 Gb/s ports powered by the Broadcom BCM57416 are standard on the motherboard, alongside a dedicated Gigabit Ethernet IPMI management port. Administrators can further increase bandwidth by installing PCIe NICs that support up to 400 Gb/s for high-bandwidth fabric connectivity, though note that additional PCIe NICs occupy GPU slots, reducing the maximum number of GPUs the system can host.
Remote Management and System Intelligence
To safeguard the hardware and optimize performance, the system includes the Comino Monitoring System (CMS). A separate, autonomous controller board drives the CMS and serves as the server’s “brain,” independent of the main operating system. In practice, this controller reads a comprehensive array of sensors that track air and coolant temperatures, humidity levels, coolant flow rates, and reservoir levels in real time. Crucially, this autonomous design enables the CMS to perform self-diagnosis and trigger emergency shutdowns upon detecting a leak or a pump failure, protecting the expensive internal hardware from damage.
A web-based GUI handles day-to-day management, providing administrators with clear visibility into cooling performance, uptime, and real-time energy consumption for the CPU and GPUs. For enterprise-scale deployments, the CMS also connects to centralized monitoring tools such as Zabbix, Grafana, and InfluxDB via its REST API. Together, these capabilities help administrators maintain the platform’s three-year service interval and keep the server running at peak efficiency without thermal throttling, even in high-ambient environments.
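As a rough illustration of that integration path, the sketch below polls a CMS sensor endpoint and forwards a couple of readings to InfluxDB’s standard v2 write API. The CMS URL and response field names are hypothetical placeholders; the actual schema comes from the CMS documentation.

```python
# Minimal sketch: poll the Comino CMS REST API and forward readings to InfluxDB.
# The CMS endpoint path and response fields are hypothetical placeholders; the
# InfluxDB v2 write endpoint and line protocol are standard.
import requests

CMS_URL = "http://cms.local/api/sensors"          # hypothetical CMS endpoint
INFLUX_URL = "http://influxdb:8086/api/v2/write"
INFLUX_PARAMS = {"org": "lab", "bucket": "grando", "precision": "s"}
INFLUX_HEADERS = {"Authorization": "Token YOUR_INFLUX_TOKEN"}

def forward_once() -> None:
    data = requests.get(CMS_URL, timeout=5).json()
    # Assume the CMS reports coolant temperature and flow; adjust names as needed.
    line = (
        f"grando_cooling coolant_temp={data['coolant_temp']},"
        f"coolant_flow={data['coolant_flow']}"
    )
    resp = requests.post(INFLUX_URL, params=INFLUX_PARAMS,
                         headers=INFLUX_HEADERS, data=line, timeout=5)
    resp.raise_for_status()

if __name__ == "__main__":
    forward_once()
```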
Beyond AI: Creative and Engineering Applications
While our testing focused on AI inference workloads, the Grando serves an equally practical role for creative professionals and engineers who need substantial local GPU compute. The 768GB of aggregate VRAM across eight RTX PRO 6000 cards unlocks capabilities that conventional workstation configurations cannot match.
FX artists and motion graphics professionals can render complex scenes with massive texture sets entirely in VRAM, eliminating the disk-swapping bottlenecks that plague productions using 8K footage or high-polygon environments. CAD engineers running computational fluid dynamics or structural simulations can tackle assemblies of unprecedented complexity without partitioning their models into multiple runs. Video editors working with multi-stream 8K RAW timelines, colorists applying ML-based noise reduction at full resolution, and 3D artists rendering path-traced finals locally rather than waiting for cloud farm availability all benefit from this density of GPU memory and compute.
The Grando does not require a full eight-GPU configuration. Comino offers the platform in four-GPU, six-GPU, and eight-GPU configurations, with all variants available for immediate shipment. Smaller studios, independent creators, and engineering teams can right-size their investment to current needs while retaining a clear upgrade path as workloads grow.
Platform Trade-offs: Density vs. Expandability
The Grando’s compact design delivers exceptional GPU density and thermal management within a standard 4U footprint, but that density involves architectural trade-offs worth understanding before deployment.
The chassis accommodates motherboards with EATX and EEB form factors, but not extended server boards found in traditional dual-socket platforms. This limits the total number of PCIe lanes available for peripherals beyond the GPU array. In our eight-GPU configuration, the AMD EPYC processor’s 128 PCIe Gen 5 lanes are almost entirely consumed by the GPUs, leaving little bandwidth for additional NVMe storage or high-speed networking beyond the onboard 10GbE ports.
This contrasts with the eight-GPU platforms we have reviewed from Dell, HPE, and Supermicro. Those systems use larger chassis, dual-socket configurations, and PCIe switch topologies to support significantly more peripheral connectivity. They typically accommodate four to eight additional NICs or DPUs alongside the full GPU complement, plus eight or more hot-swap NVMe bays, making them well-suited for distributed inference workloads that require high-bandwidth fabric interconnects.
However, that expanded capability comes at a substantial cost. Power draws exceed 8kW. Thermal loads require dedicated data center cooling infrastructure. Noise floors preclude deployment outside purpose-built machine rooms. And lead times frequently stretch six to eighteen months due to persistent supply constraints on enterprise GPU platforms.
The Grando occupies a different position. For organizations that prioritize rapid deployment, manageable operating environments, and inference or creative workloads over large-scale distributed training, the trade-offs are often favorable. Teams that need their hardware now, in an environment they can actually work with, may find the Grando’s approach to density more practical than waiting in a queue for a platform they cannot realistically deploy once it arrives.
Comino Grando Performance Testing Results
System Configuration
- Chassis: Comino Grando
- Motherboard: ASRock Rack GENOAD8X-2T/BCM
- CPU: AMD EPYC 9474F 48C
- Memory: 512GB DDR5
- GPU: 8 x NVIDIA RTX PRO 6000
- Storage: M.2 SSD
Claude Code Serving – MiniMax M2.5
Beyond traditional raw LLM inference benchmarks, we wanted to evaluate how well this hardware performs in an agentic coding workflow, specifically by serving multiple concurrent Claude Code sessions using a locally hosted model. This use case maps directly to development team productivity: how many engineers can simultaneously use an AI coding assistant served from a single node before the experience degrades?
To test this, we built a benchmark harness that generates a dataset of moderately difficult coding problems (such as implementing an LRU cache, building a CLI todo application, writing a markdown converter, and constructing a REST API) and runs each Claude Code session in a separate Docker container against the local vLLM server. A transparent proxy sits between the sessions and the inference endpoint, capturing per-request metrics for each Claude Code instance. The model used was MiniMax M2.5, served via vLLM on the system’s eight NVIDIA RTX PRO 6000 GPUs. While not the top-ranked coding model on public leaderboards, M2.5 is a capable model that many users, including our developer friends, run locally.
For a baseline reference point, we use Anthropic’s Claude Opus 4.6 average output throughput via OpenRouter.ai, one of the most popular routing services for production API access. That baseline comes in at approximately 37 tokens per second per API request.
We measured two key metrics: the average output tokens per second per Claude Code session (what each developer experiences) and the aggregate output tokens per second across all sessions (the total work the server produces).
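For context, the sketch below shows the shape of such a measurement: N worker threads issue chat completions against a local OpenAI-compatible vLLM endpoint, and per-session and aggregate output rates are derived from token counts and wall-clock time. It is a simplified stand-in for our harness, which drives full Claude Code sessions in containers through a metrics proxy; the endpoint URL, model ID, and prompt are placeholders.

```python
# Simplified concurrency measurement against an OpenAI-compatible vLLM server.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_session(prompt: str) -> tuple[int, float]:
    # Issue one completion and return (output tokens, duration in seconds).
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="MiniMaxAI/MiniMax-M2.5",          # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.usage.completion_tokens, time.perf_counter() - start

def run(concurrency: int) -> None:
    prompts = ["Implement an LRU cache in Python."] * concurrency
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_session, prompts))
    wall = time.perf_counter() - wall_start
    per_session = sum(tok / dur for tok, dur in results) / concurrency
    aggregate = sum(tok for tok, _ in results) / wall
    print(f"{concurrency} sessions: {per_session:.1f} tok/s per session, "
          f"{aggregate:.1f} tok/s aggregate")

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        run(n)
```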
Based on the results, a single concurrent Claude Code session delivers 67.3 tok/s per user and an aggregate output of 64.7 tok/s. At two sessions, per-instance throughput drops modestly to 57.4 tok/s, while aggregate output climbs to 95.1 tok/s as vLLM’s batching begins to amortize overhead. Four concurrent sessions maintain 49.2 tok/s per user, still a highly responsive experience for interactive coding workflows, while aggregate throughput reaches 177.2 tok/s. Eight sessions represent the sweet spot for aggregate output, peaking at 206.7 tok/s total, while per-instance throughput settles at 38.7 tok/s, a level that remains comfortable for real-time code generation and iteration.
At 16 concurrent sessions, the system exhibits the classic batching trade-off: per-instance throughput drops to 31.1 tok/s, and aggregate output falls to 105.8 tok/s. This suggests that, at this concurrency level, the 230B MiniMax M2.5 model is pushing the limits of what eight GPUs can sustain without introducing meaningful latency for each user. The aggregate dip from 8 to 16 sessions reflects the memory-bandwidth demands of a large MoE architecture under heavy simultaneous decode load, rather than a scheduling inefficiency.
For organizations evaluating self-hosted AI infrastructure for developer tooling, the Grando makes a strong case. Running a frontier-class 230B model, it can comfortably serve up to eight simultaneous Claude Code sessions at throughput levels that feel genuinely interactive, with per-user speeds exceeding 38 tok/s at peak aggregate output. Teams of four to eight engineers can operate at near-optimal throughput without perceptible degradation in responsiveness.
The liquid-cooled architecture also makes this level of compute practical in environments where traditional GPU servers cannot operate. The system runs quietly enough to sit in a startup office, a small machine room, or a dedicated corner of an open workspace. Air-cooled systems with similar GPU density typically reach 90 dB or higher, which is loud enough to require dedicated data center space or, at a minimum, a closed server closet with serious acoustic treatment. The Grando can coexist with the team that uses it. Combined with full data locality, no per-token API costs, and complete control over model selection, it offers a self-hosted path that scales with a growing development team without requiring datacenter infrastructure or lockstep cost increases.
vLLM Online Serving – LLM Inference Performance
vLLM is one of the most popular high-throughput inference and serving engines for LLMs. The vLLM online serving benchmark evaluates the real-world serving performance of this inference engine under concurrent requests. It simulates production workloads by sending requests to a running vLLM server, with configurable parameters such as request rate, input and output lengths, and the number of concurrent clients. The benchmark measures key metrics, including throughput (tokens per second), time to first token, and time per output token (TPOT), helping users understand how vLLM performs under different load conditions.
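To make the two latency metrics concrete, here is a minimal sketch that measures them with a streaming request against an OpenAI-compatible vLLM server. Counting streamed chunks is only an approximation of the output token count, and the server URL and model name are placeholders; vLLM’s own benchmark scripts compute these metrics with exact tokenization.

```python
# Rough TTFT/TPOT measurement with a single streaming chat completion.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure(prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct") -> None:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
            chunks += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    tpot = (end - first_token_at) / max(chunks - 1, 1)  # time per output token
    print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/token")

measure("Summarize the benefits of liquid cooling in two sentences.")
```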
We tested inference performance across a comprehensive suite of models spanning various architectures, parameter scales, and quantization strategies to evaluate throughput under different concurrency profiles.
Summary Of Results
| Model | Precision | Equal (256/256) | Prefill-Heavy (8k/1k) | Decode-Heavy (1k/8k) |
|---|---|---|---|---|
| Comino Grando w/ 8× RTX PRO 6000 Blackwell — vLLM Inference Results (tok/s, peak at BS=256) | ||||
| GPT-OSS 20B | ep_dp1 | 17,280 | 32,061 | 11,187 |
| GPT-OSS 120B | ep_dp1 | 11,726 | 21,636 | 7,570 |
| Llama 3.1 8B Instruct | FP8 | 12,109 | 20,137 | 7,353 |
| Llama 3.1 8B Instruct | FP4 | 11,954 | 20,206 | 7,239 |
| Llama 3.1 8B Instruct | BF16 | 11,752 | 17,346 | 6,155 |
| Qwen3 Coder 30B A3B | FP8 | 10,985 | 16,659 | 4,907 |
| Qwen3 Coder 30B A3B | BF16 | 10,588 | 16,680 | 4,829 |
| Mistral Small 3.1 24B | BF16 | 8,925 | 11,846 | 4,975 |
| MiniMax M2.5 (230B) | ep_dp1 | 5,753 | 7,357* | 2,555 |
| All values in tok/s, peak throughput at BS=256. *MiniMax M2.5 prefill-heavy peaked at BS=128 (7,357 tok/s); BS=256 was 7,141 tok/s. | ||||
GPT-OSS 120B and 20B
The GPT-OSS model family was tested in both 120B and 20B configurations on the Comino Grando.
GPT-OSS 120B
Under equal workload (256/256), the 120B model delivers 268.85 tok/s at BS=1, reaches 6,666.23 tok/s at BS=64, and peaks at 11,726.04 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,375.69 tok/s, climbs to 16,374.19 tok/s at BS=64 and 17,944.55 tok/s at BS=128, and peaks at 21,636.41 tok/s at BS=256. Decode-heavy (1k/8k) grows from 196.28 tok/s at BS=1 to 7,569.97 tok/s at BS=256, with latency well-controlled at lower concurrency levels.
GPT-OSS 20B
The 20B model delivers 334.80 tok/s at BS=1 under equal workload, reaches 10,303.56 tok/s at BS=64, and peaks at 17,280.12 tok/s at BS=256. Prefill-heavy starts at 2,007.90 tok/s, climbs to 24,990.46 tok/s at BS=64 and 26,866.25 tok/s at BS=128, peaking at 32,060.72 tok/s at BS=256, the highest absolute prefill throughput recorded across both model sizes. Decode-heavy grows from 286.08 tok/s at BS=1 to 11,187.36 tok/s at BS=256, delivering roughly 1.5× the decode throughput of the 120B at peak concurrency while maintaining tighter latency throughout.
Qwen3 Coder 30B A3B Instruct and FP8 Instruct
The Qwen3-Coder-30B-A3B-Instruct model was tested with both BF16 and FP8 precision.
Qwen3-Coder-30B-A3B-Instruct (BF16)
Under an equal workload (256/256), the BF16 model delivers 1,902.32 tok/s at BS=8, reaches 6,683.58 tok/s at BS=64, and peaks at 10,587.56 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,256.03 tok/s at BS=1, climbs to 14,400.57 tok/s at BS=64 and 15,308.35 tok/s at BS=128, and peaks at 16,679.52 tok/s at BS=256. Decode-heavy (1k/8k) grows from 169.19 tok/s at BS=1 to 4,828.82 tok/s at BS=256, with latency well-controlled at lower concurrency levels.
Qwen3-Coder-30B-A3B-Instruct (FP8)
The FP8 model delivers throughput comparable to BF16 across most scenarios, with equal workload reaching 6,478.54 tok/s at BS=64 and peaking at 10,984.61 tok/s at BS=256, a slight improvement over BF16 at peak concurrency. Prefill-heavy starts at 987.48 tok/s at BS=1, climbs to 14,036.46 tok/s at BS=64 and 15,156.69 tok/s at BS=128, and peaks at 16,658.98 tok/s at BS=256. Decode-heavy grows from 130.70 tok/s at BS=1 to 4,906.51 tok/s at BS=256, marginally outpacing BF16 at peak concurrency while the two configurations remain closely matched throughout the rest of the concurrency range.
Mistral Small 3.1 24B Instruct 2503
Under an equal workload (256/256), the model delivers 1,598.79 tok/s at BS=8, reaches 4,713.84 tok/s at BS=64, and scales strongly to 8,925.12 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 897.84 tok/s at BS=1, climbs to 9,632.58 tok/s at BS=64 and 11,488.13 tok/s at BS=128, peaking at 11,846.15 tok/s at BS=256. Decode-heavy (1k/8k) grows from 124.98 tok/s at BS=1 to 2,653.82 tok/s at BS=64, then accelerates noticeably at higher concurrency levels, reaching 4,262.53 tok/s at BS=128 and peaking at 4,975.06 tok/s at BS=256, reflecting the model’s ability to sustain strong decode throughput as concurrency scales.
Llama 3.1 8B Instruct
The Llama-3.1-8B-Instruct model was tested across three precision configurations on the Comino, providing a clear view of how quantization affects throughput for this model size.
Llama 3.1 8B Instruct BF16
Under an equal workload (256/256), the BF16 model delivers 2,776.42 tok/s at BS=8, reaches 7,369.01 tok/s at BS=64, and peaks at 11,751.56 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,645.29 tok/s at BS=1, climbs to 14,990.47 tok/s at BS=64 and 17,140.71 tok/s at BS=128, and peaks at 17,345.80 tok/s at BS=256. Decode-heavy (1k/8k) grows from 234.78 tok/s at BS=1 to 6,154.73 tok/s at BS=256.
Llama 3.1 8B Instruct FP8
FP8 quantization delivers a meaningful uplift across all scenarios. The equal workload reaches 7,530.39 tok/s at BS=64 and peaks at 12,108.98 tok/s at BS=256. Prefill-heavy climbs to 16,546.53 tok/s at BS=64 and 19,306.49 tok/s at BS=128, peaking at 20,137.35 tok/s at BS=256, roughly a 16% gain over BF16 at peak concurrency. Decode-heavy peaks at 7,353.40 tok/s at BS=256, approximately 19% ahead of BF16.
Llama 3.1 8B Instruct FP4
FP4 delivers throughput that is closely competitive with FP8 at higher concurrency levels, though it falls slightly behind at lower batch sizes. The equal workload peaks at 11,954.40 tok/s at BS=256, and prefill-heavy reaches its highest point at 20,205.57 tok/s at BS=256, narrowly edging out FP8 at peak concurrency. Decode-heavy peaks at 7,239.29 tok/s at BS=256, remaining within a few percent of FP8 throughout, making FP4 a compelling option when memory efficiency is a priority without a meaningful sacrifice in throughput.
MiniMax M2.5
The MiniMax-M2.5 230B, tested on the Comino Grando, was the largest and most demanding model we used.
Under an equal workload (256/256), the model starts at 16.35 tok/s at BS=1, reaches 2,751.25 tok/s at BS=64, and scales strongly at higher concurrency, peaking at 5,753.24 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 606.97 tok/s at BS=1, climbs steadily to 5,351.02 tok/s at BS=32 and 6,557.92 tok/s at BS=64, reaching its peak at 7,357.26 tok/s at BS=128 before slightly tapering to 7,140.74 tok/s at BS=256, suggesting the model approaches saturation in prefill throughput beyond BS=128. Decode-heavy (1k/8k) grows consistently from 82.21 tok/s at BS=1 to 1,485.28 tok/s at BS=64, peaking at 2,554.87 tok/s at BS=256, reflecting the expected memory bandwidth demands of a 230B MoE architecture under sustained decode workloads.
Conclusion
The Comino Grando is best understood as a system purpose-built to unlock the full potential of eight NVIDIA RTX PRO 6000 GPUs. Every major design decision, from the inverted motherboard layout to the cooling loop and integrated monitoring stack, is intended to ensure those GPUs can operate continuously at full 600W TDP without thermal or power constraints.
What makes the Grando compelling is not any single feature in isolation but the way the entire system coheres. The liquid cooling is not a bolt-on addition; it is the architecture. The power delivery is redundant, hot-swappable, and scaled to the 4,800W load of eight 600W cards with headroom to spare. The monitoring system goes beyond reporting temperatures; it autonomously protects the hardware when something goes wrong. Nothing here feels like an afterthought.
The performance numbers reinforce that cohesion. Across a diverse suite of models, from Llama 3.1 8B to the 230B MiniMax M2.5, the Grando delivered throughput figures that hold up well for a self-hosted platform. Claude Code concurrency testing put a finer point on the practical value: eight engineers can run simultaneous agentic coding sessions against a locally hosted 230B model at interactive speeds, with per-user throughput exceeding 38 tok/s at peak aggregate output. Teams of four to eight can operate at near-optimal throughput without perceptible degradation.
The value of this configuration extends beyond AI inference. With 96GB of VRAM per GPU and dense multi-GPU scaling, the platform is equally well suited for high-end creative and engineering workloads, including VFX rendering, large-scale simulation, and complex CAD pipelines. The system also scales down to six- and four-GPU configurations, making this level of performance accessible to smaller studios and teams that still require workstation-class density.
Where the Grando differs most from the enterprise eight-GPU platforms we have reviewed is in deployment practicality. Those systems offer more PCIe lane headroom, more NIC slots, and deeper storage connectivity, but they also require dedicated data center infrastructure, draw well over 8kW, and have lead times that can stretch beyond a year. The Grando trades some of that peripheral expandability for a system that runs quietly enough to share a room with its users, dissipates less heat into the surrounding environment, and ships now. For organizations that prioritize rapid deployment and manageable operating environments over maximum fabric connectivity, the trade-off is favorable.
Product Page – Comino Grando
Comino Configurator – Page
The post Comino Grando RTX PRO 6000 Review: 768GB of VRAM in a Liquid-Cooled 4U Chassis appeared first on StorageReview.com.
Feds will require data centers to show their power bills
From Earth to the Moon: How Phison Helped Power the First Lunar Data Center
Broadcom Extends VMware Tanzu Platform with Agent Foundations for Enterprise AI
At the AI in Finance Summit, Broadcom introduced VMware Tanzu Platform agent foundations, positioning it as a secure-by-default runtime for building and operating autonomous AI applications on VMware Cloud Foundation (VCF). The release extends Tanzu’s established code-to-production model to AI agents, targeting enterprise teams seeking to move from isolated AI experiments to governed, production-scale deployments.
Moving AI Agents into Enterprise Operations
As AI agents assume execution and decision-making roles, operational requirements shift toward governance, security, and integration with enterprise systems. Many organizations still run AI workloads in isolated environments that lack access to core data and standardized controls.
Tanzu Platform agent foundations address this gap by providing a pre-engineered platform-as-a-service layer for agent workloads, built directly on VCF. This enables platform engineering teams to manage AI services alongside traditional applications with familiar tooling and processes, without requiring deep specialization in AI infrastructure.
Deny-by-Default Agent Runtime
The agentic runtime introduces a set of controls to constrain agent behavior and reduce operational risk.
The software supply chain is managed using trusted Buildpacks rather than user-defined Dockerfiles. Containers are automatically built, patched, and verified, reducing exposure to embedded vulnerabilities or malicious components.
Secrets management is enforced at the structural level, preventing agents from accessing credentials outside their scope. This isolation is reinforced by VMware vDefend, which extends protections across infrastructure services and external SaaS integrations, limiting lateral movement.
Networking uses a zero-trust model. Agents operate within predefined resource and connectivity boundaries and have no default access to internal systems or models. Access is granted explicitly via secure service bindings, ensuring agents interact only with authorized data sources and services.
Developer Onboarding and Integrated Data Services
The platform includes pre-built agent templates to accelerate onboarding. Developers can provision agents with governed access to models, Model Context Protocol servers, and curated marketplace services defined by IT.
Data services are integrated into the platform, including Tanzu for Postgres with pgvector, as well as caching, streaming, and data flow services. Support for Spring AI memory services enables stateful agent behavior that aligns with enterprise application patterns.
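As a rough illustration of what an agent memory lookup against a pgvector-enabled Postgres looks like, the sketch below stores and retrieves embeddings with plain SQL. This is generic pgvector usage in Python, not a Tanzu- or Spring AI-specific API, and the connection string, table name, and tiny three-dimensional embeddings are placeholders (real embeddings have hundreds or thousands of dimensions).

```python
# Generic pgvector example: store agent "memories" with embeddings and retrieve
# the nearest ones by cosine distance. All identifiers are illustrative only.
import psycopg

with psycopg.connect("postgresql://agent:secret@postgres:5432/agentdb") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute(
            "CREATE TABLE IF NOT EXISTS agent_memory ("
            "id serial PRIMARY KEY, content text, embedding vector(3))"
        )
        cur.execute(
            "INSERT INTO agent_memory (content, embedding) "
            "VALUES ('customer prefers weekly summaries', '[0.1, 0.8, 0.2]')"
        )
        # Nearest-neighbor retrieval: <=> is pgvector's cosine distance operator.
        cur.execute(
            "SELECT content FROM agent_memory "
            "ORDER BY embedding <=> '[0.1, 0.7, 0.3]' LIMIT 5"
        )
        for (content,) in cur.fetchall():
            print(content)
```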
Operational Scaling on VMware Cloud Foundation
Tanzu Platform agent foundations integrate with VCF infrastructure APIs to abstract away resource provisioning and lifecycle management. This ensures that agent workloads and their dependencies receive the required compute, storage, and networking resources without direct interaction with the infrastructure.
Elastic scaling allows environments to scale up or down based on workload demand, supporting both short-lived and persistent agents while optimizing cost and utilization.
High availability is achieved through multiple layers of redundancy and automated remediation. The platform continuously monitors and self-heals the underlying infrastructure to maintain service continuity for mission-critical autonomous applications.
An integrated AI gateway provides centralized control of model and tool access. It manages availability, usage policies, cost controls, and safety filtering for both public and private models on VCF.
According to Purnima Padmanabhan, General Manager of the Tanzu Division at Broadcom, the rapid pace of agentic application development is driving collaboration with customers to accelerate innovation. She highlights that Tanzu Platform agent foundations enable the rapid deployment of agentic ideas on modern private clouds, specifically using VMware Cloud Foundation 9.
With agent foundations, Broadcom is aligning Tanzu Platform with emerging enterprise AI requirements, with a focus on governance, security, and operational consistency. The approach builds on existing VMware infrastructure investments and introduces a standardized runtime for agent-based applications, making AI deployment more predictable and manageable at scale.
The post Broadcom Extends VMware Tanzu Platform with Agent Foundations for Enterprise AI appeared first on StorageReview.com.
Wasabi Technologies to Acquire Seagate Lyve Cloud Business
Wasabi Technologies has reached a definitive agreement to acquire the Lyve Cloud business from Seagate Technology. As part of the transaction, Seagate will receive an equity stake in Wasabi, officially becoming a shareholder. While specific financial details remain undisclosed, the move marks a significant consolidation in the pure-play cloud storage market.
David Friend, co-founder and CEO of Wasabi, noted that the acquisition bolsters the company’s position as a leader in independent cloud storage. The integration of Lyve Cloud brings a dedicated enterprise customer base into Wasabi’s ecosystem. These customers will transition to Wasabi’s global data center infrastructure, which features specialized security tools such as Covert Copy and integrated AI capabilities. The provider intends to maintain high levels of technical support and partner integration for the incoming Lyve Cloud users.
For Seagate, the divestiture serves a specific strategic purpose. Gianluca Romano, Seagate’s CFO, indicated that the sale allows the company to refocus resources on its core mass-capacity storage hardware business. As the demand for high-capacity drives continues to climb, Seagate aims to prioritize manufacturing and innovation in hard drive technology. By transitioning the cloud service to Wasabi, Seagate ensures that a specialized provider services its existing cloud customers while the manufacturer maintains an indirect interest through its new equity position.
Engineering for Enterprise Scale
The proliferation of AI initiatives, large-scale analytics, and extensive video workloads is currently driving demand for enterprise-grade storage. As organizations manage data volumes reaching the petabyte scale, the total cost of ownership and vendor complexity become critical factors in infrastructure design. Many firms are moving away from traditional hyperscalers in favor of providers that offer predictable pricing models and robust security without the egress fees often associated with legacy cloud platforms.
Lyve Cloud established itself as a viable enterprise platform by prioritizing compliance and security features. By merging these assets with Wasabi’s established channel reach and execution strategy, the combined entity provides a streamlined alternative for professional IT environments. The acquisition aims to deliver consistent performance at scale while addressing the economic challenges of long-term data retention.
Ecosystem Integration and Data Protection
The consolidation of these two platforms simplifies the data protection and backup landscape for administrators. Both Wasabi and Lyve Cloud maintain deep integrations and certifications with leading backup software providers, including Veeam, Rubrik, and Commvault. This overlap ensures that existing automated workflows and S3-compatible API calls remain functional during and after the transition.
For channel partners and system integrators, the acquisition reduces the overhead of managing multiple independent S3-compatible storage vendors. By unifying the service under a single banner, Wasabi enhances its ability to support mission-critical backup and recovery workloads. This move strengthens the broad ecosystem of independent storage solutions, providing technical teams with a reliable, cost-effective target for enterprise data offloading.
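To illustrate why that S3 compatibility matters, the sketch below shows the kind of boto3 calls backup tooling relies on; moving between S3-compatible providers typically changes only the endpoint URL and credentials. The endpoint, bucket, and object keys here are placeholders, not actual Wasabi or Lyve Cloud values.

```python
# Illustrative check that an S3-compatible workflow keeps working after a
# provider transition: the same boto3 calls, only the endpoint URL changes.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-object-store.com",  # provider's S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Standard S3 operations used by backup tools remain unchanged.
s3.put_object(Bucket="backups", Key="veeam/restore-point-001", Body=b"...")
for obj in s3.list_objects_v2(Bucket="backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```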
The post Wasabi Technologies to Acquire Seagate Lyve Cloud Business appeared first on StorageReview.com.
TechCrunch
- AI data center startup Fluidstack in talks for $1B round at $18B valuation months after hitting $7.5B, says report
Supermicro Unveils Three New Edge AI Systems Built on AMD EPYC 4005
Supermicro has introduced three new compact edge computing systems based on AMD’s EPYC 4005 series processors, expanding its push into AI workloads beyond the traditional data center. The new lineup includes the AS-E300-14GR, AS-1116R-FN4, and AS-3015TR-i4.
The systems are designed for deployments where space is tight, power is limited, and dedicated IT support may not be readily available. That makes them a good fit for settings such as retail stores, manufacturing sites, healthcare environments, and branch offices, where companies increasingly process data locally rather than sending it back to centralized infrastructure.
Supermicro says the systems are built for real-time inference and other business-critical edge workloads, with a focus on keeping power consumption and operating costs in check. Use cases include loss prevention, frictionless checkout, and in-store analytics, as well as other applications that rely on fast on-site processing.
All three systems include security features that are becoming standard requirements in edge and distributed IT environments. They support TPM 2.0 and AMD Secure Encrypted Virtualization (SEV) to help protect workloads and data. Those features are paired with IPMI 2.0 remote management, which simplifies monitoring and administration for systems deployed far from centralized IT teams.
The platforms also include four GbE ports, enabling connections to point-of-sale infrastructure, cameras, and enterprise networks. This capability is important in edge environments where a single system may need to interface with multiple devices and applications simultaneously, particularly in retail and industrial settings.
At the processor level, the new systems are built around AMD’s EPYC 4005 series, based on the company’s Zen 5 architecture. The chips support DDR5 memory and PCIe Gen 5, with TDPs starting at 65W. Some models also feature AMD’s 3D V-Cache, which can boost performance in data-intensive workloads by improving access to frequently used data.
Supermicro AS-E300-14GR
First is the AS-E300-14GR, a compact 1U mini box system housed in a 2.5-liter enclosure. It supports up to 16-core processors and up to 192GB of DDR5 memory, and is designed for embedded or space-constrained environments. Supermicro said it is suited to point-of-sale applications via HDMI and MiniDisplay connectivity, as well as network gateway roles. It includes a dedicated out-of-band management port alongside four GbE ports.
The AS-E300-14GR is a Mini-1U embedded system that supports up to 192GB of DDR5-5600 memory across four DIMM slots, a substantial amount for a compact edge box.
On the storage side, it includes one internal 2.5-inch SATA bay, two M.2 PCIe 5.0 x4 NVMe slots, and a low-profile PCIe 5.0 x16 slot for expansion or accelerator support. Connectivity is a strong point, with four 1GbE LAN ports, a dedicated BMC management port, rear USB 3.2 ports, HDMI 2.1, and Mini-DP. All of that fits inside a fan-based embedded chassis measuring just 264.8 x 43 x 225.8mm, making it a practical option for edge deployments where space is limited but performance, networking, and remote management still matter.
Supermicro AS-E300-14GR Specifications
| Specification | AS-E300-14GR |
|---|---|
| Overview | |
| Model | IoT SuperServer AS-E300-14GR |
| System Type | Mini-1U embedded system with AMD EPYC |
| Key Applications | Healthcare, Surveillance Security Server, AI Inference, Digital Signage / PoS |
| Form Factor | Fan-based Embedded |
| Chassis | CSE-E300 |
| Motherboard | Super H14SRV-HLN4F |
| Processor and Memory | |
| Processor | Single Socket AM5 (LGA-1718) AMD EPYC 16C/32T; 64MB Cache |
| System Memory | Slot Count: 4 DIMM slots Max Memory (1DPC): 192GB 5600MT/s ECC/non-ECC DDR5 UDIMM |
| Storage and Expansion | |
| Drive Bays Configuration | Default: Total 1 bay 1 internal fixed 2.5″ SATA* drive bay (*SATA support may require additional storage controller and/or cables) |
| M.2 | 1 M.2 PCIe 5.0 x4 NVMe slot (M-key 2280) 1 M.2 PCIe 5.0 x4 NVMe slot (M-key 22110) |
| Expansion Slots | Default* 1 PCIe 5.0 x16 (in x16) LP slot (*Requires additional parts, please see the optional parts list for details. For more details on PCIe slot configuration options, please refer to the system callout images above.) |
| Networking and I/O | |
| LAN | 4 RJ45 1 GbE LAN ports (Intel I350-AM4) 1 RJ45 1 GbE Dedicated BMC LAN port (ASPEED AST2600) |
| USB | 2 USB 3.2 Gen2 Type-A ports(Rear) 1 USB 3.2 Gen1 Type-A port(Rear) 1 USB 3.2 Gen1 Type-C port(Rear) |
| Video | 1 HDMI 2.1 port(Rear) 1 Mini-DP port(Rear) |
| TPM | 1 TPM header; 1 TPM onboard; Debug card/port 80 onboard |
| Onboard Devices | AMD B650 |
| Power, Cooling, and Management | |
| System Cooling | Fans: Up to 1 CPU heatsink with 70x70x15mm Fan(s) Up to 2x 4-PIN PWM 40x40x28mm Fan(s) |
| Power Supply | 1 x 180W power supply |
| System BIOS | BIOS Type: AMI 32MB UEFI BIOS Features: ACPI 6.5 SMBIOS 3.7 or later UEFI 2.9 |
| Management | SuperCloud Composer; Supermicro Server Manager (SSM); Super Diagnostics Offline (SDO); Supermicro Thin-Agent Service (TAS); SuperServer Automation Assistant (SAA) New!; Plug-ins for 3rd Party Software |
| PC Health Monitoring | FAN: Status monitor for speed control |
| Physical and Environmental | |
| Enclosure | 264.8 x 43 x 225.8mm (10.43″ x 1.69″ x 8.89″) |
| Package | 381 x 276 x 142mm (15″ x 10.87″ x 5.59″) |
| Weight | Gross Weight: 7.5 lbs (3.4 kg) Net Weight: 3.7 lbs (1.6 kg) |
| Available Color | Black |
| Operating Environment | Operating Temperature: 0°C to 40°C (32°F to 104°F) with 0.7 m/s airflow Non-operating Temperature: -40°C to 70°C (-40°F to 158°F) Operating Relative Humidity: 8% to 90% (non-condensing) Non-operating Relative Humidity: 5% to 95% (non-condensing) |
Supermicro AS-1116R-FN4
The AS-1116R-FN4 is a compact 1U rackmount system designed for installations where storage density and rack efficiency are priorities. It is geared toward branch offices and retail back-end consolidation, where organizations may want to consolidate multiple workloads into a smaller physical footprint.
The AS-1116R-FN4 takes a more rack-focused approach while keeping the footprint compact. It is also a Mini-1U system with a 249mm short-depth chassis and support for up to 192GB of DDR5-5600 memory. Storage is more flexible than in the smaller box system, with support for either two internal 2.5-inch NVMe bays or one internal 3.5-inch SATA bay, plus two M.2 PCIe 5.0 slots.
It also includes a low-profile PCIe 5.0 x16 expansion slot, four 1GbE LAN ports, a dedicated BMC management port, rear USB connectivity, HDMI 2.1, and Mini-DP. With a 200W Gold power supply, three counter-rotating fans, and remote management support, it is well-suited for branch, retail back-end, and other edge deployments that require server-class features in a very compact rackmount chassis.
Supermicro AS-1116R-FN4 Specifications
| Specification | AS-1116R-FN4 |
|---|---|
| Overview | |
| Model | IoT SuperServer AS-1116R-FN4 |
| System Type | H14 1U Ultra-short depth 249mm chassis with AMD EPYC 4005/4004 65W server |
| Key Applications | AI Inference and Machine Learning, Cloud Computing, Healthcare, Surveillance Security Server |
| Form Factor | Mini-1U |
| Chassis | CSE-505-203B |
| Motherboard | Super H14SRV-HLN4F |
| Processor and Memory | |
| Processor | Single Socket AM5 (LGA-1718) AMD EPYC 4005/4004 Series Processor 16C/32T; 64MB Cache |
| System Memory | Slot Count: 2 DIMM slots Max Memory (1DPC): 192GB 5600MT/s ECC/non-ECC DDR5 UDIMM |
| Storage and Expansion | |
| Drive Bays Configuration | Default: Total 2 bays, 2 internal fixed 2.5″ NVMe drive bays; Option A: Total 1 bay, 1 internal fixed 3.5″ SATA drive bay |
| M.2 | 1 M.2 PCIe 5.0 x4 slot (M-Key 2280) 1 M.2 PCIe 5.0 x4 NVMe slot (M-key 22110) |
| Expansion Slots | 1 PCIe 5.0 x16 (in x16) LP slot |
| Networking and I/O | |
| LAN | 4 RJ45 1 GbE LAN ports (Intel I350-AM4) 1 RJ45 1 GbE Dedicated BMC LAN port (ASPEED AST2600) |
| USB | 1 USB 3.2 Gen2 Type-C port(Rear) 2 USB 3.2 Gen2 Type-A ports(Rear) 1 USB 3.0 Gen2 Type-A port(Rear) |
| Video | 1 HDMI 2.1 port(Rear) 1 Mini-DP port(Rear) |
| TPM | 1 TPM header; 1 TPM onboard; Debug card/port 80 onboard |
| Onboard Devices | AMD B650 |
| Power, Cooling, and Management | |
| System Cooling | Fans: 3 counter-rotating 40x40x28mm Fan(s) Air Shroud: 1 Air Shroud |
| Power Supply | 1x 200W Gold Level (91%) power supply |
| System BIOS | BIOS Type: AMI 32MB UEFI BIOS Features: ACPI 6.5 SMBIOS 3.7 or later UEFI 2.9 |
| Management | SuperCloud Composer; Supermicro Server Manager (SSM); Super Diagnostics Offline (SDO); Supermicro Thin-Agent Service (TAS); SuperServer Automation Assistant (SAA) New!; Plug-ins for 3rd Party Software |
| PC Health Monitoring | FAN: Status monitor for speed control |
| Physical and Environmental | |
| Enclosure | 437 x 43 x 249mm (17.2″ x 1.7″ x 9.8″) |
| Package | 655 x 155 x 465mm (25.8″ x 6.1″ x 18.3″) |
| Weight | Gross Weight: 10 lbs (4.54 kg) |
| Available Color | Black |
| Operating Environment | Operating Temperature: 0°C to 40°C (32°F to 104°F) Non-operating Temperature: -40°C to 70°C (-40°F to 158°F) Operating Relative Humidity: 8% to 90% (non-condensing) Non-operating Relative Humidity: 5% to 95% (non-condensing) |
Supermicro AS-3015TR-i4
Lastly, the AS-3015TR-i4 is a slim tower system designed for quieter environments and for easier installation in edge locations without dedicated server rooms. The data sheet was unavailable at the time of writing; however, the tower can accommodate a dual-slot GPU measuring up to 2.7 inches high by 6.6 inches long, such as the NVIDIA RTX PRO 2000 Blackwell. The 9-liter chassis also includes options for a slim optical drive and a 3.5-inch disk drive, providing additional flexibility for edge deployments that still require local media or storage.
Supermicro AMD EPYC 4000 Systems
The post Supermicro Unveils Three New Edge AI Systems Built on AMD EPYC 4005 appeared first on StorageReview.com.
IBM pays $17M fine to end DOJ suit over DEI programs
Microsoft is working on yet another OpenClaw-like agent
Ubiquiti UniFi G6 Turret Review: 4K PoE Camera with On-Device AI for $199
Ubiquiti’s G6 Turret is a 4K PoE camera with a turret design, featuring on-device face and license plate recognition and full UniFi Protect integration, all at a $199 price point. The turret design sets it apart from traditional domes by placing the lens module in a ball-and-socket housing. You can physically adjust the module on three axes after mounting, giving installers direct control over framing without being locked into the bracket’s angle. For jobs involving a specific entry lane, a retail counter, or a tight corridor, this hands-on adjustability considerably speeds up installation.
Hardware Overview
The G6 Turret pairs a 1/1.8″ 8MP sensor with a quad-core processor and a Multi-TOPS AI Engine. In addition to local face and license plate recognition, this small camera offers 30-meter IR night vision and connects to UniFi Protect over standard PoE, without requiring PoE+.
The IK04 rating makes this camera better suited to controlled commercial spaces than high-exposure public areas. As a result, it belongs in offices, retail shops, or covered entrances, where frequent physical tampering isn’t expected, rather than on unmonitored street-side mounts.
| Specification | Ubiquiti UniFi G6 Turret |
|---|---|
| General | |
| Dimensions | ⌀100 × 95 mm (⌀3.9 × 3.7″) |
| Weight | 550 g (1.2 lb) |
| Enclosure Material | Aluminum alloy, polycarbonate |
| Weatherproofing | IP66 |
| Tamper Resistance | IK04 |
| Ambient Operating Temperature | -30 to 50°C (-22 to 122°F) |
| Ambient Operating Humidity | 0 to 90% noncondensing |
| Button | (1) Factory reset |
| Video | |
| Resolution | 4K 8MP 3864 × 2160 (16:9) |
| Max. Frame Rate | 30 FPS |
| Image Settings | Color, brightness, sharpness, contrast, white balance, exposure control, 2DNR, 3DNR, NR by motion, masking, text overlay, HDR |
| Optics | |
| Sensor | 1/1.8″ 8MP |
| Lens | Fixed focal length |
| Field of View | H: 109.9°, V: 56.7°, D: 134.1° |
| Night Mode | Built-in IR LED illumination and IR cut filter |
| IR Night Vision Range | 30 m (98 ft) |
| Intelligence | |
| Face Recognition | ✓ |
| License Plate Recognition | ✓ |
| Smart Detections | People, Vehicles, Animals |
| Audio | |
| Audio | Microphone |
| Hardware | |
| Processor | Quad-core ARM Cortex-A53-based chip |
| Power | |
| Power Method | PoE |
| Supported Voltage Range | 37 – 57V DC |
| Max. Power Consumption | 12.5W |
| Networking | |
| Networking Interface | 10/100 MbE RJ45 port |
| UniFi Application Suite | Protect |
| Cable | |
| Cable Connector Type | RJ45 |
| Cable Diameter | 4.5 mm (0.2″) |
| Cable Length | 30 cm (1 ft) |
| Jacket Material | Thermoplastic elastomer |
| Jacket Enclosure Dimensions | ⌀20 × 70.6 mm (0.8 × 2.8″) |
| Jacket Enclosure Material | Thermoplastic elastomer, polycarbonate, silicone rubber |
| Mounting | |
| Included Mounting | Ceiling, Wall |
| Optional Mounting | Arm, Pendant, Junction Box |
Design and Build
The turret form factor works differently from a dome. Rather than positioning a fixed lens behind a polycarbonate cover, the G6 Turret places its lens module in an exposed ball-and-socket housing that rotates freely until you tighten it down. Three-axis adjustment allows independent pan, tilt, and rotation, which is particularly useful on wall mounts, where a ceiling-only mount angle would otherwise require repositioning the entire bracket. Only a screwdriver is needed for adjustments, so framing the shot on-site is quick.
The camera measures ⌀100 × 95 mm and weighs 550 g (1.2 lb). Build quality is solid throughout, with an aluminum alloy and polycarbonate construction that matches the broader G6 lineup. The white finish blends cleanly against standard commercial ceilings, though the exposed ball joint makes this camera more visible than a low-profile dome. If a discreet install is a priority, a recessed dome is the better choice.
IP66 weatherproofing allows for outdoor use without a cover, so it handles car parks, entry canopies, and similar positions without issue. The IK04 rating covers standard commercial use cases but isn’t suited to high-impact or high-interference locations. The operating temperature range runs from -30 to 50°C (-22 to 122°F), so cold climates aren’t a concern either.
Optics and AI
The 1/1.8″ 8MP sensor records 4K at 30 FPS with a full image settings suite including HDR, 2DNR, 3DNR, masking, and text overlay. The field of view spans 109.9 degrees horizontal and 134.1 degrees diagonal, which is wide enough to cover most fixed camera positions without needing to zoom in on subjects. Built-in IR LED illumination handles night operation out to 30 meters (98 ft), and an IR cut filter switches automatically at dusk.
On-device AI runs via the quad-core Arm Cortex-A53 and the Multi-TOPS AI Engine. Face and license plate recognition are processed locally on the camera rather than waiting on the NVR, which keeps alert latency low and reduces load on the host recorder. Smart detection monitors people, vehicles, and animals and works with Protect’s configurable zone masking to deliver targeted alerts.
The fixed focal-length lens consistently covers the full field of view without barrel distortion, so identification accuracy remains high. Physical three-axis adjustment handles positioning, and once you tighten the ball joint, the framing holds reliably.
Management and Installation
The G6 Turret operates on standard PoE with a maximum power draw of 12.5W, which stays within the 15.4W limit of 802.3af, eliminating the need for PoE+ switching. Even so, you should account for the draw when budgeting power across a dense switch. The included 30 cm (1 ft) pigtail features an RJ45 connector with a thermoplastic elastomer jacket that seals the connection cleanly at the camera body. Protect detects the camera immediately upon first power-up and then guides you through setup.
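As a quick sanity check, here is a back-of-the-envelope calculation of how many G6 Turrets a single PoE switch can carry. The per-camera figure comes from the spec table above; the switch budget and headroom are assumed values for illustration only, so swap in your switch’s rated PoE budget for a deployment-specific number.

```python
# Rough PoE budget check: how many G6 Turrets fit on one switch?
# The per-camera draw (12.5 W) comes from the spec table above; the switch
# budget and headroom below are assumed example values, not measured figures.
CAMERA_MAX_W = 12.5          # max. power consumption per G6 Turret
SWITCH_POE_BUDGET_W = 400.0  # assumed total PoE budget of a 48-port switch
HEADROOM = 0.85              # keep ~15% in reserve for cable loss and spikes

usable_budget = SWITCH_POE_BUDGET_W * HEADROOM
max_cameras = int(usable_budget // CAMERA_MAX_W)
print(f"Usable budget: {usable_budget:.0f} W -> up to {max_cameras} cameras")
# -> Usable budget: 340 W -> up to 27 cameras
```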
Once you adopt the camera, the UniFi Protect dashboard provides centralized management for connection status, image tuning, and recording settings. Adjusting the framing is straightforward and takes about a minute: loosen the collar, rotate the lens, and retighten. That is considerably faster than repositioning a fixed dome’s backplate.
Before deploying the G6 Turret, a few practical details are worth noting. First, Protect lets you configure motion and smart detection masks independently, so you can exclude footpaths or busy roads from triggering alerts without turning off detection across the entire frame. Second, the G6 Turret has no MicroSD slot and therefore won’t record locally if the NVR connection drops. Finally, ceiling and wall mounts come in the box, and arm, pendant, and junction box mounts are available separately for non-standard implementations.
Conclusion
Overall, the G6 Turret delivers 4K at 30 FPS, on-device face and license plate recognition, and 30-meter IR, all for $199. The three-axis manual adjustment is a genuine practical advantage in the field, especially on wall mounts, where fixed cameras require more bracket work. Additionally, on-device AI processing via the Multi-TOPS engine keeps detection fast and reduces NVR load.
That said, the IK04 rating and the absence of local storage are worth confirming against your deployment requirements before you commit. For controlled commercial spaces, retail, and offices, those limitations rarely matter, and the G6 Turret is a well-specified camera that integrates cleanly into any UniFi Protect system.
Product Page – G6 Turret
The post Ubiquiti UniFi G6 Turret Review: 4K PoE Camera with On-Device AI for $199 appeared first on StorageReview.com.
Supermicro JumpStart Review: H14 with AMD Instinct MI350X
Supermicro’s JumpStart program has established itself as one of the more useful tools in the pre-purchase evaluation toolkit for AI infrastructure. Rather than a scripted demo in a shared environment, JumpStart gives qualified users free, time-boxed, bare-metal access to real production servers via SSH, IPMI, and VNC, enabling them to run workloads on actual hardware. We covered the program in depth last November using an X14 system with an NVIDIA HGX B200, and came away with a clear picture of what a week of focused access can and cannot tell you. This time, Supermicro provided access to an H14 8U system with a very different accelerator story.
We tested the AS-8126GS-TNMR system, an 8U air-cooled platform built around dual AMD EPYC 9575F processors and eight AMD Instinct MI350X GPUs. The MI350X is AMD’s current flagship data center accelerator, built on the 4th Gen CDNA architecture at TSMC’s 3nm node and featuring 288GB of HBM3e per GPU. Across eight GPUs interconnected via AMD Infinity Fabric, the server offers 2.3 TB of total GPU memory in a single node, with an aggregate bandwidth of 1,024 GB/s. The full system uses six 5,250W Titanium-level power supplies in a 3+3 redundant configuration, and Supermicro has provisioned dedicated 400 Gbps networking per GPU for scale-out deployments.
AMD’s position in the data center GPU market has shifted meaningfully in the past two years, and the MI350X generation represents a more serious competitive challenge to NVIDIA than any prior Instinct product. ROCm 7, released in September 2025 and now at version 7.2, brought native MI350X support alongside dramatically improved inference performance, HIP API updates that close the CUDA compatibility gap, and broadened framework support, including PyTorch, JAX, TensorFlow, ONNX Runtime, vLLM, and SGLang.
The vLLM project added a dedicated AMD ROCm CI pipeline in late December 2025, making AMD hardware a first-class platform in that inference stack rather than a downstream port. The ecosystem’s adoption is also hard to ignore: AMD and Meta announced a multi-year, multi-generation 6-gigawatt GPU deployment agreement in February 2026, building on Meta’s existing production deployments of MI300 and MI350 series hardware. That level of commitment from one of the world’s largest AI infrastructure operators is no marketing footnote.
For organizations currently evaluating AI accelerator infrastructure, the lead time for NVIDIA hardware remains a concern. The question is whether AMD is a credible alternative rather than a fallback. Based on a week of testing with ROCm 7.2.0 and the current vLLM, the answer is meaningfully different from what it was 18 months ago.
Our testing covered a selection of popular models; the 2.3TB of HBM3e across a single node enabled single-server inference on large-parameter models, including Moonshot’s Kimi K2.5 and MiniMax M2.5.
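For readers who want to reproduce a similar setup, the following is a minimal sketch of single-node serving with vLLM’s offline API, sharding a model across all eight GPUs via tensor parallelism. The model name and arguments are illustrative, not our exact benchmark configuration, and exact parameters may differ by vLLM version.

```python
# Minimal sketch of single-node serving across all eight MI350X GPUs with
# vLLM's offline API. The model is an example from this review; any checkpoint
# that fits within the node's 2.3TB of HBM3e can be substituted.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    tensor_parallel_size=8,                    # shard across all eight GPUs
    max_model_len=8192,                        # cover the 8k prefill-heavy profile
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the CDNA 4 architecture in one paragraph."], params)
print(outputs[0].outputs[0].text)
```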
AMD Instinct MI350X: Architecture and Generational Improvements
The MI350X represents AMD’s most architecturally ambitious generational leap in the Instinct product line to date. Understanding the engineering decisions behind it provides important context for interpreting the subsequent performance results.
CDNA 4 Architecture and Process Node Transition
The foundational shift from the MI300 series to the MI350 series centers on adopting TSMC’s N3P process node for the Accelerator Compute Chiplets (XCDs), moving from the 5nm fabrication used in the prior generation. The total transistor count reaches approximately 185 billion, a roughly 21% increase over the MI300 generation, achieved without a corresponding increase in power consumption.
The MI350X retains AMD’s proven multi-chiplet packaging strategy. At its core, the GPU package features eight Accelerator Compute Chiplets (XCDs) as the primary computational engines. Each XCD houses four shader engines, each with eight active CDNA 4 compute units, yielding 32 CUs per XCD and a total of 256 CUs for the full accelerator.
The I/O Die layer was also consolidated from four tiles to two in the CDNA 4 package design. This reorganization enabled AMD to double the Infinity Fabric bus width, improving bisection bandwidth while lowering the bus frequency and operating voltage to reduce power consumption.
Redesigned Compute Units and Expanded Precision Support
The CDNA 4 compute units bring a substantial boost in matrix math capability: MI350 CUs deliver a 2x increase in throughput per CU for 16-bit (BF16, FP16) and 8-bit (FP8, INT8) operations compared to their MI300 counterparts.
Beyond raw throughput gains, CDNA 4 introduces hardware support for lower-precision data types absent from the MI300 series, specifically FP6 and FP4, alongside the existing FP8 support carried forward from the prior generation.
In addition to these standard formats, the MI350X adds native hardware support for the OCP microscaling variants: MXFP4, MXFP6, and MXFP8. Microscaling formats are designed to deliver the throughput advantages of lower-precision compute while maintaining output quality closer to higher-precision baselines than standard quantization typically allows. This is not an AMD-specific development. NVIDIA’s NVFP4 format operates on the same microscaling principles and has seen broad adoption across frontier model deployments, with the GPT-OSS family from OpenAI as one of the most prominent examples built around these formats. The MI350X’s native MXFP4 support allows it to serve these and similar quantized model families without falling back to software emulation or precision promotion.
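To make the microscaling idea concrete, the snippet below is a simplified, illustrative sketch of MXFP4-style block quantization: 32 values share a single power-of-two scale, and each value is rounded to the nearest FP4 (E2M1) magnitude. This is not AMD’s or the OCP reference implementation, just a way to visualize the mechanism the hardware accelerates.

```python
# Simplified sketch of MX-style block quantization: a block of 32 values shares
# one power-of-two scale, and each value rounds to the nearest FP4 (E2M1)
# magnitude. Illustrative only; real MX formats add packing and edge handling.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block_mxfp4(block: np.ndarray):
    # Shared power-of-two scale chosen so the largest magnitude maps into [3, 6]
    max_abs = np.max(np.abs(block))
    scale = 2.0 ** np.ceil(np.log2(max_abs / FP4_GRID[-1])) if max_abs > 0 else 1.0
    scaled = block / scale
    # Round each scaled value to the nearest representable FP4 magnitude
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    quantized = np.sign(scaled) * FP4_GRID[idx]
    return quantized * scale, scale  # dequantized values and the shared scale

block = np.random.randn(32).astype(np.float32)
deq, scale = quantize_block_mxfp4(block)
print("shared scale:", scale, "max abs error:", np.max(np.abs(block - deq)))
```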
The MI350X delivers 9.2 PFLOPs at MXFP4 and MXFP6, compared with 4.6 PFLOPs at OCP-FP8, with FP16 at 2.3 PFLOPs and a peak engine clock of 2,200 MHz. For inference-optimized deployments where microscaling quantization is viable, the compute headroom effectively doubles relative to FP8 workloads. A new vector ALU has also been added to the CDNA 4 compute unit, supporting 2-bit operations and capable of accumulating BF16 results into FP32, providing additional flexibility for low-precision vector workloads outside the primary matrix compute path.
Memory Subsystem: HBM3e, Infinity Cache, and Bandwidth Efficiency
The MI350 series features a substantially upgraded memory subsystem with eight HBM3e memory stacks, providing a total capacity of 288GB per GPU. Each 36GB stack, composed of 12-high 24Gbit devices, operates at the full HBM3e pin speed of 8Gbps per pin. The architecture retains AMD’s Infinity Cache, a memory-side cache positioned between the HBM and the Infinity Fabric/L2 caches. It comprises 128 channels, each backed by 2 MB of cache, for a total of 256 MB per GPU. AMD has widened the on-die network buses within the IODs and operates them at a reduced voltage, enabling approximately 1.3x higher memory bandwidth per watt compared to the MI300 series.
The increase in memory capacity from the MI300X’s 192GB to 288GB extends AMD’s lead in per-GPU memory headroom, with direct implications for large-model inference. Each MI350X GPU can independently host models with more than 500 billion parameters. Across an eight-GPU server, the aggregate 2.3TB of HBM3e eliminates the multi-node distribution requirements that complicate trillion-parameter deployments, as the Kimi K2.5 and MiniMax M2.5 results in this review demonstrate.
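A rough back-of-the-envelope calculation (a sketch, not a sizing tool) illustrates the point: weights for a roughly 1-trillion-parameter model at ~4 bits per parameter land around 500GB, comfortably under the node’s 2.3TB of HBM3e, leaving headroom for KV cache and activations. The parameter counts and precisions below are illustrative.

```python
# Back-of-the-envelope weight footprints versus the eight-GPU MI350X node's
# aggregate HBM3e. Parameter counts and bit widths are illustrative examples.
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9  # decimal GB

NODE_HBM_GB = 8 * 288  # 2,304 GB across eight MI350X GPUs

for name, params_b, bits in [
    ("~1T-parameter model at INT4", 1000, 4),
    ("~1T-parameter model at BF16", 1000, 16),
    ("120B model at ~4-bit (MXFP4-class)", 120, 4),
]:
    gb = weight_footprint_gb(params_b, bits)
    print(f"{name}: ~{gb:,.0f} GB of weights ({gb / NODE_HBM_GB:.0%} of node HBM)")
```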
Flexible Partitioning and Deployment Architecture
The MI350 series supports flexible GPU partitioning per socket, with memory split into two separate clusters. This flexibility also applies to the XCDs, where the quad XCD cluster can be split into dual or single blocks, enabling the chip to support configurations such as 8 instances of 70B models in CPX+NPS2. For organizations running heterogeneous inference workloads across shared infrastructure, this partitioning capability reduces the need for dedicated hardware per model tier and improves utilization economics across mixed deployment environments.
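For reference, partition modes are normally toggled through the AMD-SMI tooling. The sketch below shows the general shape of that workflow, but the exact subcommands, flag names, and supported modes vary by ROCm release and platform, so treat it as an assumption and verify against `amd-smi set --help` on your system.

```python
# Hedged sketch of switching compute/memory partition modes with the amd-smi
# CLI, for configurations such as CPX compute partitioning with NPS-style
# memory partitioning. Flag names and supported modes are assumptions; confirm
# against your ROCm release before use.
import subprocess

def set_partitioning(gpu: int, compute_mode: str, memory_mode: str) -> None:
    # e.g. compute_mode="CPX" (one partition per XCD), memory_mode="NPS2"
    subprocess.run(["amd-smi", "set", "--gpu", str(gpu),
                    "--compute-partition", compute_mode], check=True)
    subprocess.run(["amd-smi", "set", "--gpu", str(gpu),
                    "--memory-partition", memory_mode], check=True)

if __name__ == "__main__":
    set_partitioning(gpu=0, compute_mode="CPX", memory_mode="NPS2")
```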
The MI350 series also maintains drop-in compatibility with the UBB (Universal Base Board) infrastructure used in MI300 Series systems. Existing server chassis, power delivery, and cooling infrastructure carry forward without modification, reducing upgrade friction for organizations with active MI300 deployments.
MI355X: The Liquid-Cooled Sibling
The MI350 series ships in two variants built on identical underlying silicon and optimized for different thermal operating envelopes. The MI350X tested here is the air-cooled variant, while the MI355X is its liquid-cooled counterpart, designed for higher-density deployments where direct liquid-cooling infrastructure is available.
While both variants are built on the same fundamental hardware, the MI355X’s higher operational power envelope enables higher sustained clock frequencies, resulting in an approximate 20% performance advantage in real-world, end-to-end workloads compared to the MI350X. The MI355X carries a TBP ceiling of 1,400W versus the MI350X’s 1,000W, with clock speed topping out at 2.4 GHz compared to 2.2 GHz on the air-cooled variant.
In generational terms, the MI355X platform delivers up to 4x peak theoretical performance improvement over the MI300X, with real-world inference gains of approximately 4.2x in agentic and chatbot workloads and about 3x in content generation scenarios. For organizations evaluating MI350X deployments, the 20% performance differential between the two variants represents a clear ceiling. Facilities with DLC infrastructure should evaluate the MI355X to determine whether the thermal investment yields sufficient throughput uplift for their specific workload profile before committing to air-cooled configurations at scale.
Accessing the AMD Instinct MI350X via Supermicro JumpStart Program
Getting started with JumpStart requires registering on Supermicro’s portal, where qualified users can browse available systems and schedule a reservation window. Once approved, the portal provides SSH credentials, IPMI access, and a web-based remote console for the duration of the booking. The system arrives preinstalled with Ubuntu and ready to use. There is no provisioning delay and no support interaction required to get started. Our reservation ran from March 23 through March 27, 2026, giving us a full week on the platform, consistent with our prior JumpStart engagement on the HGX B200.
The screenshot below shows our terminal output from JumpStart on the H14 system, with the AMD-SMI tool displaying the eight AMD Instinct MI350X GPUs and their running software versions.
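If you are starting a JumpStart session of your own, a quick inventory and telemetry pass with the AMD-SMI CLI is a sensible first step. The wrapper below is a convenience sketch around commands that ship with current ROCm installs (`amd-smi list` and `amd-smi monitor`), not a JumpStart-specific tool.

```python
# Convenience sketch: confirm all eight MI350X GPUs are visible and grab a
# quick telemetry snapshot before kicking off benchmarks. Uses the amd-smi CLI
# bundled with ROCm; adjust if your install exposes different subcommands.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(run(["amd-smi", "list"]))     # enumerate GPUs (BDF, identifiers)
print(run(["amd-smi", "monitor"]))  # one-shot utilization/power/temperature view
```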
AMD Instinct MI350X Performance Testing Results
System Configuration
- Chassis: Supermicro H14
- CPU: Dual AMD EPYC 9575F
- Memory: 3TB DDR5
- GPU: eight AMD Instinct MI350X
- Storage: 2x 3.8TB PCIe 4.0 M.2 NVMe SSDs and 1x 1.92TB M.2 NVMe SSD
Summary of Results
| Model | Precision | Equal (256/256) | Prefill-Heavy (8k/1k) | Decode-Heavy (1k/8k) |
|---|---|---|---|---|
| GPT-OSS 20B | NVFP4 | 62,247 | 123,714 | 32,468 |
| GPT-OSS 120B | NVFP4 | 33,538 | 84,018 | 20,602 |
| Llama 3.1 8B Instruct | BF16 | 51,467 | 77,658* | 19,326 |
| Mistral Small 3.1 24B | FP8 | 40,742 | 56,093 | 14,557 |
| Mistral Small 3.1 24B | BF16 | 30,530 | 53,740 | 13,559 |
| Qwen3 Coder 30B A3B | BF16 | 34,980 | 51,550 | 11,782 |
| Qwen3 Coder 30B A3B | FP8 | 25,928 | 47,179 | 11,014 |
| MiniMax M2.5 | Block-Scaled FP8 | 14,391 | 23,689 | 6,068 |
| Kimi K2.5 | INT4 QAT + BF16 | 6,527 | 11,256 | 2,513 |
All values in tok/s, peak throughput at BS=256. *Llama 3.1 8B prefill-heavy peaked at BS=128 (77,658 tok/s); BS=256 was 76,893 tok/s.
Claude Code Serving – MiniMax M2.5
Beyond traditional raw LLM inference benchmarks, we wanted to evaluate how well this hardware performs in an agentic coding workflow, specifically serving multiple concurrent Claude Code sessions with a locally hosted model. This use case maps directly to development team productivity: how many engineers can simultaneously use an AI coding assistant served from a single node before the experience degrades?
To test this, we built a benchmark harness that generates a dataset of moderately difficult coding problems (tasks like implementing an LRU cache, building a CLI todo application, writing a markdown converter, and constructing a REST API) and runs each Claude Code session in its own Docker container against the local vLLM server. A transparent proxy sits between the sessions and the inference endpoint, capturing per-request metrics for each Claude Code instance. The model used was MiniMax M2.5, served via vLLM on the eight MI350X GPUs. While not the top-ranked coding model on public leaderboards, M2.5 is a capable model that many users run locally, including many of our developer friends.
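The harness itself is straightforward in concept; the sketch below shows how per-session and aggregate output throughput can be derived from the proxy’s per-request records. The record fields and sample numbers are illustrative stand-ins of our own, not a published schema.

```python
# Simplified sketch of turning per-request records captured by a proxy into
# the two metrics reported below: average output tok/s per Claude Code session
# and aggregate output tok/s across all sessions. Field names are illustrative.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class RequestRecord:
    session_id: str        # one per Claude Code container
    output_tokens: int     # completion tokens returned by the server
    start_s: float         # request start (seconds, shared clock)
    end_s: float           # last streamed token received

def summarize(records: list[RequestRecord]):
    per_session = defaultdict(lambda: [0, 0.0])  # [tokens, active seconds]
    for r in records:
        per_session[r.session_id][0] += r.output_tokens
        per_session[r.session_id][1] += r.end_s - r.start_s

    session_rates = [tok / sec for tok, sec in per_session.values() if sec > 0]
    per_user = sum(session_rates) / len(session_rates)

    wall = max(r.end_s for r in records) - min(r.start_s for r in records)
    aggregate = sum(r.output_tokens for r in records) / wall
    return per_user, aggregate

records = [
    RequestRecord("dev-1", 1200, 0.0, 30.0),
    RequestRecord("dev-2", 1100, 1.0, 32.0),
]
per_user, aggregate = summarize(records)
print(f"per-user: {per_user:.1f} tok/s, aggregate: {aggregate:.1f} tok/s")
```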
For a baseline reference point, we use Anthropic’s Claude Opus 4.6 average output throughput via OpenRouter.ai, one of the most popular routing services for production API access. That baseline comes in at approximately 37 tokens per second per API request.
We measured two key metrics: the average output tokens per second per Claude Code session (what each developer experiences) and the aggregate output tokens per second across all sessions (the total work the server is producing).
Looking at the results, a single concurrent session delivers 38.8 tok/s per user and 38 tok/s aggregate, slightly above the OpenRouter cloud baseline. At two sessions, the system edges up to 39.5 tok/s per user as vLLM’s batching begins to amortize overhead, with aggregate throughput climbing to 63 tok/s. Four concurrent sessions hold at 37.3 tok/s per user, matching the cloud baseline while serving four developers simultaneously, with aggregate throughput reaching 128 tok/s. From eight sessions onward, per-instance throughput begins to decline: 34.6 tok/s per user at eight sessions, 31.4 tok/s at sixteen with an aggregate of 190 tok/s, and roughly 23 tok/s per user at 32 and 64 sessions, where aggregate throughput climbs to 578 tok/s and 986 tok/s, respectively. This is the classic batching-throughput-versus-interactivity trade-off: the system can achieve significantly higher total throughput by batching more requests, but each user experiences slower response times. Even at 64 concurrent users, each developer still sees a usable interactive experience, though noticeably slower than the cloud baseline.
For organizations weighing the cost of dozens of simultaneous commercial API subscriptions against self-hosted infrastructure, the tradeoff is clear: a single MI350X node can serve a development team of 16 to 32 engineers, maintaining per-user response speeds within 60-85% of the cloud baseline while delivering aggregate output of 600 to 1,000 tok/s, with added benefits of data locality, no per-token API charges, and full control over model selection.
vLLM Online Serving – LLM Inference Performance
vLLM is one of the most popular high-throughput inference and serving engines for LLMs. The vLLM online serving benchmark evaluates the real-world serving performance of this inference engine under concurrent requests. It simulates production workloads by sending requests to a running vLLM server, with configurable parameters such as request rate, input/output lengths, and the number of concurrent clients. The benchmark measures key metrics, including throughput (tokens per second), time to first token, and time per output token (TPOT), helping users understand how vLLM performs under different load conditions.
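The measurements themselves are simple to reason about. The sketch below shows one hedged way to capture TTFT and TPOT for a single streaming request against an OpenAI-compatible vLLM endpoint, which is conceptually what the benchmark does at scale; the endpoint URL and model name are placeholders rather than our exact configuration.

```python
# Minimal sketch (not the vLLM benchmark script itself) of measuring
# time-to-first-token (TTFT) and time-per-output-token (TPOT) for one streaming
# request against an OpenAI-compatible endpoint. TPOT here is the average gap
# between streamed chunks, which approximates per-token time when the server
# streams one token per chunk. URL and model name are placeholders.
import time
import httpx

def measure_request(prompt: str, max_tokens: int = 256):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.perf_counter()
    first_token_at = None
    chunk_times = []
    with httpx.stream("POST", "http://localhost:8000/v1/completions",
                      json=payload, timeout=None) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith("data:") or "[DONE]" in line:
                continue  # skip keep-alives and the terminal sentinel
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now
            chunk_times.append(now)
    ttft = first_token_at - start
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

ttft, tpot = measure_request("Explain paged attention in two sentences.")
print(f"TTFT: {ttft*1000:.1f} ms, TPOT: {tpot*1000:.1f} ms")
```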
We tested inference performance across a comprehensive suite of models spanning various architectures, parameter scales, and quantization strategies to evaluate throughput under different concurrency profiles.
GPT-OSS 120B and 20B
The GPT-OSS model family was tested in both 120B and 20B configurations on the Supermicro H14.
GPT-OSS 120B
The 120B model under an equal workload (256/256) delivers 313.42 tok/s at BS=1, reaches 11,261.72 tok/s at BS=64, and peaks at 33,538.23 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,724.84 tok/s, climbs to 36,156.80 tok/s at BS=32 and 79,247.76 tok/s at BS=128, peaking at 84,018.79 tok/s at BS=256. Decode-heavy (1k/8k) grows from 288.90 tok/s at BS=1 to 20,602.52 tok/s at BS=256, with latency remaining well-controlled at lower concurrency levels.
GPT-OSS 20B
The 20B model delivers 485.17 tok/s at BS=1 under the equal workload, reaching 17,986.36 tok/s at BS=64 and peaking at 62,247.52 tok/s at BS=256. Prefill-heavy starts at 3,120.72 tok/s, climbs to 48,132.52 tok/s at BS=32 and 83,968.71 tok/s at BS=64, peaking at 123,714.50 tok/s at BS=256—the highest absolute prefill throughput recorded across both model sizes. Decode-heavy grows from 378.20 tok/s at BS=1 to 32,468.67 tok/s at BS=256, delivering roughly 1.6× the decode throughput of the 120B at peak concurrency while maintaining tighter latency characteristics throughout.
Qwen3 Coder 30B A3B Instruct and FP8 Instruct
The Qwen3-Coder-30B-A3B-Instruct on the Supermicro H14 was tested at both standard (BF16) and FP8 precisions.
Qwen3-Coder-30B-A3B-Instruct (BF16)
At BF16, the equal workload (256/256) delivers 240.53 tok/s at BS=1, reaching 13,312.70 tok/s at BS=64 and 21,333.79 tok/s at BS=128, with peak throughput of 34,980.97 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,276.76 tok/s, climbs to 25,069.32 tok/s at BS=32 and 50,198.94 tok/s at BS=128, peaking at 51,550.66 tok/s at BS=256. Decode-heavy (1k/8k) grows steadily from approximately 188 tok/s at BS=1 to 11,782 tok/s at BS=256, maintaining the tightest latency profile of the three scenarios.
Qwen3-Coder-30B-A3B-Instruct (FP8)
The FP8 variant delivers 188.92 tok/s at BS=1 under the equal workload, reaching 10,866.27 tok/s at BS=64 and 17,617.60 tok/s at BS=128, peaking at 25,928.77 tok/s at BS=256—running slightly behind BF16 results across the full range. Prefill-heavy starts at 860.07 tok/s, climbs to 20,513.77 tok/s at BS=32 and 44,205.46 tok/s at BS=128, peaking at 47,179.15 tok/s at BS=256. Decode-heavy grows from 133.79 tok/s at BS=1 to 11,014.95 tok/s at BS=256, scaling consistently and remaining close to BF16 throughout.
Mistral Small 3.1 24B Instruct 2503
The Mistral-Small-3.1-24B-Instruct-2503 on the H14 was tested with both standard and FP8-dynamic precision, showing consistent scaling across all three workload profiles.
Mistral-Small-3.1-24B-Instruct-2503 (BF16)
With BF16 precision, the equal workload (256/256) delivers 236.15 tok/s at BS=1, reaching 15,494.56 tok/s at BS=64, 24,216.52 tok/s at BS=128, and peaking at 30,530.54 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,429.41 tok/s, climbs to 29,631.68 tok/s at BS=32, peaks at 54,871.74 tok/s at BS=128, and eases slightly to 53,740.04 tok/s at BS=256. Decode-heavy (1k/8k) grows from 242.66 tok/s at BS=1 to 13,559.19 tok/s at BS=256, scaling steadily across the full range.
Mistral-Small-3.1-24B-Instruct-2503 (FP8-dynamic)
The FP8-dynamic variant delivers 184.25 tok/s at BS=1 under the equal workload, reaching 16,113.95 tok/s at BS=64 and 26,409.01 tok/s at BS=128, peaking at 40,742.04 tok/s at BS=256. Prefill-heavy starts at 1,210.06 tok/s, climbs to 28,773.52 tok/s at BS=32, peaks at 57,765.02 tok/s at BS=128, and eases to 56,093.09 tok/s at BS=256, leading the standard-precision result from BS=64 onward. Decode-heavy grows from 183.94 tok/s at BS=1 to 14,557.94 tok/s at BS=256, tracking closely through the mid-range before pulling slightly ahead at BS=128 and BS=256.
Llama 3.1 8B Instruct
For Llama-3.1-8B-Instruct, the equal workload (256/256) delivers 373.26 tok/s at BS=1, reaching 19,363.33 tok/s at BS=64, 34,155.70 tok/s at BS=128, and peaking at 51,467.30 tok/s at BS=256. Prefill-heavy (8k/1k) starts at 1,959.04 tok/s, climbs to 37,227.63 tok/s at BS=32 and 60,062.40 tok/s at BS=64, peaking at 77,658.50 tok/s at BS=128 before tailing off slightly to 76,893.77 tok/s at BS=256. Decode-heavy (1k/8k) starts at 326.48 tok/s, reaching 17,877.52 tok/s at BS=128 and 19,326.35 tok/s at BS=256, maintaining lower per-token latency further into the concurrency range than any of the larger models tested.
MiniMax M2.5
The MiniMax-M2.5 on the H14 rounds out the model lineup, sitting between the Kimi K2.5 and the mid-sized models in terms of throughput profile, with characteristics that reflect its mixture-of-experts architecture. The equal workload (256/256) delivers 79.31 tok/s at BS=1, reaching 5,029.76 tok/s at BS=64, 7,801.10 tok/s at BS=128, and 14,391.98 tok/s at BS=256. Prefill-heavy (8k/1k) shows the strongest scaling of the three scenarios, starting at 424.41 tok/s and climbing to 10,376.75 tok/s at BS=32 and 20,658.57 tok/s at BS=128, peaking at 23,689.18 tok/s at BS=256. Decode-heavy (1k/8k) scales steadily to 4,257.68 tok/s at BS=128 and 6,068.70 tok/s at BS=256, offering the most consistent latency growth across the full concurrency range.
Kimi K2.5
The Kimi K2.5 1-trillion-parameter model on the H14 is the largest and smartest model tested in this review, and its throughput reflects that weight.
The equal workload (256/256) delivers 72.06 tok/s at BS=1, reaching 2,693.07 tok/s at BS=64, 4,244.27 tok/s at BS=128, and peaking at 6,527.62 tok/s at BS=256. Prefill-heavy (8k/1k) scales more aggressively, starting at 185.29 tok/s and reaching 3,798.85 tok/s at BS=32 and 9,153.12 tok/s at BS=128, with peak throughput of 11,256.69 tok/s at BS=256. The step increase from BS=128 to BS=256 carries a significant latency cost, indicating the system is approaching its memory and compute limits at full batch depth for this model size. Decode-heavy (1k/8k) grows from 29.88 tok/s at BS=1 to 2,513.85 tok/s at BS=256, delivering the tightest scaling curve of the three scenarios while demonstrating consistent throughput gains across the full range.
Conclusion
The AMD Instinct MI350X delivers very competitive inference performance across the workload profiles tested here, and the Supermicro AS-8126GS-TNMR provides a well-engineered platform to take advantage of it. With 288GB of HBM3e per accelerator and eight GPUs interconnected over Infinity Fabric, the 2.3TB of aggregate GPU memory available in a single node is sufficient to serve trillion-parameter models like Kimi K2.5 and MiniMax M2.5 without requiring multi-node distribution or model-partitioning workarounds. This capability materially simplifies deployment architecture for large-scale inference.
Smaller models also delivered strong results. Llama 3.1 8B exceeded 77,000 tok/s under prefill-heavy workloads, and mid-range architectures such as Mistral Small 3.1 24B and Qwen3 Coder 30B sustained high throughput with well-controlled latency across the concurrency range. Across the board, the results indicate a hardware platform that scales predictably under load rather than falling off a cliff at higher batch depths.
ROCm 7.2 brings significant improvements to the AMD inference software stack, particularly when paired with vLLM 0.18. This pairing delivers a noticeably more stable and higher-performing serving experience than prior ROCm generations, with broader framework support and fewer of the rough edges that characterized earlier Instinct deployments. The ecosystem momentum around AMD hardware is also worth noting: upstream vLLM now maintains a dedicated AMD ROCm CI pipeline, and Meta’s multi-generation deployment commitment at the 6-gigawatt scale reinforces that production validation extends well beyond controlled benchmarking environments.
The Claude Code serving evaluation adds a practical lens to the raw throughput numbers. A single MI350X node sustained near-cloud-baseline response speeds for up to 16 concurrent coding sessions and remained interactive with up to 64 simultaneous users while producing nearly 1,000 tok/s of aggregate output. For organizations weighing the cost of commercial API subscriptions against self-hosted infrastructure, the economics become straightforward at that density, with additional advantages in data locality, elimination of per-token costs, and unrestricted model selection.
Supermicro’s JumpStart program continues to earn its place in the infrastructure evaluation process. Bare-metal access to production hardware, with no provisioning overhead, allowed us to run real workloads under real-world conditions throughout the full test window. For teams conducting accelerator procurement evaluations, this level of hands-on access remains far more informative than spec sheet comparisons or curated vendor demonstrations.
Supermicro JumpStart Program
Product Page – GPU A+ Server AS -8126GS-TNMR
The post Supermicro JumpStart Review: H14 with AMD Instinct MI350X appeared first on StorageReview.com.