China tops the list of fastest supercomputers with a CPU-only behemoth, ending US champion El Capitan's reign — 2.198 exaflops of performance without a single GPU

China's LineShine supercomputer has taken the top spot on the 67th-edition TOP500 list, posting 2.198 exaflops on the High Performance Linpack benchmark and beating the AMD-powered El Capitan, now in second place, by more than 20%. The system, installed at the National Supercomputing Centre in Shenzhen (NSCS) and built by the Shenzhen Cloud Computing Center, uses no GPUs or accelerators of any kind. It reached that figure with 13,789,440 cores of domestically designed silicon, making it the first machine on the list to clear two exaflops of double-precision performance on CPUs alone. It's also the first China-based system to lead the TOP500 since Sunway TaihuLight in 2017.

The fact that a sanctioned country has managed to build an exascale flagship without a single Western accelerator is one thing, but what’s more telling is that China has decided to put it on the list. For years, its fastest machines have stayed off the rankings entirely, and the decision to submit a chart-topper now is a deliberate change of posture.

A domestic stack from core to OS

LineShine is built on what NSCS calls the LingKun platform. Each of its 20,480 compute nodes carries two LX2 processors, Armv9-based parts with 304 cores running at 1.55 GHz, organized as eight clusters of 38 cores. Every core includes Arm's Scalable Vector Extension and Scalable Matrix Extension units covering FP64, FP32, BF16, FP16, and INT8.

Each of those LX2s pairs 32 GB of on-package HBM rated at up to 4 TB/s with as much as 256 GB of off-package DDR5, an arrangement that’s closer to Fujitsu's A64FX in Japan's Fugaku than to a conventional server CPU. Nodes are tied together by the proprietary LingQi interconnect, and the machine runs the homegrown Kylin OS.
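
To put the node design in concrete terms, the short Python sketch below multiplies out the per-CPU figures quoted above. It assumes every socket is populated with the maximum 256 GB of DDR5 (the article gives that only as an upper bound) and treats the 4 TB/s HBM figure as a peak. The constant names are illustrative, not anything NSCS publishes.

```python
# Per-node totals multiplied out from the per-CPU LX2 figures quoted above.
# Assumption: every socket carries the maximum 256 GB of DDR5 ("as much as").

SOCKETS_PER_NODE = 2
CLUSTERS_PER_CPU = 8
CORES_PER_CLUSTER = 38
HBM_GB_PER_CPU = 32
HBM_TBPS_PER_CPU = 4.0       # "up to" bandwidth figure
DDR5_GB_PER_CPU_MAX = 256    # upper bound, assumed fully populated


def node_totals(sockets: int = SOCKETS_PER_NODE) -> dict:
    """Aggregate per-CPU resources into per-node totals."""
    cores_per_cpu = CLUSTERS_PER_CPU * CORES_PER_CLUSTER
    hbm_gb = sockets * HBM_GB_PER_CPU
    ddr5_gb = sockets * DDR5_GB_PER_CPU_MAX
    return {
        "cores_per_cpu": cores_per_cpu,
        "cores_per_node": sockets * cores_per_cpu,
        "hbm_gb": hbm_gb,
        "hbm_tbps": sockets * HBM_TBPS_PER_CPU,
        "max_memory_gb": hbm_gb + ddr5_gb,
        "hbm_share_of_capacity": hbm_gb / (hbm_gb + ddr5_gb),
    }


for key, value in node_totals().items():
    print(f"{key}: {value}")
```

On those assumptions, a node tops out at 608 cores, 64 GB of HBM with up to 8 TB/s of bandwidth, and 576 GB of total memory. The fast HBM tier is only about 11% of that capacity, which suggests that keeping hot working sets in HBM would matter a great deal for performance.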

It’s not known who designs the LX2 — NSCS names no vendor — but Jon Peddie Research has attributed the chip to Huawei, and the project's pilot phase reportedly ran on Huawei Kunpeng servers. The fabrication node and foundry are likewise unconfirmed. SMIC's 7nm-class process is the obvious domestic candidate by elimination, given that EUV tooling and TSMC capacity are both off the table, but nobody has documented the part to date.

Not an AI crown

LineShine also took first on HPCG, the test that rewards memory- and communication-bound workloads closer to real scientific code, at 22.00 petaflops. But on HPL-MxP, the mixed-precision benchmark that approximates AI training math, it came in only fourth at 7.92 exaflops, a 3.6 times uplift over its FP64 score.

In other words, the accelerator-based machines it beat on Linpack pull far ahead the moment precision drops. Per the TOP500 announcement, El Capitan posts 16.7 exaflops on HPL-MxP, a 9.2 times jump over its standard result, with Aurora and Frontier showing similar multipliers. Reduced-precision throughput is exactly where GPUs and APUs separate from CPUs, and LineShine has no accelerators to close that gap.
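
The multipliers above are simple ratios of each benchmark result to the FP64 Linpack score. A minimal Python check using only the figures in this article follows; Aurora's and Frontier's HPL-MxP scores aren't quoted here, so they're left out.

```python
# Benchmark results expressed relative to each system's FP64 Linpack (HPL)
# score, using only the figures quoted in this article (units: exaflops).

HPL = {"LineShine": 2.198, "El Capitan": 1.809}
MXP = {"LineShine": 7.92, "El Capitan": 16.7}
LINESHINE_HPCG = 22.00 / 1000  # 22.00 petaflops converted to exaflops


def ratio(result: float, hpl: float) -> float:
    """Express a benchmark result as a multiple of the HPL score."""
    return result / hpl


for name in HPL:
    print(f"{name}: HPL-MxP = {ratio(MXP[name], HPL[name]):.1f}x HPL")

print(f"El Capitan's HPL-MxP lead over LineShine: "
      f"{MXP['El Capitan'] / MXP['LineShine']:.1f}x")
print(f"LineShine: HPCG = {ratio(LINESHINE_HPCG, HPL['LineShine']):.1%} of HPL")
```

The ratios reproduce the 3.6 and 9.2 times figures, put El Capitan roughly 2.1 times ahead of LineShine at mixed precision, and show LineShine's HPCG result at about 1% of its Linpack score, a reminder of how much less throughput memory- and communication-bound code extracts than dense Linpack does.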

Power tells a similar story. LineShine draws 42,220 kW and returns 52.07 gigaflops per watt on its Linpack run. That beats Intel's Aurora comfortably but trails El Capitan's 60.94 gigaflops per watt, so LineShine produces more total FP64 output than the Livermore system while burning roughly 42% more power to do it.
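
The 42% figure can be reproduced from the published numbers. El Capitan's power draw isn't quoted here, but it follows from its Linpack score divided by its efficiency. A quick Python sketch, assuming both efficiency figures refer to the same Linpack runs as the scores:

```python
# Cross-check of the power and efficiency figures quoted above.
# Efficiency (GFLOPS/W) = HPL FLOP/s / power in watts / 1e9.

LINESHINE_HPL_EF = 2.198
LINESHINE_POWER_KW = 42_220
EL_CAPITAN_HPL_EF = 1.809
EL_CAPITAN_GFLOPS_PER_W = 60.94


def gflops_per_watt(exaflops: float, power_kw: float) -> float:
    """Linpack efficiency from a score in exaflops and a power draw in kW."""
    return exaflops * 1e18 / (power_kw * 1e3) / 1e9


def implied_power_mw(exaflops: float, gflops_per_w: float) -> float:
    """Power draw (MW) implied by a Linpack score and an efficiency figure."""
    return exaflops * 1e18 / (gflops_per_w * 1e9) / 1e6


ls_eff = gflops_per_watt(LINESHINE_HPL_EF, LINESHINE_POWER_KW)
elc_mw = implied_power_mw(EL_CAPITAN_HPL_EF, EL_CAPITAN_GFLOPS_PER_W)
extra_power = LINESHINE_POWER_KW / 1e3 / elc_mw - 1

print(f"LineShine efficiency: {ls_eff:.2f} GFLOPS/W")
print(f"El Capitan implied power: {elc_mw:.1f} MW")
print(f"LineShine extra power: {extra_power:.0%}")
```

This yields about 52.06 GFLOPS/W for LineShine (the published 52.07, within rounding of the inputs), an implied 29.7 MW for El Capitan, and roughly 42% more power for LineShine.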

It’s worth holding onto this distinction because the TOP500 ranking is decided on FP64 Linpack, the one regime where a wide, HBM-fed CPU can still go toe-to-toe with accelerators. LineShine is a genuine double-precision champion, but it’s not a world-leading AI training machine, and its fourth-place HPL-MxP result says so.

So, why did China submit it?

China stopped submitting its fastest systems to the TOP500 around 2021, after a run of entity-list additions hit Sunway's Wuxi center and Sugon. The community has long believed that the country operated exascale hardware well before this entry: the Sunway successor OceanLight and the NUDT-built Tianhe-3 both surfaced in Gordon Bell Prize science papers without ever appearing on the list. TOP500 co-founder Jack Dongarra has said for years that Chinese researchers told him they weren't permitted to submit, and that the omissions were about avoiding U.S. attention rather than any lack of capability.

Chinese HPC's absence from last June's list, which the AMD-powered El Capitan topped, was especially conspicuous, and putting LineShine forward now reverses that. It has been reported that the system was developed without public funding, which lowers the political exposure of disclosing it, and the all-domestic design means there's no dependency on Western parts for Washington to choke off after the fact.

Addison Snell, chief executive of HPC analyst firm Intersect360 Research, told Reuters that the performance itself didn't surprise him; what did was that China submitted the result and wanted recognition for it. Ultimately, submitting a number-one system that runs entirely on indigenous parts is a statement that the sanctions regime hasn't closed the gap China cares about.

AMD still dominates

The top of the list might have changed hands, but the bulk of it hasn't. The U.S. still holds three of the top five spots with El Capitan (1.809 exaflops), Frontier (1.353 exaflops), and Aurora (1.012 exaflops), and Germany's JUPITER Booster remains the first and only European exascale system at an even 1.000 exaflops.
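
Laid side by side, the top five Linpack scores quoted in this piece show how far ahead LineShine sits. A small Python sketch that computes the leader's margin over each system, using only the exaflops figures given here:

```python
# Top five FP64 Linpack (HPL) scores in exaflops, as quoted in this article.
TOP_FIVE = [
    ("LineShine", 2.198),
    ("El Capitan", 1.809),
    ("Frontier", 1.353),
    ("Aurora", 1.012),
    ("JUPITER Booster", 1.000),
]


def lead_over(leader: float, other: float) -> float:
    """Fractional margin by which `leader` exceeds `other`."""
    return leader / other - 1


_, leader_score = TOP_FIVE[0]
for rank, (name, score) in enumerate(TOP_FIVE, start=1):
    margin = lead_over(leader_score, score)
    print(f"{rank}. {name:<16} {score:.3f} EF  (leader ahead by {margin:.1%})")
```

LineShine's 21.5% margin over El Capitan is the "more than 20%" lead cited at the top of the article; against the remaining three systems it ranges from about 62% to nearly 120%.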

AMD's silicon underpins much of the accelerated field. Per the company's own blog, it now powers 191 systems on the list, up 11% year over year, and 41% of this edition's new entries. It holds three top-10 slots (El Capitan, Frontier, and the newly deployed HPC7 at Italian energy firm Eni) and contributes more than 40% of combined top-10 Linpack performance. On efficiency, it powers 56% of the top 50 Green500 systems, and its first Instinct MI355X deployments, two Cambridge Zenith systems in the UK, entered at positions 67 and 68.

None of that is dented by LineShine, not least because the two aren't competing for the same workload. AMD's MI300A and MI355X parts are built for mixed-precision AI arithmetic, where LineShine places fourth, and that, rather than FP64 leaderboard positions, is what Western labs are increasingly optimizing for.

El Capitan, Frontier, and Aurora all post HPL-MxP scores several times their Linpack results, enabled by hardware that LineShine doesn't have. So while it's true the TOP500 crown moved to Shenzhen, it did so on a benchmark that Western labs are no longer chasing with their fastest machines.

China's LineShine supercomputer dethrones US' El Capitan, secures first place in Top 500 list — first machine in the rankings to sustain more than 2 ExaFLOPS of double-precision performance using only CPUs

China's LineShine supercomputer has dethroned El Capitan as the world's number one supercomputer, going straight to the top of the charts after the National Supercomputing Centre in Shenzhen (NSCS) submitted its results.

LineShine hit 2.198 FP64 ExaFLOPS in the Linpack benchmark, becoming the first machine on the Top 500 list to sustain more than 2 ExaFLOPS of double-precision performance using only CPUs. The system is deployed at NSCS and was built by the Shenzhen Cloud Computing Center using semi-custom 304-core LX2 processors based on the Armv9 instruction set architecture and running at 1.55 GHz. The machine employs 13.79 million cores in total, uses the proprietary LingQi interconnect, and consumes 42.2 MW of power.

From a performance-per-watt point of view, LineShine delivers 52.07 GFLOPS/W, below El Capitan's 60.94 GFLOPS/W. However, it comfortably outperforms Fugaku, another CPU-only supercomputer that was the No. 1 HPC system several years ago and that delivers only 14.78 to 16.84 GFLOPS/W, depending on whether its efficiency is optimized.
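
Expressed as ratios, the efficiency figures above place LineShine between El Capitan and Fugaku. A short Python calculation using only the GFLOPS/W values quoted in this paragraph:

```python
# Efficiency ratios from the GFLOPS/W figures quoted above.
LINESHINE = 52.07
EL_CAPITAN = 60.94
FUGAKU_RANGE = (14.78, 16.84)  # range depends on whether efficiency is optimized


def relative_efficiency(system: float, reference: float) -> float:
    """How many times more FLOPs per watt `system` delivers than `reference`."""
    return system / reference


for fugaku in FUGAKU_RANGE:
    print(f"vs Fugaku at {fugaku} GFLOPS/W: "
          f"{relative_efficiency(LINESHINE, fugaku):.1f}x")
print(f"vs El Capitan: {relative_efficiency(LINESHINE, EL_CAPITAN):.0%}")
```

LineShine comes out at roughly 3.1 to 3.5 times Fugaku's efficiency and at about 85% of El Capitan's.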

LineShine also moved to the top of the HPCG ranking with 22.00 HPCG-PFLOPS. In HPL-MxP, however, it achieved 7.92 mixed-precision EFLOPS, which puts it behind El Capitan, Frontier, and Aurora. That limits LineShine's appeal for AI training and inference, a trade-off that its exceptional performance on traditional supercomputing tasks can justify.

Each LX2 CPU relies on two compute chiplets and has a total of 304 CPU cores organized into eight clusters of 38 cores each. Every core includes Arm SVE (Scalable Vector Extension) and SME (Scalable Matrix Extension) units, which accelerate the vector and matrix operations used in AI training and scientific computing and support the FP64, FP32, BF16, FP16, and INT8 data formats. The chip features a rather unusual memory architecture that pairs 32 GB of on-package HBM, offering up to 4 TB/s of bandwidth, with as much as 256 GB of external DDR5 memory to maximize both bandwidth and capacity.

Despite this, the processor gains only 3.6X performance when moving from FP64 to mixed-precision data, far less than the uplift of systems that integrate low-precision accelerators, such as AMD's Instinct MI300A or Intel's Ponte Vecchio. While an Armv9 CPU with SVE/SME can accelerate FP16/BF16/INT8 workloads, its mixed-precision uplift remains limited compared to accelerator-equipped systems for several reasons, including memory bandwidth, software maturity, and interconnect efficiency. That said, it may be too early to draw final conclusions about the LX2 and its usability for mixed-precision workloads.

In any case, the very fact that a Chinese supercomputer has achieved extraordinary FP64 performance is remarkable. Furthermore, NSCS's decision to submit results to the Top 500 suggests that the organization is confident LineShine relies exclusively on domestic technologies whose production the U.S. government cannot affect.
