Huawei challenges Nvidia’s dominance with its new AI supercomputer


Researchers from Huawei and Chinese infrastructure startup SiliconFlow claim that Huawei’s new supercomputer architecture CloudMatrix delivers more performance at a higher efficiency compared to Nvidia’s H100 and H800 systems.

Huawei introduced its new CloudMatrix 384 AI system earlier this year. It integrates 384 Ascend 910C neural processing units (NPUs), 192 Kunpeng CPUs based on the Arm architecture, and other hardware components into a unified supernode, interconnected via an ultra-high-bandwidth, low-latency Unified Bus (UB) network.

The researchers now claim that the architecture has enabled CloudMatrix 384 to surpass the performance of Nvidia H800 GPUs, with each NPU outputting thousands of tokens per second.


“Our extensive evaluations with the DeepSeek-R1 model demonstrate that CloudMatrix-Infer achieves remarkable throughput, delivering 6,688 tokens/s per NPU in the prefill stage and 1,943 tokens/s per NPU during decoding, while consistently maintaining a low latency below 50 ms per output token,” the technical paper reads.

“These results correspond to compute efficiencies of 4.45 tokens/s/TFLOPS for prefill and 1.29 tokens/s/TFLOPS for decode, both of which surpass the published efficiencies of leading frameworks like SGLang on NVIDIA H100 and DeepSeek on H800.”
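A quick back-of-envelope check shows what the quoted decode figures imply. If each NPU sustains 1,943 tokens/s while keeping per-token latency under 50 ms, Little's law gives the number of sequences each NPU must be serving concurrently. This sketch uses only the numbers quoted from the paper; the concurrency figure itself is our inference, not a number the paper states.

```python
# Back-of-envelope: per-NPU decode concurrency implied by the paper's figures.
# Little's law: concurrency = throughput * latency.

decode_throughput = 1943   # tokens/s per NPU during decoding (quoted from the paper)
token_latency = 0.050      # seconds per output token (the paper's upper bound)

implied_concurrency = decode_throughput * token_latency
print(f"Implied concurrent sequences per NPU: ~{implied_concurrency:.0f}")
```

In other words, hitting both numbers at once requires batching on the order of a hundred sequences per NPU, which is consistent with the paper's emphasis on high parallelism.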

What did Huawei do differently to gain these performance numbers from its chips? A key innovation is a super-fast internal network that allows all the NPUs to communicate with each other directly, removing bottlenecks that slow down current systems.

The paper explains that the architecture, unlike conventional hierarchical designs, enables “direct all-to-all communication via Unified Bus, allowing compute, memory, and network resources to be dynamically pooled, uniformly accessed, and independently scaled.”

This architecture enables very high parallelism, which benefits large-scale “mixture of experts” models such as DeepSeek-R1.

However, Huawei did not compare its system to Nvidia’s latest chips.

While Huawei expects to “reshape the foundation of AI infrastructure” with CloudMatrix, Tom’s Hardware previously noted that Chinese companies cannot access Nvidia’s best technologies anyway.


A single Huawei Ascend 910C processor is dwarfed by the current leader, Nvidia's B200 chip, which delivers over three times the compute power (2,500 TFLOPS vs. 780 TFLOPS) and more than twice the memory bandwidth (8 TB/s vs. 3.2 TB/s).

However, the whole CloudMatrix system packs over five times as many chips (384) as Nvidia's GB200 NVL72 exascale computer, which fits 72 GPUs in a single rack. The Chinese system spans four racks and draws 559 kW of electrical power, roughly four times more than Nvidia's.

Yet, the complete CloudMatrix 384 system, with 300 petaFLOPS of compute power and almost 50 terabytes of high-speed HBM memory, is, on paper, more performant in every metric than Nvidia’s GB200 NVL72.
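The system-level figures follow directly from the per-chip numbers. A minimal sanity check, assuming 128 GB of HBM per Ascend 910C (a figure not stated in this article), reproduces the article's ~300 petaFLOPS and "almost 50 terabytes" claims:

```python
# Sanity-check the article's aggregate figures from per-chip numbers.
npus = 384
tflops_per_npu = 780      # per-chip compute, from the article
hbm_per_npu_gb = 128      # assumed HBM capacity per 910C (not stated in the article)

total_pflops = npus * tflops_per_npu / 1000   # TFLOPS -> petaFLOPS
total_hbm_tb = npus * hbm_per_npu_gb / 1000   # GB -> TB

print(f"Total compute: {total_pflops:.1f} PFLOPS")  # ~300 petaFLOPS
print(f"Total HBM:     {total_hbm_tb:.1f} TB")      # almost 50 TB
```

Huawei's strategy, in short, is to compensate for weaker individual chips with scale and interconnect bandwidth.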

As Huawei tries to overcome US sanctions, the new design, which connects hundreds of powerful processors to work together seamlessly, is intended to meet the ever-growing demand for computing power and allow semiconductor-restricted China to compete in the AI race.