PC makers claim 45 TOPS but devs only achieve 1% of that – where is the promised performance?


New Windows AI PCs and tablets with Qualcomm chips boast a neural processing unit (NPU) capable of 45 TOPS (tera operations per second). That's a 45 followed by twelve zeros – a lot of operations. Yet, in real-world applications, developers cannot reproduce anything close to that number.

Developers from Useful Sensors could not find a benchmark that demonstrates how fast the new chips are, so they came up with their own. It does only one thing: runs simple operations and counts how fast the chip performs them.

Their benchmark couldn’t reproduce anything close to the boasted 45 TOPS.


“We see 1.3% of Qualcomm's NPU 45 Teraops/s claim when benchmarking Windows AI PCs”, they shared on GitHub after testing a Microsoft Surface Tablet with a Snapdragon X Elite chip.

That's only 0.573 trillion operations instead of 45 trillion each second.

The developers were astonished to find that the NPU was even slower in their benchmark than the CPU itself. The Snapdragon has 12 CPU cores running at 3.4 GHz, and those clocked 821.1 billion operations per second. Still, that's just 0.82 TOPS.
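As a sanity check on that CPU figure (our own arithmetic, not part of the benchmark), dividing the measured rate by the core count and clock speed shows how many operations each core would need to retire per cycle:

```python
# Back-of-the-envelope check: ops per cycle per core implied by 821.1 GOPS.
cores = 12          # Snapdragon X Elite CPU cores
clock_ghz = 3.4     # billions of cycles per second
measured_gops = 821.1

ops_per_cycle_per_core = measured_gops / (cores * clock_ghz)
print(f"{ops_per_cycle_per_core:.1f} ops/cycle/core")  # ~20, plausible with SIMD
```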

“This is not ideal for an accelerator, even though it could still potentially offer energy or sustained performance advantages that make it worth using,” they said.

They used Qualcomm's native backend on the ONNX (Open Neural Network Exchange) framework, attempted to follow best practices from the documentation, and even asked the community for ways to improve performance, since that would help them optimize their actual applications.

The devs also ran the same model on a powerful Nvidia GeForce RTX 4080 Laptop GPU, which clocked 2.16 trillion operations per second – almost four times more than the NPU – but still far from 45 TOPS.

Their code may not be the best possible, but it’s enough to demonstrate that the promised performance may be very hard to achieve.


So, what’s happening?

The most likely culprit is memory bandwidth. No consumer device can feed data at a rate high enough to satiate the hungry NPU.

It’s like trying to stream video on very low-speed internet – you spend most of the time waiting for the clip to buffer.

Snapdragon X Elite supports 136GB/s of memory bandwidth, which is impressive and among the highest of any chip in small consumer devices. But do you notice the letter ‘G’ instead of a ‘T’?

Forty-five trillion operations per second is an enormous number. If you wanted to transmit just a single bit – the smallest unit of information – for each operation, you would need a staggering 5.625TB/s (terabytes per second) of memory bandwidth.

This is far beyond what even modern L1 caches in consumer devices can achieve – most of them fall below 1TB/s.
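The arithmetic behind that 5.625TB/s figure is simple enough to verify:

```python
# Bandwidth needed to feed one bit of data per operation at 45 TOPS.
ops_per_second = 45e12  # 45 tera-operations per second
bits_per_op = 1         # a single bit per operation
bytes_per_second = ops_per_second * bits_per_op / 8
print(f"{bytes_per_second / 1e12:.3f} TB/s")  # -> 5.625 TB/s
```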

Computers use clever shortcuts to handle massive calculations efficiently. Instead of working on single numbers one at a time, NPUs load matrices of numbers and perform matrix multiplications in parallel. This way, the computational cost grows cubically with matrix size, while the data needed grows only quadratically.

For example (the short script after this list reproduces these numbers):

  • When multiplying two simple 2x2 matrices: the data needed is 8 numbers (4 for each matrix), and the operations needed – 12 in total (8 multiplications and 4 additions).
  • For multiplication of two 10x10 matrices: the data needed – 200 numbers (100 for each matrix), the operations needed – 1,900 (1,000 multiplications and 900 additions).
  • For two 1,000x1,000 matrices, you would need to load 2 million numbers (1 million for each) but perform nearly 2 billion operations (1 billion multiplications and 999 million additions).
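A few lines of Python confirm the arithmetic in the list above:

```python
# For two n-by-n matrices: 2*n^2 numbers loaded, n^3 multiplications,
# and n^2 * (n - 1) additions.
for n in (2, 10, 1000):
    data = 2 * n**2
    ops = n**3 + n**2 * (n - 1)
    print(f"n={n}: {data:,} numbers loaded, {ops:,} operations")
```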

Knowing that, the NPU benchmark developers ran their test with huge matrices of 8-bit (one-byte) numbers. They explain: “This benchmark is designed to resemble some real-world models we depend on. It runs 6 large matrix multiplications that are similar to the most time-consuming layers in transformer models.”
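A minimal sketch of the same idea – not Useful Sensors' actual code, and with an illustrative matrix size – times a single large 8-bit matrix multiplication and reports the effective rate:

```python
import time
import numpy as np

N = 2048  # illustrative size; the real benchmark uses its own dimensions
a = np.random.randint(-128, 128, (N, N), dtype=np.int8)
b = np.random.randint(-128, 128, (N, N), dtype=np.int8)

start = time.perf_counter()
c = a.astype(np.int32) @ b.astype(np.int32)  # accumulate in int32 to avoid overflow
elapsed = time.perf_counter() - start

ops = 2 * N**3  # ~N^3 multiplications plus ~N^3 additions
print(f"{ops / elapsed / 1e9:.1f} GOPS")
```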


However, even with matrices, there are limits. Large matrices quickly outgrow the small on-chip caches and require constant access to main memory (RAM). Bigger workloads, especially at higher precision, demand more system resources and eventually hit one of many possible bottlenecks.

Do manufacturers lie?

The benchmark results sparked a discussion on Hacker News, where some users appeared unhappy.

“I can't imagine a sports car advertising a 0-100km/h spec of 2 seconds where a user is unable to get below 5 seconds,” one user posted.

“The NPU may have even worse access to memory than the CPU, but the bottom line is that neither one of them has anything close to what it needs to actually deliver anything like the marketing headline performance number on any realistic workload,” another user said.

“I bet a lot of people have bought those things after seeing ‘45 TOPS’, thinking that they'd be able to usefully run transformers the size of main memory, and that's not happening on CPU or NPU.”

Some developers offered ways to measure higher performance than the 573 billion operations per second recorded by the Useful Sensors benchmark. One strategy is to use only a dataset that fits in the cache.

Programmer Dmitry Grinberg described one such approach: ‘multiplying the same 128x128 matrix from cache to cache.’

“That gets you perfect MAC (multiply-accumulate operation) utilization with no need to hit memory. Gets you a big number that is not directly a lie – that performance is attainable, on a useless synthetic benchmark,” the post reads.
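What such a synthetic, cache-resident benchmark might look like – a minimal sketch with illustrative sizes and iteration counts, not Grinberg's actual code:

```python
import time
import numpy as np

N = 128  # small enough that both operands stay resident in cache
m = np.random.rand(N, N).astype(np.float32)

iters = 20_000
start = time.perf_counter()
for _ in range(iters):
    _ = m @ m  # same operands every time, so no trips to main memory
elapsed = time.perf_counter() - start

ops = 2 * N**3 * iters  # multiply-accumulates counted as two operations each
print(f"{ops / elapsed / 1e12:.2f} TOPS on a synthetic, cache-bound workload")
```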

Would the NPUs achieve the advertised performance by meaninglessly flipping bits on their own? We don’t know yet. Many strategies exist to optimize memory-bound or compute-bound workloads, but the 45 TOPS claim still needs to be demonstrated.


However, for many real-world applications that require constant memory access – such as large language models – the performance doesn’t come close. The limited memory bandwidth simply cannot sustain the claimed TOPS numbers in many workloads.

“Memory bound workload is memory bound. Doesn’t matter how many TOPS you have if you’re sitting idle waiting on DRAM during generation,” another Hacker News user posted.

This is relevant not only for Microsoft but also for all manufacturers boasting large TOPS numbers. Cybernews has reported that memory size is also one of the limitations when it comes to AI deployment.

Cybernews reached out to both Microsoft and Qualcomm, asking how we could achieve and demonstrate the advertised performance. We’re still waiting for a response, which we’d gladly include in this article.

NPUs are still much faster

Qualcomm does not advertise how many operations per second (FLOPS) the new CPUs are capable of performing, or how those numbers compare to the NPU's TOPS.

However, other benchmarks reveal that NPUs still deliver much higher performance in real-world applications, leading to higher efficiency.

Geekbench introduced an AI benchmark that uses computer vision, natural language processing, and other workloads similar to real-world tasks. While it does not calculate operations per second, its results reveal how much faster an NPU can be.

The same Microsoft Surface Pro device with the same Snapdragon X chip demonstrates different results when tested on two different backends – one suited for the CPU and one for the NPU, as the chart below shows.


Here, the CPU and NPU are comparable when tested in single precision, which uses 32 bits to represent a single number. However, the hardware achieves much higher performance with lower-precision numbers, represented in 16 or 8 bits.

In half-precision (16-bit) calculations, the Surface Pro's NPU is more than six times faster than the CPU. The quantized (8-bit) scores deliver another performance improvement for both the CPU and the NPU, with the NPU 3.6 times faster than the CPU.

The performance gap between the CPU and NPU varies even more across different workloads.

Nowadays, most small AI models are quantized to an even narrower range of 4-bit numbers, still achieving decent accuracy while boosting performance further and saving memory and memory bandwidth.
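To illustrate why precision matters for memory, here is the weight storage for a hypothetical 7-billion-parameter model (weights only, ignoring activations and overhead – the model size is our assumption, not a figure from the benchmark):

```python
# Weight storage shrinks linearly with precision.
params = 7e9  # hypothetical 7B-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {params * bits / 8 / 1e9:.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```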


ONNX, on which the device was tested, is a machine learning framework with both a CPU backend running on the standard cores and a QNN (Qualcomm Neural Network) backend designed to utilize the NPU.
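For those who want to try it, a minimal sketch of how backend selection looks in ONNX Runtime; “model.onnx” is a placeholder path, and the QNN provider options follow the onnxruntime QNN documentation, where QnnHtp.dll targets the NPU:

```python
import onnxruntime as ort

# Run on the standard CPU cores.
cpu_session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Run on the NPU via the QNN execution provider (QnnHtp.dll targets the
# Hexagon NPU, per the onnxruntime QNN documentation).
npu_session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})],
)
```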