Researchers create an algorithm that slashes AI energy use by 95%


Multiplying tensors is so 2023 – why not add integers instead? Researchers from BitEnergy AI propose a method that could cut the energy cost of running large language models (LLMs) by up to 95%. Established graphics processing unit (GPU) vendors may not be happy.

Current AI models rely on floating-point tensor multiplications, usually performed on GPU chips, and these operations are very compute-intensive. However, the researchers show that a new, cost-effective algorithm can match or even exceed the precision of 8-bit floating-point calculations.

Storing neural network weights in 8-bit format is not considered state-of-the-art, as 32-bit or 16-bit numbers offer higher precision. However, for storage and computation efficiency, AI models are often ‘downscaled’ to 8-bit or even smaller 4-bit representations in a process called quantization.
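
For readers unfamiliar with quantization, the sketch below shows plain symmetric int8 quantization of a toy weight matrix in Python. The helper functions, names, and values are illustrative assumptions for this article, not code from the paper.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric 8-bit quantization: map float weights to int8 plus one scale factor."""
    scale = float(np.abs(weights).max()) / 127.0   # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Toy weight matrix standing in for a real layer's weights.
weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print("max reconstruction error:", np.abs(weights - dequantize(q, scale)).max())
```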

That's where the new algorithm comes into play.

“Our method requires less computation but achieves higher accuracy,” BitEnergy AI researchers say in a paper.

They argue that the new algorithm requires significantly fewer computational resources than 8-bit floating-point multiplication while achieving higher precision.

How did the researchers achieve that? By replacing costly floating-point multiplication with integer addition. Multiplying two 32-bit floating-point numbers costs roughly 37 times more energy than adding two 32-bit integers.

“We find that a floating point multiplier can be approximated by one integer adder with high precision.”

This way, the proposed linear-complexity multiplication (L-Mul) method “can potentially reduce 95% energy cost by element-wise floating point tensor multiplications and 80% energy cost of dot products.”
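
The trick behind this can be illustrated with a toy sketch. An IEEE-754 float packs a sign, an exponent, and a mantissa into one integer, so adding the bit patterns of two positive floats (and subtracting the bit pattern of 1.0 to cancel the doubled exponent bias) sums the exponents exactly and the mantissas approximately, turning a multiplication into a single integer addition. The Python snippet below demonstrates that general idea; it is a simplified illustration, not the paper's L-Mul kernel, which refines the mantissa handling with a small correction term.

```python
import struct

def f32_bits(x: float) -> int:
    """Reinterpret a 32-bit float's IEEE-754 bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Reinterpret an unsigned 32-bit integer as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

ONE = 0x3F800000  # bit pattern of 1.0f; subtracting it cancels the doubled exponent bias

def approx_mul(x: float, y: float) -> float:
    """Approximate x * y for positive floats with a single integer addition.

    Adding the two bit patterns sums the exponents exactly and the mantissas
    approximately (the mantissa cross term is dropped) -- the classic
    logarithmic-multiplication idea this article describes, not L-Mul itself.
    """
    return bits_f32(f32_bits(x) + f32_bits(y) - ONE)

for a, b in [(1.5, 2.25), (3.1415, 0.5), (10.0, 0.125)]:
    print(f"{a} * {b}: exact = {a * b:.4f}, integer-add approx = {approx_mul(a, b):.4f}")
```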

In early 2023, the ChatGPT service reportedly consumed 564 MWh of electricity per day, equivalent to the total daily electricity usage of 18,000 families in the US.

The researchers ran benchmarks on several current small open-source LLMs to demonstrate that the method loses almost no precision while significantly reducing energy consumption.

The paper explains that moving tensors between on-die memory and high-bandwidth memory (HBM) is the main time and energy bottleneck on regular GPUs. Reducing I/O operations in transformer models and making the best use of the HBM can therefore significantly improve the efficiency of AI training and inference.

Reimagining the way calculations are made may require hardware optimizations to unlock L-Mul's full potential. The researchers are already working on that.

“To unlock the full potential of our proposed method, we will implement the L-Mul and L-Matmul kernel algorithms on the hardware level and develop programming APIs for high-level model design. Furthermore, we will train textual, symbolic, and multi-modal generative AI models optimized for deployment on L-Mul native hardware.”

That may affect the booming GPU market, which is dominated by a few select players.