
Google has recently revealed TurboQuant, a compression algorithm that reduces the memory footprint of large language models (LLMs) without sacrificing performance.
- Google's TurboQuant cuts AI memory usage up to 6x and boosts performance up to 8x.
- It compresses key-value caches without requiring additional training or finetuning.
- The algorithm could enable powerful AI models to run on limited hardware, including mobile devices.

One of the biggest technical bottlenecks of current AI technology is memory usage. LLMs, such as those used in chatbots like Google Gemini, rely heavily on what’s called “key-value caches” to store intermediate data during processing.
Key-value caching is a technique that reduces the number of calculations when an AI model generates text by remembering important information from previous steps. Instead of recomputing everything from scratch, the model reuses what has already been calculated, making text generation much faster and more efficient.
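To make the idea concrete, here is a minimal sketch of key-value caching in plain Python. It is a toy illustration, not code from Google or any real model: the `project` function is a hypothetical stand-in for a model's learned key and value projections, and the two-dimensional token vectors are invented for the example.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Holds the key and value vector of every token processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def project(token_vec, factor):
    # Hypothetical stand-in for the learned key/value projections of a real model.
    return [factor * x for x in token_vec]

def decode_step(cache, token_vec):
    """Compute the new token's key/value once, then reuse everything already cached."""
    cache.append(project(token_vec, 0.5), project(token_vec, 2.0))
    return attend(token_vec, cache.keys, cache.values)

cache = KVCache()
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [decode_step(cache, t) for t in tokens]
```

Each call to `decode_step` adds exactly one key and one value to the cache instead of recomputing them for the whole sequence. The trade-off is that the cache grows with every token, which is the memory cost the article goes on to describe.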
However, key-value caches can quickly consume vast amounts of memory, limiting speed and scalability. Google’s TurboQuant addresses this issue by compressing memory structures more efficiently than existing methods.
According to Google, TurboQuant can reduce memory requirements by up to six times while delivering performance improvements of up to eight times.
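A back-of-the-envelope calculation gives a sense of what a sixfold reduction could mean. The model dimensions below are hypothetical and chosen only for illustration, and the 6x factor is the upper bound Google reports, not a guaranteed result.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Cache size for one sequence: a key and a value vector per token, head, and layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

GIB = 1024 ** 3

# Hypothetical model configuration, storing each cached number in 16 bits (2 bytes).
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          tokens=32_768, bytes_per_value=2)

# Apply the "up to 6x" reduction reported by Google as a best-case factor.
compressed = baseline / 6

print(f"16-bit cache: {baseline / GIB:.2f} GiB")
print(f"after 6x compression: {compressed / GIB:.2f} GiB")
```

Under these assumptions, a single 32,768-token conversation occupies 4 GiB of cache, which would shrink to roughly 0.67 GiB in the best case.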
Unlike many traditional compression techniques, TurboQuant doesn’t require additional training or finetuning, which makes it considerably easier to deploy across a wide range of applications. In addition, it could lower the cost of running AI systems and enable more powerful models to operate on limited hardware, including mobile devices.
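What "no additional training" means in practice can be shown with a generic example. The sketch below is plain uniform scalar quantization, a textbook technique, and is not TurboQuant's actual algorithm. It only illustrates how a vector can be compressed using nothing but its own values, with no training data or model updates. The sample vector and the 4-bit setting are assumptions made for the example.

```python
def quantize(vector, bits=4):
    """Map each float to an integer code in [0, 2**bits - 1] using the vector's own range."""
    levels = (1 << bits) - 1
    lo, hi = min(vector), max(vector)
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / step) for x in vector]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover approximate floats from the integer codes."""
    return [lo + c * step for c in codes]

# Hypothetical cached key vector.
key = [0.12, -0.87, 0.45, 0.03, -0.31, 0.99, -0.54, 0.20]

codes, lo, step = quantize(key, bits=4)
restored = dequantize(codes, lo, step)

# The reconstruction error of each value is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(key, restored))
```

Storing 4-bit codes instead of 16-bit floats would cut the payload by about four times, ignoring the small overhead of `lo` and `step`. The hard part, and the focus of methods such as TurboQuant, is keeping the resulting error low enough that model quality does not suffer.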
Google describes TurboQuant as more than a practical engineering solution, calling it a “fundamental algorithmic contribution” that also performs well in real-world applications.
“This rigorous foundation is what makes them robust and trustworthy for critical, large-scale systems,” the tech company adds.
TurboQuant not only addresses key-value cache bottlenecks in LLMs, but also applies to vector search, the technology that helps search systems match queries by intent and meaning rather than by exact keywords.
“Techniques like TurboQuant are critical for this mission. They allow for building and querying large vector indices with minimal memory, near-zero preprocessing time, and state-of-the-art accuracy. This makes semantic search at Google’s scale faster and more efficient. As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever,” Google concludes its blog post.
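The vector search use case mentioned in the quote can be sketched in a few lines. This is a toy brute-force search over 8-bit quantized vectors, not Google's system or TurboQuant itself. The document names and three-dimensional "embeddings" are invented for illustration, and the function names are hypothetical.

```python
def to_int8(vector):
    """Scale a float vector into integer codes in [-127, 127]."""
    peak = max(abs(x) for x in vector) or 1.0
    scale = peak / 127
    return [round(x / scale) for x in vector], scale

def build_index(documents):
    # No training step: every vector is compressed independently as it arrives.
    return {name: to_int8(vec) for name, vec in documents.items()}

def search(index, query, top_k=1):
    """Rank documents by approximate inner product with the query vector."""
    scored = []
    for name, (codes, scale) in index.items():
        score = scale * sum(c * q for c, q in zip(codes, query))
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical toy embeddings; real ones have hundreds or thousands of dimensions.
documents = {
    "cats": [0.9, 0.1, 0.0],
    "cars": [0.0, 0.2, 0.9],
    "dogs": [0.8, 0.3, 0.1],
}

index = build_index(documents)
result = search(index, [1.0, 0.0, 0.0], top_k=2)
```

Here the query vector points in the same direction as the two animal entries, so `result` is `["cats", "dogs"]` even though the index stores only small integer codes rather than the original floats.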