Google Research has introduced TurboQuant, a new algorithm that compresses the key-value cache used by large language models without sacrificing accuracy. By reducing memory usage by up to 6×, TurboQuant could enable dramatically longer context windows and cheaper inference.
In the continuous pursuit of more capable Large Language Models, one persistent obstacle has always loomed large: memory. As models like Llama and Mistral push context windows into the hundreds of thousands of tokens, the computational machinery powering them hits a hard limit. At the heart of this challenge lies the key-value (KV) cache, a critical data structure that stores intermediate attention computations to enable fast, autoregressive generation. Without it, generating each new token would require recomputing attention over the entire conversation history, making long-context inference prohibitively slow.
Google Research has recently introduced TurboQuant, a new compression algorithm that reduces the memory needed for the KV cache by at least 6× on average while computing attention scores up to 8× faster on NVIDIA H100 GPUs. Most importantly, it does all this with zero measurable loss in accuracy, and with no retraining or fine-tuning required. The technique is set to be presented at ICLR 2026.
This development is significant because it directly tackles one of the main bottlenecks in running modern LLMs, especially for long contexts.
Understanding the KV Cache Problem
In transformer models, the attention mechanism compares the current token (query) with all previous tokens (keys) to decide what information to focus on. The corresponding values are then used to build the output.
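The mechanism above can be sketched in a few lines of NumPy (a single head and a single query step; the dimensions are illustrative, not taken from any particular model):

```python
import numpy as np

def attention(q, K, V):
    """One attention step for a single query over cached keys/values (one head)."""
    scores = K @ q / np.sqrt(q.shape[0])   # compare the query with every past key
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax attention weights
    return w @ V                           # weighted mix of the cached values

rng = np.random.default_rng(0)
d, t = 64, 10                      # head dimension, tokens generated so far
K = rng.standard_normal((t, d))    # cached keys: one row per past token
V = rng.standard_normal((t, d))    # cached values
q = rng.standard_normal(d)         # query for the current token

out = attention(q, K, V)           # attention output for this token
```

Note that `K` and `V` grow by one row for every generated token; those two ever-growing arrays are precisely the cache whose size becomes the problem.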
To make this efficient, the model stores the keys and values of past tokens in the KV cache. For a model with a hidden size of 4096 and long sequences (say, 128,000 tokens), this cache can easily take tens of gigabytes of memory in full precision (FP16 or FP32).
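A quick back-of-the-envelope calculation makes the scale concrete. The hidden size and sequence length follow the figures above; the 32-layer count and FP16 width are illustrative assumptions:

```python
def kv_cache_bytes(seq_len, n_layers, hidden_size, bytes_per_value):
    # Factor of 2: one tensor of keys and one of values per layer.
    return 2 * n_layers * seq_len * hidden_size * bytes_per_value

fp16 = kv_cache_bytes(seq_len=128_000, n_layers=32,
                      hidden_size=4096, bytes_per_value=2)
print(f"FP16 KV cache: {fp16 / 1e9:.1f} GB")            # ~67 GB
print(f"~3-bit cache : {fp16 * 3 / 16 / 1e9:.1f} GB")   # ~12.6 GB at 3 bits/value
```

Tens of gigabytes for the cache alone, before the model weights themselves are even loaded.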
This memory demand limits the context length an LLM can handle on a given GPU. It also raises costs for companies running AI services at scale. Earlier compression methods helped somewhat, but many required extra preprocessing, worked only offline, or caused small drops in quality. TurboQuant is different: it is fully online and data-oblivious, meaning it compresses tokens in real time as they arrive, without needing to analyze the data distribution first.
How TurboQuant Works
At a high level, TurboQuant is a smarter way of compressing the memory used by large language models without losing important information. Instead of storing every number with full precision, the algorithm finds ways to store the same information using far fewer bits.
The process works in three main steps.
1. Random Rotation
First, the algorithm applies a fixed mathematical rotation to each key and value vector before it is cached.
This rotation spreads the information more evenly across the vector. As a result, the numbers become easier to compress because their distribution becomes more regular and predictable.
In simple terms, it reorganizes the data so compression works better.
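A small NumPy experiment with a dense random orthogonal matrix illustrates the effect. (Practical implementations typically use fast structured transforms such as a randomized Hadamard transform rather than a dense matrix; this sketch only shows the principle.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # illustrative head dimension

# A fixed random rotation: drawn once, then applied to every vector.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = np.zeros(d)
x[0] = 10.0        # a "spiky" vector: all of its energy in one coordinate
y = Q @ x          # after rotation, the energy is spread across coordinates

print(np.abs(x).max(), np.abs(y).max())      # the largest coordinate shrinks
print(np.linalg.norm(x), np.linalg.norm(y))  # lengths (and dot products) survive
```

Because the rotation is orthogonal, it preserves all inner products and can be inverted exactly, so attention scores computed after rotation are unchanged; only the coordinate distribution that the quantizer sees becomes flatter.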
2. Efficient Quantization
Once the numbers are in this cleaner form, TurboQuant compresses them using a technique called scalar quantization.
Instead of storing each value with 16 or 32 bits, the algorithm represents it using just a few discrete levels.
This means each value can often be stored using only 3–4 bits, dramatically reducing memory usage.
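A minimal uniform scalar quantizer makes the bit accounting concrete. (This is a generic sketch; TurboQuant's actual quantizer design and bit allocation may differ from the uniform scheme shown here.)

```python
import numpy as np

def quantize(x, bits=4):
    """Map each value to one of 2**bits uniformly spaced levels."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)  # stored at `bits` per value
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return lo + codes * scale

rng = np.random.default_rng(1)
x = rng.standard_normal(4096).astype(np.float32)  # stand-in for rotated cache values
codes, lo, scale = quantize(x, bits=4)
x_hat = dequantize(codes, lo, scale)
print("max reconstruction error:", np.abs(x - x_hat).max())  # about scale / 2
```

At 4 bits per value instead of 16, the cache shrinks 4×, and the rotation step above is what keeps the worst-case error small by preventing outliers from stretching the quantization range.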
3. Accuracy Correction
Language models rely heavily on similarity calculations between vectors when computing attention.
Aggressive compression could distort these similarity scores. To prevent that, TurboQuant adds a small correction step that estimates the tiny errors introduced during compression and adjusts for them.
This keeps the similarity calculations accurate.
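One simple way to implement such a correction, shown below as an illustrative stand-in rather than TurboQuant's exact estimator, is residual quantization: after coarsely quantizing a key, quantize the leftover error with a few extra bits and add it back when reconstructing.

```python
import numpy as np

def quantize(x, bits):
    """Uniform scalar quantizer that returns the dequantized approximation."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((x - lo) / scale) * scale

rng = np.random.default_rng(2)
q = rng.standard_normal(128)   # a query vector
k = rng.standard_normal(128)   # a key vector to be cached

k1 = quantize(k, bits=3)             # stage 1: coarse 3-bit approximation
k2 = k1 + quantize(k - k1, bits=2)   # stage 2: add back the quantized residual

exact = q @ k
print("3-bit error     :", abs(q @ k1 - exact))
print("with correction :", abs(q @ k2 - exact))
```

The residual stage tightens the per-coordinate reconstruction error, which in turn tightens the dot products that attention depends on, at the cost of only a couple of extra bits per value.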
By combining these steps, TurboQuant creates a two-stage compression method that can shrink the KV cache to around 3 bits per value while maintaining the same model accuracy. In practice, this means dramatically smaller memory usage with almost no additional computational cost.
The approach also builds on earlier research techniques such as PolarQuant and the Quantized Johnson-Lindenstrauss transform, but integrates them into a simpler and more practical system.
Real-World Performance
Google evaluated TurboQuant on several widely used long-context benchmarks, including LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER, and L-Eval. These benchmarks cover tasks such as question answering, summarization, code reasoning, and retrieval across extremely long contexts.
The experiments were conducted on multiple open-source models, including Llama-3.1-8B-Instruct, Gemma, and models from the Mistral family.
The results were striking. At compression levels of around 3–3.5 bits per value, TurboQuant matched — and in some cases slightly exceeded — the performance of full-precision KV caches on LongBench tasks. In Needle-in-a-Haystack evaluations, the system maintained perfect recall even at context lengths exceeding 100,000 tokens.
Performance improvements were not limited to memory savings. On NVIDIA H100 GPUs, the 4-bit configuration accelerated attention logit computation by up to 8× compared with standard 32-bit representations. This translates into both faster inference and significantly lower GPU memory requirements.
Early community experiments have also reported similar behavior on other models, suggesting that TurboQuant's compression approach generalizes well across architectures.
Compared with earlier KV-cache quantization techniques such as KIVI, TurboQuant achieves stronger compression ratios while preserving model accuracy, making it one of the most promising efficiency improvements proposed for long-context LLM inference.
How This Changes the Future of LLMs
TurboQuant is more than just a clever optimization — it could reshape how LLMs are built and used.
First, longer contexts become practical. With 6x less KV cache memory, the same hardware can support much longer conversations, documents, or code repositories. This is especially useful for AI agents, legal document analysis, scientific research, or personalized assistants that need to remember a lot of information.
Second, lower costs. Inference expenses can drop significantly (some estimates suggest over 50% savings in certain setups) because less memory means lower GPU usage and reduced data movement. This makes powerful LLMs more affordable for smaller companies and cloud providers.
Third, better access on everyday devices. Running strong models with large contexts on laptops, workstations, or even edge devices becomes more realistic. Open-source communities can experiment more freely without needing expensive hardware clusters.
Fourth, it shifts the focus in model design. Instead of worrying so much about memory walls, researchers can explore bigger context windows, more sophisticated reasoning chains, or multimodal systems. The bottleneck moves from hardware limits to new ideas and better training data.
It also benefits retrieval-augmented generation (RAG) and vector databases, where similar compression techniques can speed up search while saving storage.
In short, TurboQuant helps break the memory barrier that has slowed down progress in scaling LLMs. Future models may no longer need to sacrifice context length or quality due to hardware constraints.
Remaining Challenges and Next Steps
Like any new technique, TurboQuant still needs real-world integration. Efficient GPU kernels (for example, using Triton) will be important to achieve the full speed gains. While it works well on the tested 7B–8B models, broader validation on larger models (70B+) and diverse workloads will be valuable.
Adoption will also depend on easy integration into popular frameworks like Hugging Face Transformers or vLLM. Early community experiments suggest this is already happening.
Looking ahead, combining TurboQuant with other efficiency methods — such as model quantization or speculative decoding — could bring even greater gains.
TurboQuant is a clean, mathematically principled solution to a real pain point in AI: the exploding memory cost of long-context generation. By compressing the KV cache to just 3 bits per value with no accuracy loss, Google has delivered a tool that can make LLMs faster, cheaper, and more capable at every scale.
This breakthrough does not just improve one part of the system — it changes the economics and possibilities of the entire field. As developers and companies start applying TurboQuant, we can expect a new wave of more efficient and accessible AI applications.
The memory wall that once limited LLMs is now much lower. What comes next will depend less on raw hardware and more on human creativity in building smarter AI systems.

