Google TurboQuant Cuts LLM Memory 6x at Zero Accuracy Cost
Google's plug-in KV cache compression achieves 6x memory reduction and 8x attention speedup on Llama-3.1 with no fine-tuning required.
Six times less GPU memory, identical model quality, and no retraining required. Those are the numbers behind TurboQuant, a new compression system from Google Research targeting the single hardest bottleneck in large language model deployment: the KV cache.
The system was developed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni, a VP and Google Fellow at Google Research. It targets a problem that has resisted clean solutions at scale.
The KV cache stores the key and value vectors that attention computes for every token during inference, so they do not have to be recomputed at each step. At long context lengths and large batch sizes, it can consume more GPU memory than the model weights themselves, and until now every serious compression attempt traded accuracy for efficiency. TurboQuant breaks that tradeoff with two complementary techniques. PolarQuant converts Cartesian key-value vectors into polar coordinates, eliminating expensive per-block normalization overhead. QJL (Quantized Johnson-Lindenstrauss) then applies a 1-bit residual correction layer on top, correcting the remaining quantization error without touching model weights.
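The 1-bit idea behind QJL can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google's implementation: it projects a key vector with a random Gaussian matrix, keeps only the sign of each projected coordinate plus the key's norm, and then estimates a query-key inner product from those bits. The function names (make_projection, encode_key, estimate_inner_product) are hypothetical, the projection size is arbitrary, and the sketch is applied to a raw vector rather than to the quantization residual described above.

```python
import math
import random


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def make_projection(num_rows, dim, seed=0):
    """Random Gaussian projection matrix (hypothetical helper, i.i.d. N(0, 1) entries)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def encode_key(projection, key):
    """Store one sign bit per projected coordinate, plus the key's Euclidean norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in projection]
    key_norm = math.sqrt(dot(key, key))
    return signs, key_norm


def estimate_inner_product(projection, query, signs, key_norm):
    """Estimate <query, key> from the sign bits.

    For a Gaussian row s, E[(s . q) * sign(s . k)] = sqrt(2 / pi) * <q, k> / ||k||,
    so rescaling the average by sqrt(pi / 2) * ||k|| gives an unbiased estimate.
    The query itself is projected at full precision; only the key is reduced to bits.
    """
    m = len(projection)
    acc = sum(dot(row, query) * s for row, s in zip(projection, signs))
    return math.sqrt(math.pi / 2) * key_norm * acc / m
```

With more projection rows the estimate tightens around the exact inner product; the error shrinks roughly with the square root of the number of rows, which is why a single bit per coordinate can still be useful as a correction term.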
How the Compression Actually Works
The polar coordinate approach is more elegant than it sounds. Standard vector quantization normalizes vectors in blocks, computing and storing per-block constants that add overhead on every forward pass. Changing coordinate systems sidesteps that overhead entirely. The resulting distortion stays within a constant factor of approximately 2.7 of the theoretical optimum across all bit-widths, meaning the method sits provably close to the best distortion any quantizer of this class can achieve.
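A toy version of the coordinate change makes the point concrete. The sketch below is not the PolarQuant algorithm itself, whose details the article does not spell out; it only quantizes a single 2D pair of coordinates, with hypothetical function names, and it keeps the radius at full precision for simplicity. What it shows is that an angle always lives in a fixed range, so a fixed grid can encode it without any per-block scale or offset.

```python
import math


def polar_quantize_pair(x, y, angle_bits):
    """Convert (x, y) to polar form and snap the angle to a fixed uniform grid.

    The grid covers [-pi, pi) regardless of the data, so no per-block
    normalization constant is needed to interpret the returned code.
    """
    levels = 1 << angle_bits
    step = 2 * math.pi / levels
    radius = math.hypot(x, y)
    theta = math.atan2(y, x)  # always within [-pi, pi]
    code = round((theta + math.pi) / step) % levels  # wrap +pi back onto -pi
    return radius, code


def polar_dequantize_pair(radius, code, angle_bits):
    """Rebuild approximate Cartesian coordinates from the radius and angle code."""
    levels = 1 << angle_bits
    step = 2 * math.pi / levels
    theta = code * step - math.pi
    return radius * math.cos(theta), radius * math.sin(theta)
```

Because the angle is off by at most half a grid step, the reconstruction error of this toy scheme is bounded by radius * step / 2, and each extra angle bit halves that bound.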
On Llama-3.1, compressing the KV cache to 3 bits per channel matches full FP16 precision across all five long-context benchmarks tested: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. At 2.5 bits per channel, quality degradation remains marginal. Hardware tests on NVIDIA H100 GPUs show an 8x speedup in attention logit computation, a gain that flows directly to inference throughput and per-token cost.
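To put the bit-widths in context, here is a back-of-the-envelope estimate of KV cache size. It is a rough sketch, not a measurement: the article does not say which Llama-3.1 variant was tested, so the shape below (32 layers, 8 key-value heads, head dimension 128, 131,072-token context) is an assumption taken from the published Llama-3.1-8B configuration, and the helper name kv_cache_bytes is hypothetical. The calculation also ignores any side information a real quantizer stores alongside the codes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per key-value head.
    """
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value / 8


# Assumed model shape (Llama-3.1-8B); not stated in the article.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
q3_bytes = kv_cache_bytes(**shape, bits_per_value=3)

print(fp16_bytes / 2**30)               # 16.0 (GiB at FP16)
print(q3_bytes / 2**30)                 # 3.0  (GiB at 3 bits per channel)
print(round(fp16_bytes / q3_bytes, 2))  # 5.33
```

Under these assumptions a single full-length sequence needs about as much memory for its FP16 cache as the 8B model's FP16 weights, and the raw 16-to-3-bit ratio of roughly 5.3x lands in the neighborhood of the headline figure.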
A Structural Shift for AI Hardware
The practical upshot is stark. Workloads whose memory footprint is dominated by the KV cache, and that previously required an 80GB NVIDIA A100, can now run on a consumer Mac Mini with 16GB unified memory. That is not an incremental improvement; it is a category shift. No KV cache compression technique that preserves accuracy had crossed that threshold before.
The competitive implications run deeper than unit economics. Cloud providers have built meaningful moats around GPU availability and memory capacity. A plug-in compression layer that cuts memory requirements sixfold without retraining erodes that structural advantage. Every production deployment using TurboQuant could reduce GPU memory costs by roughly 80%, according to the Google Research blog. Smaller AI companies that previously could not afford the infrastructure to serve large models at scale now have a credible path forward.
As a secondary result, TurboQuant also outperforms product quantization and RaBitQ on vector similarity search recall while substantially reducing indexing time, extending its utility beyond KV cache compression into broader retrieval workloads.