As LLMs serve longer contexts (documents, codebases, multi-turn conversations), the KV cache becomes one of the dominant memory costs at inference time. A single long-context request can consume more GPU memory for the KV cache than for the model weights themselves. Multiply that by hundreds of concurrent users, and it becomes the primary constraint on batch size and throughput.
TurboQuant, from Google Research and presented at ICLR 2026, compresses KV cache tensors to 3–4 bits with provably zero accuracy loss. That's a 6× memory reduction and up to 8× throughput improvement on H100 GPUs, and it requires no fine-tuning, no calibration data, and no changes to the model.
The algorithm has three steps: random rotation, per-element quantization, and QJL residual correction. Each is elegant. Together they make extreme KV cache compression safe.
## The Core Problem: KV Cache Outliers
KV cache tensors (the stored key and value vectors for each attention head) share a statistical property with most neural network activations: outliers. A few coordinates dominate the magnitude range, and naive quantization is ruined by them.
Suppose a key vector has 4 coordinates:
[ 0.001, 0.002, 0.001, 47.3 ]
One coordinate dominates. If you apply 3-bit quantization to this vector, all 8 levels must span the range 0.001 to 47.3. Each bucket is roughly 6 units wide, and the three small values, which often carry important attention signal, all collapse into the same bin. You've thrown away 75% of the information because one outlier dominated the range.
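The collapse is easy to reproduce. A minimal NumPy sketch (illustrative only, not TurboQuant's quantizer) applying a naive 3-bit uniform grid to the vector above:

```python
import numpy as np

# The 4-coordinate key vector from the text: three small values, one outlier.
x = np.array([0.001, 0.002, 0.001, 47.3])

# Naive 3-bit quantization: 8 evenly spaced levels spanning the full range.
levels = np.linspace(x.min(), x.max(), 8)

# Snap each coordinate to its nearest level.
quantized = levels[np.abs(x[:, None] - levels).argmin(axis=1)]

print(quantized)  # the three small values all collapse into the lowest bucket
```

Only two of the eight levels are ever used; the fine structure among the small coordinates is gone.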
This is why naive INT4 quantization of KV caches degrades model quality. TurboQuant solves it with three steps.
## Step 1 – Random Rotation: Spread the Energy
The first step is to multiply the KV vector by a random orthogonal matrix R:
x' = R · x
This is a random rotation in high-dimensional space. It mixes all coordinates together, spreading the outlier's energy across the entire vector:
Before rotation: [ 0.001, 0.002, 0.001, 47.3 ]
After rotation: [ 23.6, -23.6, 23.6, -23.6 ]
The L2 norm is preserved (rotations are isometric), but now no single coordinate dominates. All 8 quantization levels are used efficiently; none are wasted on covering a sparse tail.
Key mathematical fact: A random rotation of any vector produces coordinates that are approximately i.i.d. Gaussian, regardless of the original distribution. This sets up the next step perfectly.
A full random matrix multiplication is O(d²) and expensive. TurboQuant uses structured random rotations (Hadamard matrices + random sign flips) that run in O(d log d), the same trick as fast Johnson-Lindenstrauss transforms.
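The structured rotation can be sketched with a textbook fast Walsh-Hadamard transform plus random sign flips. This is an illustration of the idea, not the paper's implementation; the `fwht` helper here is hypothetical and assumes the dimension is a power of two:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Fast Walsh-Hadamard transform in O(d log d); d must be a power of two."""
    x = x.copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)  # scale so the transform is orthonormal (a rotation)

rng = np.random.default_rng(0)
d = 1024
x = np.zeros(d)
x[7] = 47.3  # one extreme outlier, everything else zero

# Random sign flips followed by the Hadamard transform: a structured
# random rotation in the spirit of the fast Johnson-Lindenstrauss transform.
signs = rng.choice([-1.0, 1.0], size=d)
x_rot = fwht(signs * x)

print(np.linalg.norm(x_rot))  # equals ||x||: the rotation is an isometry
print(np.abs(x_rot).max())    # ~47.3 / sqrt(1024): energy is spread out
```

After the transform every coordinate has magnitude about 47.3/√1024 ≈ 1.5, so a fixed quantizer sees a well-conditioned input.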
## Step 2 – Per-Element Quantization
After rotation, the coordinates are approximately Gaussian with no outliers. TurboQuant applies high-quality per-element quantization to this well-conditioned distribution.
Because the rotation has spread energy uniformly, the quantizer can focus all its levels on the dense central region; no bits are wasted on tails. At 3 bits (8 levels), this achieves near-optimal compression of the rotated KV vectors. The rotation and quantizer are designed as a matched pair: the rotation makes the input approximately Gaussian, and the quantizer is optimized for Gaussian input.
| Quantizer type | Level placement | Efficiency on Gaussian input |
|---|---|---|
| Uniform (naive) | Evenly spaced across full range | Wastes bits on tails |
| Per-element (TurboQuant) | Concentrated in high-density region | Near-optimal MSE |
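The table's claim can be checked numerically. A small experiment assuming the rotated coordinates are standard Gaussian: a plain Lloyd iteration (a stand-in for density-matched level placement, not TurboQuant's actual codebook design) moves the 8 levels toward the MSE-optimal positions and beats the uniform grid:

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.standard_normal(50_000)  # post-rotation coordinates are ~Gaussian

def quantize(x, levels):
    """Snap each value to its nearest codebook level."""
    return levels[np.abs(x[:, None] - levels).argmin(axis=1)]

# Uniform 3-bit grid across the full sample range: levels land in the tails.
uniform = np.linspace(coords.min(), coords.max(), 8)

# Plain Lloyd iteration: repeatedly move each level to the centroid of the
# samples it captures, approaching the MSE-optimal codebook for this data.
lloyd = uniform.copy()
for _ in range(30):
    assign = np.abs(coords[:, None] - lloyd).argmin(axis=1)
    for k in range(len(lloyd)):
        if np.any(assign == k):
            lloyd[k] = coords[assign == k].mean()

mse_uniform = np.mean((coords - quantize(coords, uniform)) ** 2)
mse_lloyd = np.mean((coords - quantize(coords, lloyd)) ** 2)
print(mse_uniform, mse_lloyd)  # the density-matched codebook wins
```

The density-matched codebook concentrates its levels near zero, where almost all Gaussian mass lives, which is exactly the "no bits wasted on tails" effect described above.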
## Step 3 – QJL Residual Correction: Make It Unbiased
Even with good quantization, there's a residual, the gap between the true value and the nearest codepoint:
True value: 0.31
Nearest level: 0.28
Residual: r = 0.03
If these residuals are systematically biased in one direction, attention scores drift from their true values, corrupting model outputs subtly but persistently. The QJL (Quantized Johnson-Lindenstrauss) correction step eliminates this:
- Compute the residual vector r.
- Pick a random projection matrix S and compute S·r.
- Store only `sign(S·r)`, one bit per projected coordinate.
- At reconstruction, use `sign(S·r)` to recover an unbiased estimate of r.
The Johnson-Lindenstrauss lemma guarantees that the sign of a random projection is an unbiased 1-bit sketch of the direction of the original vector. Even though only 1 bit per coordinate of r is stored, the reconstruction is unbiased in expectation.
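The unbiasedness can be illustrated directly. A sketch assuming Gaussian projection rows and using the identity E[sign(⟨s, r⟩)·s] = √(2/π)·r/‖r‖ for s ~ N(0, I); the names and shapes here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 100_000  # residual dimension, number of 1-bit sketches

r = rng.standard_normal(d)       # a residual vector to be sketched
S = rng.standard_normal((m, d))  # Gaussian projection rows

bits = np.sign(S @ r)            # all that is stored: one bit per row

# Averaging the sign-weighted rows recovers r's direction without bias,
# up to the known scale factor sqrt(2/pi).
est_dir = (bits[:, None] * S).mean(axis=0) / np.sqrt(2 / np.pi)

true_dir = r / np.linalg.norm(r)
print(np.dot(est_dir, true_dir))  # close to 1: the direction is recovered
```

In practice far fewer sketches are used per residual; the point is that the estimator's error averages out to zero rather than accumulating as a systematic drift.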
Total storage: (b−1) bits for the main quantization + 1 bit for the QJL residual = b bits per coordinate. At b=3: a ~6× memory reduction vs the FP16 KV cache, with provably zero bias.
## Why This Matters for LLM Serving
KV cache is the inference-time memory bottleneck for long contexts. With a 128K-token context window and a large model, a single request's KV cache can consume tens of gigabytes. At scale, with hundreds of concurrent users, this caps batch size and directly limits throughput.
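To make "tens of gigabytes" concrete, a back-of-the-envelope calculation for a hypothetical 70B-class model with grouped-query attention. All of the configuration numbers below are illustrative assumptions, not figures from the post:

```python
# Hypothetical 70B-class configuration with grouped-query attention.
layers, kv_heads, head_dim = 80, 8, 128
seq_len = 128_000  # tokens in context
bytes_fp16 = 2     # bytes per FP16 value

# Keys and values are both cached: hence the leading factor of 2.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_fp16

print(kv_bytes / 1e9)      # ~41.9 GB per request at FP16
print(kv_bytes / 1e9 / 6)  # ~7.0 GB with TurboQuant's reported ~6x reduction
```

Even with grouped-query attention already shrinking the cache, a single 128K-token request lands in the tens of gigabytes, which is why compressing it dominates serving economics.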
TurboQuant's reported results on H100:
| Metric | FP16 baseline | TurboQuant (3-bit) |
|---|---|---|
| KV cache memory | 1× | ~6× reduction |
| Throughput (H100) | 1× | Up to 8× improvement |
| Accuracy loss | – | Zero (provably unbiased) |
| Training / fine-tuning needed | – | None |
The training-free property is critical for practical adoption. TurboQuant can be applied post-hoc to any existing model (Llama, Gemma, Mistral) without retraining or calibration datasets.
## Connecting to My Work on RAMP
TurboQuant and my RAMP paper (arXiv:2603.17891) operate on different parts of the inference stack. RAMP is about post-training weight quantization: using RL to assign per-layer bit widths that minimize perplexity under a memory budget. TurboQuant is about KV cache quantization at inference time.
Both share the core insight that not all dimensions deserve equal precision. RAMP uses RL to discover which layers can tolerate fewer bits. TurboQuant uses random rotation to make all KV coordinates equally representable before applying a fixed quantizer. Different targets, same underlying philosophy.
The Scale Folding technique in RAMP, which absorbs activation outliers into per-channel weight scaling, is also spiritually related to TurboQuant's rotation step. Both are preconditioning operations designed to eliminate outliers before they destroy the quantization budget.
## Summary
| Step | What it does | Why it works |
|---|---|---|
| Random Rotation | Spreads KV outlier energy across all coordinates | Rotation preserves L2 norm; output is approximately Gaussian |
| Per-Element Quantization | Quantizes the well-conditioned rotated representation efficiently | No wasted bits on sparse tails; near-optimal MSE |
| QJL Residual | Corrects quantization error with 1 extra bit | Johnson-Lindenstrauss: random projections give unbiased sign sketches |
The result: a 3-bit KV cache, ~6× memory reduction, up to 8× throughput on H100, provably zero accuracy loss, no training required. Presented at ICLR 2026.
Read more: Google Research Blog – TurboQuant
Related: my paper on RAMP (mixed-precision weight quantization via RL).
Questions? @Asg_Wolverine