TurboQuant: How Google Compresses the KV Cache by 6× Without Losing Accuracy

As LLMs serve longer contexts (documents, codebases, multi-turn conversations), the KV cache becomes one of the dominant memory costs at inference time. A single long-context request can consume more GPU memory for the KV cache than for the model weights themselves. Multiply that by hundreds of concurrent users, and it becomes the primary constraint on batch size and throughput.

TurboQuant, from Google Research and presented at ICLR 2026, compresses KV cache tensors to 3–4 bits with provably zero accuracy loss. That's a 6× memory reduction and up to 8× throughput improvement on H100 GPUs, and it requires no fine-tuning, no calibration data, and no changes to the model.

The algorithm has three steps: random rotation, per-element quantization, and QJL residual correction. Each is elegant. Together they make extreme KV cache compression safe.

The Core Problem: KV Cache Outliers

KV cache tensors, the stored key and value vectors for each attention head, share a statistical property with most neural network activations: outliers. A few coordinates dominate the magnitude range, and naive quantization is ruined by them.

Suppose a key vector has 4 coordinates:

[ 0.001,  0.002,  0.001,  47.3 ]

One coordinate dominates. If you apply 3-bit quantization to this vector, all 8 levels must span the range 0.001 to 47.3. Each bucket is roughly 6 units wide, and the three small values, which often carry important attention signal, all collapse into the same bin. You've thrown away 75% of the information because one outlier dominated the range.

This is why naive INT4 quantization of KV caches degrades model quality. TurboQuant solves it with three steps.
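A tiny NumPy sketch makes the failure concrete. This is a minimal illustration of naive uniform quantization, not TurboQuant's actual quantizer:

```python
import numpy as np

# A toy key vector with one dominant outlier, as in the example above.
x = np.array([0.001, 0.002, 0.001, 47.3])

def uniform_quantize(v, bits):
    """Naive uniform quantization: spread 2**bits levels evenly across the full range."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((v - lo) / step) * step

xq = uniform_quantize(x, bits=3)
print(xq)  # the three small coordinates all snap to the same level
```

The outlier forces the step size to ~6.76, so 0.001, 0.002, and 0.001 become indistinguishable: three quarters of the vector's coordinates carry no information after quantization.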

Step 1 – Random Rotation: Spread the Energy

The first step is to multiply the KV vector by a random orthogonal matrix R:

x' = R · x

This is a random rotation in high-dimensional space. It mixes all coordinates together, spreading the outlier's energy across the entire vector:

Before rotation:  [ 0.001,  0.002,  0.001,  47.3  ]
After rotation:   [ 23.6,  -23.6,   23.6,  -23.6  ]

The L2 norm is preserved (rotations are isometric), but now no single coordinate dominates. All 8 quantization levels are used efficiently; none is wasted on covering a sparse tail.

Key mathematical fact: A random rotation of any vector produces coordinates that are approximately i.i.d. Gaussian, regardless of the original distribution. This sets up the next step perfectly.

A full random matrix multiplication is O(d²) and expensive. TurboQuant uses structured random rotations (Hadamard matrices + random sign flips) that run in O(d log d), the same trick as fast Johnson–Lindenstrauss transforms.
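Here is a minimal sketch of such a structured rotation, using a fast Walsh–Hadamard transform with random sign flips. This is my own illustrative implementation, not Google's code, and it assumes the dimension is a power of 2:

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform, O(d log d); len(v) must be a power of 2."""
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

def random_rotate(v, rng):
    """Structured random rotation: random sign flips, then a normalized Hadamard."""
    d = len(v)
    signs = rng.choice([-1.0, 1.0], size=d)
    # Dividing by sqrt(d) makes the Hadamard matrix orthogonal, so norms are preserved.
    return fwht(signs * v) / np.sqrt(d)

rng = np.random.default_rng(0)
x = np.array([0.001, 0.002, 0.001, 47.3])
xr = random_rotate(x, rng)
print(np.linalg.norm(x), np.linalg.norm(xr))  # norms match: the rotation is an isometry
print(xr)                                     # outlier energy spread across all coordinates
```

After the rotation, every coordinate has magnitude near 23.6, so a quantizer no longer has to cover a 0.001-to-47.3 range.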

Step 2 – Per-Element Quantization

After rotation, the coordinates are approximately Gaussian with no outliers. TurboQuant applies high-quality per-element quantization to this well-conditioned distribution.

Because the rotation has spread energy uniformly, the quantizer can focus all its levels on the dense central region, with no bits wasted on tails. At 3 bits (8 levels), this achieves near-optimal compression of the rotated KV vectors. The rotation and quantizer are designed as a matched pair: the rotation makes the input approximately Gaussian, and the quantizer is optimized for Gaussian input.

Quantizer type            Level placement                       Efficiency on Gaussian input
Uniform (naive)           Evenly spaced across full range       Wastes bits on tails
Per-element (TurboQuant)  Concentrated in high-density region   Near-optimal MSE
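The gap between the two rows can be measured directly. The sketch below compares a uniform 3-bit quantizer against one whose levels are matched to a unit Gaussian (the classic 8-level Lloyd–Max reconstruction values); TurboQuant's exact codebook may differ, but the effect is the same:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(50_000)  # stand-in for rotated KV coordinates

# Classic Lloyd-Max reconstruction levels for an 8-level quantizer on N(0, 1).
gauss_levels = np.array([-2.152, -1.344, -0.756, -0.245,
                          0.245,  0.756,  1.344,  2.152])
# Naive uniform levels spanning the empirical range (dominated by the tails).
uniform_levels = np.linspace(x.min(), x.max(), 8)

def quantize(v, levels):
    """Snap each value to its nearest codepoint."""
    idx = np.abs(v[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

mse_gauss = np.mean((x - quantize(x, gauss_levels)) ** 2)
mse_uniform = np.mean((x - quantize(x, uniform_levels)) ** 2)
print(mse_gauss, mse_uniform)  # Gaussian-matched levels give far lower MSE
```

On this synthetic data the Gaussian-matched codebook cuts the mean squared error by roughly 3x versus uniform levels, which is exactly why the rotation-plus-matched-quantizer pairing matters.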

Step 3 – QJL Residual Correction: Make It Unbiased

Even with good quantization, there's a residual: the gap between the true value and the nearest codepoint:

True value:    0.31
Nearest level: 0.28
Residual:  r = 0.03

If these residuals are systematically biased in one direction, attention scores drift from their true values, corrupting model outputs subtly but persistently. The QJL (Quantized Johnson-Lindenstrauss) correction step eliminates this:

  1. Compute the residual vector r.
  2. Pick a random projection matrix S and compute S·r.
  3. Store only sign(S·r), one bit per projected coordinate.
  4. At reconstruction, use sign(S·r) to recover an unbiased estimate of r.

The Johnson-Lindenstrauss lemma guarantees that random projections preserve geometry, and for Gaussian projections the signs alone are enough: inner products with r can be estimated from sign(S·r) without bias. Even though only 1 bit per projected coordinate is stored, the reconstruction is unbiased in expectation.
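A rough NumPy illustration of the unbiasedness claim, using a QJL-style sign-sketch estimator for the inner product between a query q and a residual r. The sketch size m here is deliberately exaggerated so the estimate visibly concentrates; in the actual scheme the budget is about one bit per coordinate:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 64, 50_000            # residual dimension; oversized sketch for the demo
r = rng.standard_normal(d)   # quantization residual vector
q = rng.standard_normal(d)   # query direction used in attention

S = rng.standard_normal((m, d))
bits = np.sign(S @ r)        # all that is stored: one sign bit per projection

# For Gaussian S, E[(S @ q)_i * sign((S @ r)_i)] = sqrt(2/pi) * <q, r> / ||r||,
# so rescaling by ||r|| * sqrt(pi/2) gives an unbiased estimate of <q, r>.
est = np.linalg.norm(r) * np.sqrt(np.pi / 2) * np.mean((S @ q) * bits)
print(est, q @ r)            # the two values agree closely for large m
```

The key point is that the error of this estimator is zero-mean: averaging over the randomness of S, the sign bits neither inflate nor deflate attention scores.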

Total storage: (b−1) bits for the main quantization + 1 bit for the QJL residual = b bits per coordinate. At b=3: ~6× memory reduction vs an FP16 KV cache, with provably zero bias.

Why This Matters for LLM Serving

KV cache is the inference-time memory bottleneck for long contexts. With a 128K-token context window and a large model, a single request's KV cache can consume tens of gigabytes. At scale, with hundreds of concurrent users, this caps batch size and directly limits throughput.

TurboQuant's reported results on H100:

Metric                          FP16 baseline   TurboQuant (3-bit)
KV cache memory                 1×              ~6× reduction
Throughput (H100)               1×              Up to 8× improvement
Accuracy loss                   -               Zero (provably unbiased)
Training / fine-tuning needed   -               None

The training-free property is critical for practical adoption. TurboQuant can be applied post-hoc to any existing model (Llama, Gemma, Mistral) without retraining or calibration datasets.

Connecting to My Work on RAMP

TurboQuant and my RAMP paper (arXiv:2603.17891) operate on different parts of the inference stack. RAMP is about post-training weight quantization: using RL to assign per-layer bit widths that minimize perplexity under a memory budget. TurboQuant is about KV cache quantization at inference time.

Both share the core insight that not all dimensions deserve equal precision. RAMP uses RL to discover which layers can tolerate fewer bits. TurboQuant uses random rotation to make all KV coordinates equally representable before applying a fixed quantizer. Different targets, same underlying philosophy.

The Scale Folding technique in RAMP, which absorbs activation outliers into per-channel weight scaling, is also spiritually related to TurboQuant's rotation step. Both are preconditioning operations designed to eliminate outliers before they destroy the quantization budget.

Summary

Step                       What it does                                        Why it works
Random Rotation            Spreads KV outlier energy across all coordinates    Rotation preserves L2 norm; output is approximately Gaussian
Per-Element Quantization   Quantizes the rotated (Gaussian-like) coordinates   No wasted bits on sparse tails; near-optimal MSE
QJL Residual               Corrects quantization error with 1 extra bit        Johnson-Lindenstrauss: random projections give unbiased sign sketches

The result: a 3-bit KV cache, ~6× memory reduction, up to 8× throughput on H100, provably zero accuracy loss, and no training required. Presented at ICLR 2026.