Large language models are powerful, but they come with a problem that's easy to understate: they're enormous. A 70B-parameter model in full float32 precision takes roughly 280 GB of memory, far more than fits on any single GPU. Even in bfloat16, that's still 140 GB. Running these models at any kind of scale requires us to get creative.
Quantization is one of the most practical solutions we have. The idea is simple: instead of storing each weight as a 32-bit float, we compress it down to 8, 4, or even fewer bits. Done carefully, you can compress a model by 2–8× with only a small hit to output quality.
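To make the idea concrete, here is a minimal sketch of the simplest scheme, symmetric absmax quantization to 8 bits. The function names and the single per-tensor scale are illustrative choices, not any particular library's API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: map floats onto [-127, 127] integers."""
    scale = np.abs(w).max() / 127.0           # one scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)   # stored weights: 1 byte each
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())  # at most about scale / 2
```

Everything that follows is, at heart, a variation on this round trip: choosing better scales, better grids, and deciding which weights deserve more precision.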
I built the LLM Quantization Gallery as an interactive way to explore these techniques, comparing methods, quality metrics, and tradeoffs side by side. Here's the thinking behind it.
## Why Quantization Matters
At Dell's CSG CTO Lab, a significant chunk of my work involves making LLM inference faster and cheaper. We deal with questions like: can we run a 70B model on a single node? Can we cut latency by 2× without retraining? Quantization is almost always part of the answer.
But quantization isn't a single thing; it's a family of techniques with very different characteristics. The difference between INT8 and INT4 isn't just a number: it involves different accuracy tradeoffs, different hardware support, and sometimes entirely different calibration pipelines. I found myself constantly wanting a single place to compare them visually. So I made one.
## The Main Methods, Explained
### INT8: The Safe Default
Integer 8-bit quantization is the most widely supported and least risky option. Most modern GPUs handle INT8 natively with dedicated Tensor Core support. You typically get a ~2× memory reduction versus FP16 with minimal perplexity loss (under 0.5 points on most benchmarks). It's the right choice when you want a quick win without touching training.
### INT4: High Compression, More Risk
Going to 4 bits gives you ~4× compression versus FP16, but the quality gap widens. The key insight from papers like GPTQ and AWQ is that naive rounding doesn't work at 4 bits; you need smarter calibration to preserve the weights that matter most.
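You can see why naive rounding degrades in a few lines. With only 15 levels in a symmetric 4-bit grid, a single outlier weight stretches the scale and most of the grid goes unused (the synthetic weights and the injected outlier below are illustrative, not from any real model):

```python
import numpy as np

def naive_quantize(w, bits):
    """Round-to-nearest with a single symmetric absmax scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 at 8 bits, only 7 at 4 bits
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale    # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000).astype(np.float32)
w[0] = 0.5                                # a single outlier stretches the scale
err8 = np.mean((w - naive_quantize(w, 8)) ** 2)
err4 = np.mean((w - naive_quantize(w, 4)) ** 2)
print(err8, err4)                          # 4-bit MSE is orders of magnitude larger
```

This gap is exactly what calibration-based methods like GPTQ and AWQ are designed to close.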
### GPTQ: Post-Training, Layer by Layer
GPTQ (short for Generative Pre-trained Transformer Quantization) quantizes weights one layer at a time, using a small calibration dataset and approximate second-order information to minimize each layer's reconstruction error. It's offline and one-shot, with no retraining required. The resulting model runs with INT4 weights but often achieves near-INT8 quality. This is what powers many of the quantized Llama and Mistral variants you see in the wild.
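The heart of the method is an error-compensation loop: quantize one column of the weight matrix, then nudge the not-yet-quantized columns to cancel the damage, weighted by the inverse Hessian of the layer's reconstruction problem. A stripped-down sketch on a toy layer with synthetic calibration data (the real algorithm adds blocking, a Cholesky solve, and group-wise scales; the sizes and damping here are my own choices):

```python
import numpy as np

def naive_round(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # one scale per row
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize columns of W left to right; after each column, shift its
    rounding error onto the remaining columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    n, d = X.shape
    H = X.T @ X / n                                  # Hessian of ||XW^T - XQ^T||^2
    H += damp * np.mean(np.diag(H)) * np.eye(d)      # damping for stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = np.clip(np.round(W[:, j] / scale[:, 0]), -qmax, qmax) * scale[:, 0]
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]       # OBQ-style compensation step
        if j + 1 < d:
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))          # toy layer: 16 outputs, 32 inputs
X = rng.normal(size=(256, 32))         # calibration activations
for Q in (naive_round(W), gptq_like(W, X)):
    print(np.mean((X @ W.T - X @ Q.T) ** 2))  # reconstruction error on calibration data
```

On this toy problem the compensated version lands meaningfully below plain round-to-nearest; the rounding grid is identical, only the order and compensation differ.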
### AWQ: Activation-Aware Weight Quantization
AWQ (Activation-Aware Weight Quantization) takes a different approach: instead of treating all weights equally, it identifies the weights that have the most impact on activations and protects them. It scales those salient weights up before quantization, which preserves quality better than GPTQ in some regimes, especially at very low bit widths. AWQ has become popular because it also enables efficient W4A16 kernels (4-bit weights, 16-bit activations).
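The scaling trick is simple enough to sketch: channels with large average activation magnitude get scaled up before rounding, so they lose less precision, and the inverse scale makes the layer mathematically equivalent up to rounding error. Here the fixed `alpha` and the synthetic skewed activations are my own simplifications; the real method searches over `alpha` and folds `1/s` into the preceding layer rather than the weights:

```python
import numpy as np

def round_w(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def awq_like(W, X, bits=4, alpha=0.5):
    """Protect salient input channels: scale them up before rounding,
    then divide back out to get the effective dequantized weight."""
    s = np.abs(X).mean(axis=0) ** alpha   # per-input-channel saliency
    s /= s.mean()                         # keep scales centered around 1
    return round_w(W * s, bits) / s

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(512, 64))
X[:, :4] *= 50                            # a few channels dominate the activations
for Q in (round_w(W), awq_like(W, X)):
    print(np.mean((X @ W.T - X @ Q.T) ** 2))
```

When a handful of channels dominate the activations, as in this toy setup, the activation-aware version cuts the output error roughly in half versus plain rounding.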
### GGUF / llama.cpp: CPU-Friendly Formats
GGUF is the format used by llama.cpp and is designed for efficient inference on CPUs and Apple Silicon. It supports a range of quantization levels (Q2_K through Q8_0) and can mix different bit widths per layer. GGUF is not technically a quantization algorithm itself, but the ecosystem around it has made quantized LLM inference accessible to anyone with a laptop.
### QLoRA: Quantize for Fine-Tuning
QLoRA changed how people think about fine-tuning large models. The idea: quantize the base model to NF4 (a 4-bit NormalFloat format), load it frozen, and train only small LoRA adapters in 16-bit. You get the memory footprint of a 4-bit model with the learning capacity of 16-bit adapters. Fine-tuning a 65B model on a single 48 GB GPU became possible.
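A sketch of the forward pass makes the arrangement clear: the quantized base weight is frozen, and only the two small adapter factors would receive gradients. The codebook here is an evenly spaced stand-in (real NF4 uses 16 quantile levels of a standard normal plus per-block absmax scales), and the sizes are hypothetical:

```python
import numpy as np

def nearest_level(w, levels):
    """Round each weight to the nearest entry of a fixed 4-bit codebook."""
    scale = np.abs(w).max()
    idx = np.abs(w[..., None] / scale - levels).argmin(axis=-1)
    return levels[idx] * scale

rng = np.random.default_rng(0)
d, r = 256, 8                              # hypothetical layer width and LoRA rank
levels = np.linspace(-1, 1, 16)            # stand-in for the NF4 codebook
W_q = nearest_level(rng.normal(0, 0.02, (d, d)), levels)  # frozen 4-bit base
A = rng.normal(0, 0.01, (r, d))            # trainable LoRA factor (kept in 16-bit)
B = np.zeros((d, r))                       # B starts at zero: adapter is a no-op at init

x = rng.normal(size=d)
y = W_q @ x + B @ (A @ x)                  # QLoRA forward: quantized base + adapter
```

Because `B` is initialized to zero, the adapted model starts out identical to the quantized base and the adapters learn only the delta.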
## Quality vs. Compression: The Core Tradeoff
The table below summarizes how these methods compare on the key dimensions:
| Method | Bits | Compression | Quality Loss | Calibration | Use Case |
|---|---|---|---|---|---|
| FP16 / BF16 | 16 | 2× vs FP32 | None | None | Baseline |
| INT8 | 8 | 4× vs FP32 | Minimal | Simple | Production serving |
| GPTQ | 4 | 8× vs FP32 | Low–moderate | Layerwise | Offline inference |
| AWQ | 4 | 8× vs FP32 | Low | Activation stats | Efficient serving |
| QLoRA (NF4) | 4 | 8× vs FP32 | Low (for fine-tuning) | None | Fine-tuning |
| GGUF Q4_K_M | ~4 | ~8× vs FP32 | Low–moderate | Mixed | CPU / edge inference |
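The compression column is just arithmetic, which makes it easy to sanity-check for any model size. A back-of-the-envelope estimator (weights only; the function name is mine):

```python
def weight_memory_gb(params_billion, bits):
    """Weight memory only: parameter count times bits/8 bytes per weight.
    Activations, KV cache, and quantization scales come on top of this."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(bits, weight_memory_gb(70, bits))  # 280.0, 140.0, 70.0, 35.0 GB
```

This is where the 280 GB figure for a 70B model in float32 comes from, and why 4-bit brings the same model down to roughly 35 GB.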
## What the Gallery Shows
The LLM Quantization Gallery lets you explore these tradeoffs interactively. You can compare:
- Perplexity scores across quantization levels for different model families
- Memory footprint estimates for a given model size and bit width
- Throughput benchmarks (tokens/second) at different precisions
- Visual quality comparisons on common benchmarks (MMLU, HellaSwag, ARC)
The goal isn't to declare a winner; it's to make the tradeoffs legible so you can make an informed decision for your use case.
## Lessons from Building It
A few things I learned while putting this together:
- Perplexity is necessary but not sufficient. A model can have low perplexity on WikiText-103 and still noticeably degrade on instruction-following or multi-step reasoning. Always evaluate on your actual task.
- Hardware matters enormously. INT8 on an A100 is very different from INT8 on a consumer GPU. Some methods (especially W4A16 kernels for AWQ) require specific GPU architectures to see real speedups.
- Outlier weights are the enemy. Both GPTQ and AWQ spend a lot of effort managing the small fraction of weights with very large magnitudes. These outliers disproportionately affect quantization error.
- Mixed precision often wins. Quantizing every layer the same way is rarely optimal. Keeping the first and last few layers at higher precision, and quantizing the middle layers more aggressively, tends to preserve quality better.
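That last point can be sketched as a toy per-layer bit assignment; real schemes pick precision from per-layer sensitivity measured on calibration data, and the defaults below are illustrative:

```python
def mixed_precision_plan(n_layers, high_bits=8, low_bits=4, protect=2):
    """Keep the first and last `protect` blocks at higher precision
    and quantize the middle blocks harder."""
    return [
        high_bits if i < protect or i >= n_layers - protect else low_bits
        for i in range(n_layers)
    ]

print(mixed_precision_plan(8))  # [8, 8, 4, 4, 4, 4, 8, 8]
```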
## What's Next
I'm actively working on RL-based approaches to quantization at Dell, using reinforcement learning to decide which layers to quantize and how aggressively, rather than applying the same scheme everywhere. Early results show 2.6× compression with significantly lower perplexity degradation than fixed-precision approaches. More on that when the paper is ready.
Try the gallery: arpitsinghgautam.me/llm-quantization-gallery
Questions or feedback? @Asg_Wolverine on Twitter or email me.