Large language models are powerful, but they come with a problem that's easy to understate: they're enormous. A 70B-parameter model in full float32 precision takes roughly 280 GB of memory, far more than fits on any single GPU. Even in bfloat16, that's still 140 GB. Running these models at any kind of scale requires us to get creative.
Quantization is one of the most practical solutions we have. The idea is simple: instead of storing each weight as a 32-bit float, we compress it down to 8, 4, or even fewer bits. Done carefully, you can compress a model by 2–8× with only a small hit to output quality.
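To make the idea concrete, here is a minimal sketch of the simplest scheme, symmetric absmax quantization to 8 bits. The function names and the single per-tensor scale are illustrative choices, not any particular library's API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric absmax quantization: map floats onto [-127, 127] integers."""
    scale = np.abs(w).max() / 127.0           # one scale for the whole tensor
    q = np.round(w / scale).astype(np.int8)   # stored weights: 1 byte each
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max rounding error:", np.abs(w - w_hat).max())  # at most about scale / 2
```

Everything that follows is, at heart, a variation on this round trip: choosing better scales, better grids, and deciding which weights deserve more precision.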
I built the LLM Quantization Gallery as an interactive way to explore these techniques, comparing methods, quality metrics, and tradeoffs side by side. Here's the thinking behind it.
## Why Quantization Matters
At Dell's CSG CTO Lab, a significant chunk of my work involves making LLM inference faster and cheaper. We deal with questions like: can we run a 70B model on a single node? Can we cut latency by 2× without retraining? Quantization is almost always part of the answer.
But quantization isn't a single thing; it's a family of techniques with very different characteristics. The difference between INT8 and INT4 isn't just a number: it involves different accuracy tradeoffs, different hardware support, and sometimes entirely different calibration pipelines. I found myself constantly wanting a single place to compare them visually. So I made one.
## The Main Methods, Explained
### INT8: The Safe Default
Integer 8-bit quantization is the most widely supported and least risky option. Most modern GPUs handle INT8 natively with dedicated Tensor Core support. You typically get a ~2× memory reduction versus FP16 with minimal perplexity loss (under 0.5 points on most benchmarks). It's the right choice when you want a quick win without touching training.
### INT4: High Compression, More Risk
Going to 4 bits gives you ~4× compression versus FP16, but the quality gap widens. The key insight from papers like GPTQ and AWQ is that naive rounding doesn't work at 4 bits; you need smarter calibration to preserve the weights that matter most.
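You can see why naive rounding degrades in a few lines. With only 15 levels in a symmetric 4-bit grid, a single outlier weight stretches the scale and most of the grid goes unused (the synthetic weights and the injected outlier below are illustrative, not from any real model):

```python
import numpy as np

def naive_quantize(w, bits):
    """Round-to-nearest with a single symmetric absmax scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 at 8 bits, only 7 at 4 bits
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale    # dequantized approximation

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000).astype(np.float32)
w[0] = 0.5                                # a single outlier stretches the scale
err8 = np.mean((w - naive_quantize(w, 8)) ** 2)
err4 = np.mean((w - naive_quantize(w, 4)) ** 2)
print(err8, err4)                          # 4-bit MSE is orders of magnitude larger
```

This gap is exactly what calibration-based methods like GPTQ and AWQ are designed to close.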
### GPTQ: Post-Training, Layer by Layer
GPTQ (short for Generative Pre-trained Transformer Quantization) quantizes weights one layer at a time, using a small calibration dataset and approximate second-order information to minimize each layer's reconstruction error. It's offline and one-shot, with no retraining required. The resulting model runs with INT4 weights but often achieves near-INT8 quality. This is what powers many of the quantized Llama and Mistral variants you see in the wild.
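The heart of the method is an error-compensation loop: quantize one column of the weight matrix, then nudge the not-yet-quantized columns to cancel the damage, weighted by the inverse Hessian of the layer's reconstruction problem. A stripped-down sketch on a toy layer with synthetic calibration data (the real algorithm adds blocking, a Cholesky solve, and group-wise scales; the sizes and damping here are my own choices):

```python
import numpy as np

def naive_round(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # one scale per row
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize columns of W left to right; after each column, shift its
    rounding error onto the remaining columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    n, d = X.shape
    H = X.T @ X / n                                  # Hessian of ||XW^T - XQ^T||^2
    H += damp * np.mean(np.diag(H)) * np.eye(d)      # damping for stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    Q = np.zeros_like(W)
    for j in range(d):
        Q[:, j] = np.clip(np.round(W[:, j] / scale[:, 0]), -qmax, qmax) * scale[:, 0]
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]       # OBQ-style compensation step
        if j + 1 < d:
            W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))          # toy layer: 16 outputs, 32 inputs
X = rng.normal(size=(256, 32))         # calibration activations
for Q in (naive_round(W), gptq_like(W, X)):
    print(np.mean((X @ W.T - X @ Q.T) ** 2))  # reconstruction error on calibration data
```

On this toy problem the compensated version lands meaningfully below plain round-to-nearest; the rounding grid is identical, only the order and compensation differ.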
### AWQ: Activation-Aware Weight Quantization
AWQ (Activation-Aware Weight Quantization) takes a different approach: instead of treating all weights equally, it identifies the weights that have the most impact on activations and protects them. It scales those salient weights up before quantization, which preserves quality better than GPTQ in some regimes, especially at very low bit widths. AWQ has become popular because it also enables efficient W4A16 kernels (4-bit weights, 16-bit activations).
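The scaling trick is simple enough to sketch: channels with large average activation magnitude get scaled up before rounding, so they lose less precision, and the inverse scale makes the layer mathematically equivalent up to rounding error. Here the fixed `alpha` and the synthetic skewed activations are my own simplifications; the real method searches over `alpha` and folds `1/s` into the preceding layer rather than the weights:

```python
import numpy as np

def round_w(W, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def awq_like(W, X, bits=4, alpha=0.5):
    """Protect salient input channels: scale them up before rounding,
    then divide back out to get the effective dequantized weight."""
    s = np.abs(X).mean(axis=0) ** alpha   # per-input-channel saliency
    s /= s.mean()                         # keep scales centered around 1
    return round_w(W * s, bits) / s

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(512, 64))
X[:, :4] *= 50                            # a few channels dominate the activations
for Q in (round_w(W), awq_like(W, X)):
    print(np.mean((X @ W.T - X @ Q.T) ** 2))
```

When a handful of channels dominate the activations, as in this toy setup, the activation-aware version cuts the output error roughly in half versus plain rounding.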
### GGUF / llama.cpp: CPU-Friendly Formats
GGUF is the format used by llama.cpp and is designed for efficient inference on CPUs and Apple Silicon. It supports a range of quantization levels (Q2_K through Q8_0) and can mix different bit widths per layer. GGUF is not technically a quantization algorithm itself, but the ecosystem around it has made quantized LLM inference accessible to anyone with a laptop.
### QLoRA: Quantize for Fine-Tuning
QLoRA changed how people think about fine-tuning large models. The idea: quantize the base model to NF4 (a 4-bit NormalFloat format), load it frozen, and train only small LoRA adapters in 16-bit. You get the memory footprint of a 4-bit model with the learning capacity of 16-bit adapters. Fine-tuning a 65B model on a single 48 GB GPU became possible.
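A sketch of the forward pass makes the arrangement clear: the quantized base weight is frozen, and only the two small adapter factors would receive gradients. The codebook here is an evenly spaced stand-in (real NF4 uses 16 quantile levels of a standard normal plus per-block absmax scales), and the sizes are hypothetical:

```python
import numpy as np

def nearest_level(w, levels):
    """Round each weight to the nearest entry of a fixed 4-bit codebook."""
    scale = np.abs(w).max()
    idx = np.abs(w[..., None] / scale - levels).argmin(axis=-1)
    return levels[idx] * scale

rng = np.random.default_rng(0)
d, r = 256, 8                              # hypothetical layer width and LoRA rank
levels = np.linspace(-1, 1, 16)            # stand-in for the NF4 codebook
W_q = nearest_level(rng.normal(0, 0.02, (d, d)), levels)  # frozen 4-bit base
A = rng.normal(0, 0.01, (r, d))            # trainable LoRA factor (kept in 16-bit)
B = np.zeros((d, r))                       # B starts at zero: adapter is a no-op at init

x = rng.normal(size=d)
y = W_q @ x + B @ (A @ x)                  # QLoRA forward: quantized base + adapter
```

Because `B` is initialized to zero, the adapted model starts out identical to the quantized base and the adapters learn only the delta.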
## Quality vs. Compression: The Core Tradeoff
The table below summarizes how these methods compare on the key dimensions:
| Method | Bits | Compression | Quality Loss | Calibration | Use Case |
|---|---|---|---|---|---|
| FP16 / BF16 | 16 | 2× vs FP32 | None | None | Baseline |
| INT8 | 8 | 4× vs FP32 | Minimal | Simple | Production serving |
| GPTQ | 4 | 8× vs FP32 | Low–moderate | Layerwise | Offline inference |
| AWQ | 4 | 8× vs FP32 | Low | Activation stats | Efficient serving |
| QLoRA (NF4) | 4 | 8× vs FP32 | Low (for fine-tuning) | None | Fine-tuning |
| GGUF Q4_K_M | ~4 | ~8× vs FP32 | Low–moderate | Mixed | CPU / edge inference |
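The compression column is just arithmetic, which makes it easy to sanity-check for any model size. A back-of-the-envelope estimator (weights only; the function name is mine):

```python
def weight_memory_gb(params_billion, bits):
    """Weight memory only: parameter count times bits/8 bytes per weight.
    Activations, KV cache, and quantization scales come on top of this."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(bits, weight_memory_gb(70, bits))  # 280.0, 140.0, 70.0, 35.0 GB
```

This is where the 280 GB figure for a 70B model in float32 comes from, and why 4-bit brings the same model down to roughly 35 GB.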
## What the Gallery Shows
The LLM Quantization Gallery lets you explore these tradeoffs interactively. You can compare:
- Perplexity scores across quantization levels for different model families
- Memory footprint estimates for a given model size and bit width
- Throughput benchmarks (tokens/second) at different precisions
- Visual quality comparisons on common benchmarks (MMLU, HellaSwag, ARC)
The goal isn't to declare a winner; it's to make the tradeoffs legible so you can make an informed decision for your use case.
## Lessons from Building It
A few things I learned while putting this together:
- Perplexity is necessary but not sufficient. A model can have low perplexity on WikiText-103 and still noticeably degrade on instruction-following or multi-step reasoning. Always evaluate on your actual task.
- Hardware matters enormously. INT8 on an A100 is very different from INT8 on a consumer GPU. Some methods (especially W4A16 kernels for AWQ) require specific GPU architectures to see real speedups.
- Outlier weights are the enemy. Both GPTQ and AWQ spend a lot of effort managing the small fraction of weights with very large magnitudes. These outliers disproportionately affect quantization error.
- Mixed precision often wins. Quantizing every layer the same way is rarely optimal. Keeping the first and last few layers at higher precision, and quantizing the middle layers more aggressively, tends to preserve quality better.
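That last point can be sketched as a toy per-layer bit assignment; real schemes pick precision from per-layer sensitivity measured on calibration data, and the defaults below are illustrative:

```python
def mixed_precision_plan(n_layers, high_bits=8, low_bits=4, protect=2):
    """Keep the first and last `protect` blocks at higher precision
    and quantize the middle blocks harder."""
    return [
        high_bits if i < protect or i >= n_layers - protect else low_bits
        for i in range(n_layers)
    ]

print(mixed_precision_plan(8))  # [8, 8, 4, 4, 4, 4, 8, 8]
```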
## What's Next
I'm actively working on RL-based approaches to quantization at Dell, using reinforcement learning to decide which layers to quantize and how aggressively, rather than applying the same scheme everywhere. Early results show 2.6× compression with significantly lower perplexity degradation than fixed-precision approaches. More on that when the paper is ready.
Try the gallery: arpitsinghgautam.me/llm-quantization-gallery
Questions or feedback? @Asg_Wolverine on Twitter or email me.