LLM Quantization Gallery: Making Sense of How We Shrink Large Language Models

Large language models are powerful, but they come with a problem that's easy to understate: they're enormous. A 70B-parameter model in full float32 precision takes roughly 280 GB of memory, far more than any single GPU can hold. Even in bfloat16, that's still 140 GB. Running these models at any kind of scale requires us to get creative.

Quantization is one of the most practical solutions we have. The idea is simple: instead of storing each weight as a 32-bit float, we compress it down to 8, 4, or even fewer bits. Done carefully, you can compress a model by 2–8× with only a small hit to output quality.
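To make those numbers concrete, here's the back-of-envelope arithmetic (weights only, ignoring KV cache and runtime overhead; the function name is mine):

```python
# Weight memory for a 70B-parameter model at different precisions.
PARAMS = 70e9

def weight_memory_gb(bits_per_weight: float) -> float:
    """bytes = params * bits / 8; reported in GB (10^9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

# FP32: 280 GB, BF16: 140 GB, INT8: 70 GB, INT4: 35 GB
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>4}: {weight_memory_gb(bits):.0f} GB")
```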

I built the LLM Quantization Gallery as an interactive way to explore these techniques, comparing methods, quality metrics, and tradeoffs side by side. Here's the thinking behind it.

Why Quantization Matters

At Dell's CSG CTO Lab, a significant chunk of my work involves making LLM inference faster and cheaper. We deal with questions like: can we run a 70B model on a single node? Can we cut latency by 2× without retraining? Quantization is almost always part of the answer.

But quantization isn't a single thing; it's a family of techniques with very different characteristics. The difference between INT8 and INT4 isn't just a number: it involves different accuracy tradeoffs, different hardware support, and sometimes entirely different calibration pipelines. I found myself constantly wanting a single place to compare them visually. So I made one.

The Main Methods, Explained

INT8: The Safe Default

Integer 8-bit quantization is the most widely supported and least risky option. Most modern GPUs handle INT8 natively with dedicated Tensor Core support. You typically get a ~2× memory reduction versus FP16 with minimal perplexity loss (under 0.5 points on most benchmarks). It's the right choice when you want a quick win without touching training.
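For intuition, here is a minimal sketch of symmetric (absmax) INT8 quantization, the scheme most weight-only INT8 paths build on. Function names are illustrative, not any particular library's API:

```python
def quantize_int8(weights):
    """Symmetric absmax: map the largest magnitude to +/-127."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.42, -1.27, 0.08, 0.95]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, err)   # worst-case rounding error is bounded by scale / 2
```

With 255 usable levels the grid is fine enough that most weight distributions survive nearly intact, which is why INT8 rarely needs more than this.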

INT4: High Compression, More Risk

Going to 4 bits gives you ~4× compression versus FP16, but the quality gap widens. The key insight from papers like GPTQ and AWQ is that naive rounding doesn't work at 4 bits; you need smarter calibration to preserve the weights that matter most.
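A toy example of why: with one shared scale, a single outlier weight swallows all the precision, which is exactly what group-wise scales (typically groups of 32 to 128 weights in real schemes) fix. Everything here is illustrative:

```python
def quantize(weights, bits=4):
    """Round to a symmetric grid scaled by the group's absmax."""
    levels = 2 ** (bits - 1) - 1          # 7 levels each side for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_err(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

w = [0.01, -0.02, 0.015, 0.03, 8.0]       # small weights plus one outlier

# One shared scale: the outlier forces scale = 8/7, so every small
# weight rounds to zero.
naive = quantize(w)

# Group-wise: separate scales for the small weights and the outlier.
grouped = quantize(w[:4]) + quantize(w[4:])

print(mean_abs_err(w, naive), mean_abs_err(w, grouped))
```

The grouped version cuts the mean error by an order of magnitude on this toy tensor, at the cost of storing one extra scale per group.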

GPTQ: Post-Training, Layer by Layer

GPTQ (named for post-training quantization of GPT-style models) quantizes weights one layer at a time, using a small calibration dataset to minimize reconstruction error. It's offline and one-shot: no retraining required. The resulting model runs with INT4 weights but often achieves near-INT8 quality. This is what powers many of the quantized Llama and Mistral variants you see in the wild.
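The mechanism can be sketched in a heavily simplified form: quantize weights sequentially and fold each rounding error into the weights not yet quantized, so later weights compensate. Real GPTQ weights this compensation with second-order (inverse-Hessian) information from the calibration data; this pure-Python toy just passes the raw error forward:

```python
def quantize_with_compensation(weights, scale):
    """Quantize to 4-bit ints, folding each rounding error forward."""
    q, carry = [], 0.0
    for w in weights:
        target = w + carry                          # include error so far
        v = max(-8, min(7, round(target / scale)))  # signed 4-bit range
        q.append(v)
        carry = target - v * scale                  # error to pass forward
    return q

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, 0.12, 0.12, 0.12]
scale = 0.05
q_comp = quantize_with_compensation(w, scale)
q_naive = [round(x / scale) for x in w]

# Naive rounding maps every weight to 2 (sum drifts to 0.40 vs 0.48);
# compensation alternates 2 and 3, keeping the running sum on track.
print(q_naive, sum(dequantize(q_naive, scale)))
print(q_comp, sum(dequantize(q_comp, scale)))
```

Each individual weight is still coarsely rounded, but the errors cancel across the layer instead of accumulating, which is the core of why GPTQ preserves quality at 4 bits.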

AWQ: Activation-Aware Weight Quantization

AWQ (Activation-Aware Weight Quantization) takes a different approach: instead of treating all weights equally, it identifies the weights that have the most impact on activations and protects them. It scales those salient weights before quantization, which preserves quality better than GPTQ in some regimes, especially at very low bit widths. AWQ has become popular because it also enables efficient W4A16 kernels (4-bit weights, 16-bit activations).
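A toy single-neuron sketch of the trick (all numbers and names are mine, not AWQ's actual kernels): boost a salient channel's weight before quantization and fold the inverse scale into its activation, so the product is mathematically unchanged but the weight survives rounding:

```python
def quantize(weights, bits=4):
    """Symmetric absmax rounding to a 4-bit grid."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

x = [30.0, 0.5, 0.4, 0.2]        # channel 0: large activation...
w = [0.011, 0.82, -0.77, 0.54]   # ...but a small weight

y_ref = dot(x, w)

# Plain 4-bit: w[0] is tiny next to 0.82, rounds to zero, and the big
# activation amplifies that error.
y_plain = dot(x, quantize(w))

# AWQ-style: scale the salient weight up by s and fold 1/s into its
# activation, so x[0] * w[0] is unchanged before rounding.
s = 20.0
w_scaled = [w[0] * s] + w[1:]
x_scaled = [x[0] / s] + x[1:]
y_awq = dot(x_scaled, quantize(w_scaled))

print(abs(y_plain - y_ref), abs(y_awq - y_ref))
```

Note the choice of s keeps the boosted weight (0.22) below the tensor's absmax (0.82); pushing it higher would stretch the quantization grid itself and hurt the other channels, which is the balance AWQ's calibration searches for.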

GGUF / llama.cpp: CPU-Friendly Formats

GGUF is the format used by llama.cpp and is designed for efficient inference on CPUs and Apple Silicon. It supports a range of quantization levels (Q2_K through Q8_0) and can mix different bit widths across layers. It isn't technically a quantization algorithm itself, but the ecosystem around it has made quantized LLM inference accessible to anyone with a laptop.

QLoRA: Quantize for Fine-Tuning

QLoRA changed how people think about fine-tuning large models. The idea: quantize the base model to NF4 (a 4-bit normal float format), load it frozen, and train only small LoRA adapters in 16-bit. You get the memory footprint of a 4-bit model with the learning capacity of 16-bit adapters. Fine-tuning a 65B model on a single 48GB GPU became possible.
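The NF4 idea can be sketched with the standard library alone: a 16-entry codebook shaped like a normal distribution plus a per-block absmax scale. This is illustrative only; the codebook below is computed from normal quantiles rather than the exact NF4 table, and all names are mine:

```python
import math, random

def normal_ppf(p):
    """Inverse CDF of N(0,1) via bisection on math.erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 16 levels at evenly spaced quantiles of a standard normal, rescaled
# to [-1, 1]: dense near zero, where most weights live.
_raw = [normal_ppf((i + 0.5) / 16) for i in range(16)]
CODEBOOK = [v / _raw[-1] for v in _raw]

def quantize_nf4(weights, block=64):
    """Blockwise absmax scaling + nearest-codebook lookup (4 bits/weight)."""
    blocks = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        absmax = max(abs(v) for v in chunk)
        idx = []
        for v in chunk:
            x = v / absmax  # normalize into [-1, 1]
            idx.append(min(range(16), key=lambda i: abs(CODEBOOK[i] - x)))
        blocks.append((idx, absmax))
    return blocks

def dequantize_nf4(blocks):
    return [CODEBOOK[i] * absmax for idx, absmax in blocks for i in idx]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(256)]   # LLM-ish weight dist
w_hat = dequantize_nf4(quantize_nf4(w))
rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_hat)) / len(w))
print(f"4-bit NF4-style RMS error: {rms:.5f} vs weight std 0.02")
```

Because the levels follow the weight distribution rather than a uniform grid, the frozen base model stays accurate enough for the 16-bit LoRA adapters to learn on top of it.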

Quality vs. Compression: The Core Tradeoff

The table below summarizes how these methods compare on the key dimensions:

| Method | Bits | Compression | Quality Loss | Calibration | Use Case |
|---|---|---|---|---|---|
| FP16 / BF16 | 16 | 2× vs FP32 | None | None | Baseline |
| INT8 | 8 | 4× vs FP32 | Minimal | Simple | Production serving |
| GPTQ | 4 | 8× vs FP32 | Low–moderate | Layerwise | Offline inference |
| AWQ | 4 | 8× vs FP32 | Low | Activation stats | Efficient serving |
| QLoRA (NF4) | 4 | 8× vs FP32 | Low (for fine-tuning) | None | Fine-tuning |
| GGUF Q4_K_M | ~4 | ~8× vs FP32 | Low–moderate | Mixed | CPU / edge inference |

What the Gallery Shows

The LLM Quantization Gallery lets you explore these tradeoffs interactively, putting the methods side by side on compression, quality loss, calibration cost, and target hardware.

The goal isn't to declare a winner; it's to make the tradeoffs legible so you can make an informed decision for your use case.

Lessons from Building It

I learned a few things while putting this together.

What's Next

I'm actively working on RL-based approaches to quantization at Dell, using reinforcement learning to decide which layers to quantize and how aggressively, rather than applying the same scheme everywhere. Early results show 2.6× compression with significantly lower perplexity degradation than fixed-precision approaches. More on that when the paper is ready.

Try the gallery: arpitsinghgautam.me/llm-quantization-gallery
Questions or feedback? @Asg_Wolverine on Twitter or email me.