The LLM Quantization Gallery is an interactive tool for exploring and comparing quantization techniques for large language models. It provides a hands-on way to understand how different quantization methods affect model quality, size, and inference efficiency.
Quantization is one of the most impactful techniques for deploying LLMs efficiently: storing weights at lower bit precision shrinks the memory footprint and can speed up inference, usually at some cost in model quality. This gallery makes those trade-offs concrete and navigable.
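To make "representing weights at lower bit precisions" concrete, here is a minimal sketch of symmetric round-to-nearest quantization, the simplest baseline scheme. It is not taken from the gallery or from any particular library: the function names and the sample weights are hypothetical, and methods such as GPTQ or AWQ use more sophisticated procedures than this.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integer codes using one shared scale (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


# Hypothetical weight values, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.27, 0.04]

for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max abs error={max_err:.4f}")
```

With fewer bits there are fewer integer codes to choose from, so the rounding error per weight grows. That error is the source of the quality loss that quantization methods try to minimize.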
What It Covers
- Comparison of quantization methods and formats (INT8, INT4, GPTQ, AWQ, GGUF, and more)
- Quality metrics (perplexity, benchmark scores) across bit widths
- Model size and memory footprint comparisons
- Inference speed benchmarks across hardware configurations
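As a rough illustration of two quantities in the list above, the sketch below estimates weight memory from parameter count and bit width, and computes perplexity from per-token log-probabilities. The 7-billion-parameter count and the log-probability values are hypothetical. The memory estimate covers weights only and ignores quantization metadata (scales, zero points), activations, and the KV cache, so real footprints will be somewhat larger.

```python
import math


def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 2**30


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


num_params = 7_000_000_000  # hypothetical model size
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GiB")

# Hypothetical per-token log-probabilities assigned by a model to a short text.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.125, 0.25)]
print(f"perplexity: {perplexity(log_probs):.3f}")
```

Lower perplexity means the model assigns higher probability to the evaluation text, so the perplexity gap between a quantized model and its full-precision original is a common first measure of quality loss.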
Background
This project grew out of my work at Dell Technologies (CSG CTO Lab), where I developed a reinforcement-learning (RL) based framework for post-training quantization (PTQ) of LLMs, achieving 2.6× compression over baseline methods with minimal perplexity loss. The gallery is a public-facing complement to that research, making quantization trade-offs accessible to practitioners.
Open the Gallery