Quantization

Quantization is the practice of storing a model’s internal numbers (its weights) using fewer bits of precision. A model is normally trained with high-precision numbers (typically 16 or 32 bits each), but those can often be rounded down to 8, 4, or even fewer bits with surprisingly little loss of quality. The payoff is large: a smaller memory footprint, lower cost, and faster inference, sometimes enough to run a model that previously needed a data-center GPU on a single consumer machine.
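To make the memory saving concrete, here is a back-of-the-envelope sketch. It counts the weights only; real deployments also need memory for activations, caches, and the small amount of scaling metadata that quantized formats carry. The 7-billion-parameter model and the function name `weight_memory_gb` are illustrative assumptions, not figures from any particular product.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model stored at several precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7_000_000_000, bits):.1f} GB")
```

Under these assumptions the weights take about 14 GB at 16 bits and about 3.5 GB at 4 bits, which is the difference between needing a large GPU and fitting in the memory of an ordinary laptop.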

Two papers are foundational. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” by Tim Dettmers and colleagues (2022) showed how to cut inference memory roughly in half while “retaining full precision performance.” The trick is to handle most values in 8-bit while keeping a small number of unusually high-magnitude “outlier” values in 16-bit precision so that accuracy does not collapse. “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers” by Elias Frantar and colleagues (2022) went further, quantizing 175-billion-parameter models to 3 or 4 bits per weight with “negligible accuracy degradation,” which makes it possible to run inference with such a model on a single GPU.
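The outlier idea can be illustrated with a toy example. This is a simplified sketch and not the paper's implementation: it works on a single dot product instead of full matrices, uses ordinary Python floats to stand in for the higher-precision path, and the function names and the `threshold` default are assumptions chosen for illustration.

```python
def quantize_absmax_int8(values):
    """Scale values so the largest magnitude maps to 127, then round to integers."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def int8_dot(x, w):
    """Dot product computed on 8-bit integer codes, then rescaled to real units."""
    xq, x_scale = quantize_absmax_int8(x)
    wq, w_scale = quantize_absmax_int8(w)
    return sum(a * b for a, b in zip(xq, wq)) * x_scale * w_scale


def mixed_dot(x, w, threshold=6.0):
    """Keep large-magnitude inputs in full precision; send the rest through int8."""
    is_outlier = [abs(v) >= threshold for v in x]
    # Outlier positions: multiplied without any rounding.
    high = sum(a * b for a, b, o in zip(x, w, is_outlier) if o)
    # Remaining positions: quantized together, with a scale no outlier distorts.
    x_rest = [a for a, o in zip(x, is_outlier) if not o]
    w_rest = [b for b, o in zip(w, is_outlier) if not o]
    return high + int8_dot(x_rest, w_rest)


# One input (40.0) is far larger than the others and dominates the scale.
x = [0.1, -0.2, 0.3, 40.0, 0.05, -0.15]
w = [0.5, -0.4, 0.3, 0.01, 0.2, -0.6]
exact = sum(a * b for a, b in zip(x, w))

print("exact       :", exact)
print("all int8    :", int8_dot(x, w))
print("mixed (int8 + outlier in full precision):", mixed_dot(x, w))
```

In the all-int8 path the single large value stretches the scale so much that the small inputs round to zero or nearly zero. Pulling that one value out lets the remaining values use the full 8-bit range, so the mixed result lands much closer to the exact answer.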

A key distinction is when quantization happens. Post-training quantization, like GPTQ, compresses a model after it has already been trained; this is the fast, common path. Other methods quantize during or alongside training to claw back more accuracy. Either way, the goal is the same: shrink the model for deployment without retraining it from scratch.
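The simplest form of post-training quantization is plain round-to-nearest, sketched below. This is not GPTQ, which uses a more sophisticated procedure to compensate for rounding error; it is only a minimal baseline that shows the basic mechanics. The weight values and the function names `quantize_rtn` and `dequantize` are made up for illustration.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization to signed codes (bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest code and clamp to the allowed range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


# Hypothetical weights from an already trained model.
weights = [0.12, -0.5, 0.33, 0.07, -0.21]
for bits in (8, 4, 2):
    codes, scale = quantize_rtn(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: worst-case rounding error {worst:.4f}")
```

Each restored weight is off by at most half a step (`scale / 2`), and the step grows as the bit width shrinks, which is why very low bit widths need more careful methods to stay accurate.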

Why business readers should care: quantization is the main lever for running capable models cheaply, especially on your own hardware or at the edge. It is a major reason open-weight models can run on a laptop. The trade-off is that aggressive quantization can subtly degrade quality, so the right setting depends on how much accuracy your use case can spare.
