GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

GPTQ, introduced in late 2022 by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh, is a method for shrinking the memory footprint of large language models after they have already been trained. Rather than retraining, it performs one-shot post-training quantization: each weight is reduced from 16 bits to just 3 or 4 bits, and approximate second-order information is used to adjust the weights that have not yet been quantized so that they compensate for the rounding error and preserve accuracy.
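To make the idea of compensating for rounding error concrete, here is a minimal NumPy sketch for a single linear layer. It is a simplified illustration under stated assumptions, not the authors' implementation: the function names (make_grid, quantize, rtn, gptq_like), the per-row min/max grid, and the damping value are choices made for this example, and it leaves out the blocking, grouping, and other optimizations a practical implementation relies on. It quantizes one column of weights at a time and spreads the resulting error over the remaining columns using a Cholesky factor of the inverse Hessian H = 2XX^T.

```python
import numpy as np


def make_grid(W, bits=4):
    """Per-row min/max grid (an assumption of this sketch)."""
    max_level = 2**bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / max_level
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    zero = np.round(-w_min / scale)
    return scale, zero, max_level


def quantize(w, scale, zero, max_level):
    """Round to the nearest grid point, then map back to floats."""
    q = np.clip(np.round(w / scale) + zero, 0, max_level)
    return scale * (q - zero)


def rtn(W, bits=4):
    """Baseline: plain round-to-nearest with no error compensation."""
    scale, zero, max_level = make_grid(W, bits)
    return quantize(W, scale, zero, max_level)


def gptq_like(W, X, bits=4, damp=0.01):
    """Column-by-column quantization with second-order error compensation.

    W: (rows, cols) layer weights. X: (cols, samples) calibration inputs.
    """
    W = W.astype(np.float64).copy()
    scale, zero, max_level = make_grid(W, bits)

    # Hessian of the layer-wise squared error, damped for numerical stability.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])

    # Upper Cholesky factor U of the inverse Hessian: inv(H) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        w = W[:, j:j + 1]
        q = quantize(w, scale, zero, max_level)
        Q[:, j:j + 1] = q
        # Push this column's rounding error onto the not-yet-quantized columns.
        err = (w - q) / U[j, j]
        W[:, j + 1:] -= err @ U[j:j + 1, j + 1:]
    return Q


# Toy comparison on random data with correlated input features.
rng = np.random.default_rng(0)
W_demo = rng.standard_normal((8, 16))
X_demo = rng.standard_normal((16, 256)) + rng.standard_normal((1, 256))
for name, Q_demo in (("round-to-nearest", rtn(W_demo)),
                     ("compensated", gptq_like(W_demo, X_demo))):
    out_err = np.linalg.norm(W_demo @ X_demo - Q_demo @ X_demo)
    print(f"{name}: layer output error = {out_err:.3f}")
```

The quantity printed at the end is the error in the layer's output on the calibration inputs, which is what the method tries to keep small, rather than the error in the weights themselves. How much the compensated variant helps on this toy data depends on how strongly the input features are correlated.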

The practical payoff is large. The authors show that GPTQ can quantize a 175-billion-parameter model in roughly four GPU hours, and that the resulting compressed model can run generative inference on a single high-end GPU rather than a multi-GPU cluster. They report end-to-end inference speedups over the 16-bit baseline of around 3.25 times on high-end GPUs (NVIDIA A100) and 4.5 times on more cost-effective ones (NVIDIA A6000). Because token-by-token generation with a large model is bottlenecked by how fast weights can be read from memory, cutting weights from 16 bits to 4 bits means far fewer bytes to move per token, which translates into faster and cheaper serving.
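The memory savings follow from simple arithmetic, sketched below. The helper weight_gigabytes is a hypothetical name used only for this illustration, and the estimate counts weight storage alone: it ignores the small overhead of quantization scales and zero points, any layers kept in 16-bit precision, and the memory needed for activations during inference.

```python
def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 175e9  # the 175-billion-parameter model size discussed above

for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights take 350 GB, 4-bit weights take 87.5 GB, and 3-bit weights take about 65.6 GB. Moving from 16 to 4 bits cuts the bytes read per token by a factor of four, so a factor of four is roughly the ceiling on the speedup from reduced memory traffic alone.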

GPTQ became one of the most widely used quantization techniques in the open-weights ecosystem, and its file formats sit alongside others like GGUF in tools that let people run capable models on consumer hardware. For a business reader, this paper is one of the key reasons a model that once needed a rack of GPUs can now run on a single card, and smaller models can run even on a laptop, dramatically lowering the cost of deploying AI.
