LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

“LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” was submitted to arXiv on August 15, 2022 by Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, and presented at NeurIPS 2022. It showed how to run the large matrix multiplications inside transformer feed-forward and attention layers using 8-bit integers instead of 16-bit floats, cutting the memory needed to load a model roughly in half while preserving full-precision accuracy.
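The "roughly in half" figure follows from simple arithmetic on bytes per weight. The sketch below is a back-of-the-envelope illustration only: the function name and the 175-billion parameter count are assumptions chosen for the example, and it ignores activations, the small share of weights kept in 16-bit, and any framework overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Hypothetical model size, used only for illustration.
params = 175e9

fp16_gb = weight_memory_gb(params, 2)  # 16-bit floats: 2 bytes per weight
int8_gb = weight_memory_gb(params, 1)  # 8-bit integers: 1 byte per weight

print(f"fp16: {fp16_gb:.0f} GB, int8: {int8_gb:.0f} GB")
```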

The key problem the paper addresses is outlier features. In large models, a small number of activation feature dimensions take on very large magnitudes, and naively quantizing them to 8 bits wrecks accuracy. LLM.int8() combines vector-wise quantization with mixed-precision decomposition: it keeps those rare outlier dimensions in 16-bit precision and multiplies the remaining values, more than 99.9 percent of them, in 8-bit, getting the memory savings without the accuracy hit. The authors open-sourced their implementation in the bitsandbytes library, which became widely used.
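The decomposition can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the bitsandbytes implementation: the function names are hypothetical, the default threshold is an illustrative choice, and the integer matrix multiply is simulated on the CPU with int32 accumulation rather than run on int8 GPU kernels.

```python
import numpy as np

def absmax_quantize_rows(x):
    """Quantize each row of x to int8 using that row's own absmax scale."""
    # initial=0.0 keeps the reduction valid when x has zero columns.
    scale = np.abs(x).max(axis=1, keepdims=True, initial=0.0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul_with_outliers(X, W, threshold=6.0):
    """Approximate X @ W with outlier feature dimensions in float16 and the rest in int8.

    `threshold` is an assumed, illustrative cutoff on activation magnitude.
    """
    # A feature dimension (column of X) is an outlier if any entry is large.
    outlier = (np.abs(X) >= threshold).any(axis=0)
    regular = ~outlier

    # 16-bit path: only the few outlier columns of X and matching rows of W.
    out_hi = X[:, outlier].astype(np.float16) @ W[outlier, :].astype(np.float16)

    # 8-bit path: one scale per row of X and one per column of W.
    Xq, sx = absmax_quantize_rows(X[:, regular])
    Wq_t, sw = absmax_quantize_rows(W[regular, :].T)
    acc = Xq.astype(np.int32) @ Wq_t.T.astype(np.int32)  # integer accumulation
    out_lo = acc * sx * sw.T  # dequantize with the outer product of the scales

    return out_lo + out_hi.astype(np.float32)

# Usage: random activations with one artificially inflated feature dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64)).astype(np.float32)
X[:, 5] *= 40.0
W = rng.normal(size=(64, 16)).astype(np.float32)

approx = int8_matmul_with_outliers(X, W)
rel_err = np.abs(approx - X @ W).max() / np.abs(X @ W).max()
print(f"max error relative to largest output: {rel_err:.4f}")
```

Splitting by column works because a matrix product is a sum over feature dimensions, so the outlier and regular parts can be computed separately and added. Keeping the outlier columns out of the 8-bit path also stops them from inflating the per-row scales, which would otherwise crush the ordinary values into a handful of integer levels.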

bitsandbytes turned into core plumbing for the open-weight ecosystem, and the later QLoRA work by Dettmers and colleagues built directly on it to make 4-bit fine-tuning practical. For practitioners, this paper is a large part of why a model that would not otherwise fit in a given GPU’s memory can be loaded at all.

