QLoRA: Efficient Finetuning of Quantized LLMs

“QLoRA: Efficient Finetuning of Quantized LLMs” was submitted to arXiv on May 23, 2023 by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer of the University of Washington. It combines two ideas: quantizing a large pretrained model to 4-bit precision and keeping it frozen, and training only small low-rank adapters on top of it (the LoRA technique). Gradients are backpropagated through the frozen 4-bit weights into the adapters, so the bulky base weights never need to be updated in full precision.
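The combination described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the class name `QuantizedLinearWithLoRA`, the helper functions, and the `alpha / rank` scaling are illustrative assumptions, and the quantizer uses a plain uniform 16-level grid as a stand-in for the paper's actual 4-bit data type. The point is only to show a frozen 4-bit base whose output is adjusted by two small trainable matrices.

```python
import random


def quantize_4bit(weights):
    """Map each weight to one of 16 evenly spaced levels in [-absmax, absmax].

    Assumption: a uniform grid is used purely as a stand-in; it is NOT NF4.
    """
    absmax = max(abs(w) for row in weights for w in row) or 1.0
    codes = [[round((w / absmax + 1.0) / 2.0 * 15) for w in row] for row in weights]
    return codes, absmax


def dequantize_4bit(codes, absmax):
    """Turn 4-bit codes (integers 0..15) back into approximate float weights."""
    return [[(c / 15 * 2.0 - 1.0) * absmax for c in row] for row in codes]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class QuantizedLinearWithLoRA:
    """Hypothetical layer: frozen 4-bit base weights plus a trainable low-rank update."""

    def __init__(self, weights, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weights), len(weights[0])
        # Frozen part: stored only as 4-bit codes plus one scaling constant.
        self.codes, self.absmax = quantize_4bit(weights)
        # Trainable part: A (rank x in_dim) starts small and random,
        # B (out_dim x rank) starts at zero so the adapter is initially a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def base_forward(self, x):
        """Output of the frozen, dequantized base weights alone."""
        return matvec(dequantize_4bit(self.codes, self.absmax), x)

    def forward(self, x):
        """Base output plus the scaled low-rank correction B(Ax)."""
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(self.base_forward(x), update)]

    def trainable_parameter_count(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


if __name__ == "__main__":
    weights = [[((i * 7 + j * 3) % 11 - 5) / 5 for j in range(8)] for i in range(8)]
    layer = QuantizedLinearWithLoRA(weights, rank=1)
    print("trainable:", layer.trainable_parameter_count(), "of", 8 * 8, "base weights")
    print("output:", layer.forward([0.5] * 8))
```

In this toy setup only `A` and `B` would receive gradient updates during training; the 4-bit codes stay fixed, which is why the memory cost of training is dominated by the compressed base model rather than by full-precision copies of it.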

The paper introduces three techniques: 4-bit NormalFloat (NF4), a data type the authors describe as information-theoretically optimal for normally distributed weights; double quantization, which quantizes the quantization constants themselves to save more memory; and paged optimizers, which manage memory spikes during training. Together these let the team finetune a 65-billion-parameter model on a single 48 GB GPU while preserving the task performance of full 16-bit finetuning. They used the method to train Guanaco, a model family whose best member they reported reached 99.3 percent of ChatGPT’s performance level on the Vicuna benchmark after 24 hours of finetuning on a single GPU.
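The memory effect of double quantization can be checked with simple arithmetic. The sketch below assumes the configuration reported in the paper as I understand it (one 32-bit constant per block of 64 weights, re-quantized to 8 bits with a second-level 32-bit constant per 256 first-level constants); those numbers are not stated in the text above, and the function names are illustrative.

```python
def constants_overhead_bits(block_size=64, constant_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def double_quantized_overhead_bits(block_size=64, constant_bits=8,
                                   second_block_size=256, second_constant_bits=32):
    """Extra bits per weight when the constants are themselves quantized.

    Each weight pays for its share of an 8-bit first-level constant, plus its
    share of the 32-bit second-level constant covering 256 first-level ones.
    """
    first_level = constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


def gigabytes(num_params, bits_per_param):
    """Storage in GB (10^9 bytes) for num_params values at bits_per_param each."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 65e9  # the 65-billion-parameter model mentioned above
    plain = constants_overhead_bits()
    double = double_quantized_overhead_bits()
    print("16-bit weights (GB):", gigabytes(params, 16))
    print("4-bit weights (GB):", gigabytes(params, 4))
    print("constants, single quantization (GB):", gigabytes(params, plain))
    print("constants, double quantization (GB):", gigabytes(params, double))
    print("saved by double quantization (GB):", gigabytes(params, plain - double))
```

Under these assumptions the 4-bit weights of a 65-billion-parameter model occupy about 32.5 GB and double quantization saves roughly a further 3 GB, which leaves headroom on a 48 GB GPU for adapters, activations, and optimizer state.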

QLoRA sharply lowered the hardware barrier to model customization. Before it, finetuning the largest open models required a multi-GPU cluster; after it, a single high-end GPU sufficed, and an ecosystem of finetuned open-weight models followed.
