AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

“AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration” was submitted to arXiv on June 1, 2023 by Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, and colleagues at MIT and collaborating institutions, led by Song Han. Its central observation is that not all weights in a language model matter equally: protecting only about 1% of the weights, the most important (“salient”) ones, can sharply reduce the error introduced by quantization.

The twist is how AWQ decides which weights are salient. Instead of inspecting the weights themselves, it looks at the activation distributions, the actual values flowing through the network, to identify which weight channels carry the most influence. It then scales those channels up before 4-bit quantization so that they lose less precision to rounding. Because the method needs only activation statistics, not backpropagation or reconstruction, it avoids the slower error-compensation procedures used by some other methods while keeping accuracy high. The work won the MLSys 2024 Best Paper Award and shipped with an inference framework the authors called TinyChat.
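
The idea of activation-aware scaling can be illustrated with a small NumPy sketch. This is a simplified toy under stated assumptions, not the authors' implementation: the function names (`quantize_rtn`, `awq_style_quantize`, `output_mse`), the group size, the grid of exponents, the normalization of the scales, and the synthetic data are all hypothetical choices made for illustration. It derives a per-input-channel scale from the average activation magnitude, raises it to an exponent chosen by a small grid search, quantizes the scaled weights to 4 bits with round-to-nearest, and compares the layer-output error against plain round-to-nearest.

```python
import numpy as np

def quantize_rtn(w, n_bits=4, group_size=32):
    """Group-wise asymmetric round-to-nearest; returns dequantized weights."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    q_max = 2 ** n_bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / q_max
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale) + zero, 0, q_max)
    return ((q - zero) * scale).reshape(out_f, in_f)

def output_mse(w_ref, w_hat, x):
    """Mean squared error between the layer outputs x @ w_ref.T and x @ w_hat.T."""
    return float(np.mean((x @ w_ref.T - x @ w_hat.T) ** 2))

def awq_style_quantize(w, x, n_grid=21, **kw):
    """Search an exponent alpha in [0, 1]; scale input channels by act_mag**alpha."""
    # Average activation magnitude per input channel (assumed saliency signal).
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)
    best_w, best_alpha, best_err = None, None, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())  # keep scales centered around 1
        # Quantize scaled weights, then undo the scale. This is equivalent to
        # dividing the activations by s before the quantized layer.
        w_hat = quantize_rtn(w * s, **kw) / s
        err = output_mse(w, w_hat, x)
        if err < best_err:
            best_w, best_alpha, best_err = w_hat, float(alpha), err
    return best_w, best_alpha, best_err

# Synthetic calibration data: a few input channels have much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))
x[:, :4] *= 30.0
w = rng.normal(size=(64, 128)) * 0.02

w_rtn = quantize_rtn(w)
err_rtn = output_mse(w, w_rtn, x)
w_awq, alpha, err_awq = awq_style_quantize(w, x)
print(f"RTN output MSE:        {err_rtn:.6e}")
print(f"Scaled output MSE:     {err_awq:.6e} (alpha={alpha:.2f})")
```

Because the grid includes an exponent of 0, which leaves every channel unscaled, the searched result can never be worse than plain round-to-nearest on the calibration data; how much it improves depends on how unevenly the activation magnitudes are distributed across channels.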

AWQ became, alongside GPTQ, one of the default ways to quantize open-weight models for deployment. For organizations running models on constrained hardware, accurate 4-bit quantization is what makes self-hosting capable models on a single GPU or an edge device economically realistic.

