Efficient Streaming Language Models with Attention Sinks

StreamingLLM, published by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis in September 2023, addresses how to keep a language model running over an essentially endless stream of input, such as a long multi-turn conversation, without exhausting memory or retraining the model. A naive fix is a sliding window that keeps only the most recent tokens in the key-value cache, but the authors found that model quality collapses as soon as the sequence outgrows the cache and the earliest tokens are evicted.

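The reason is a phenomenon they call attention sinks: the very first tokens in a sequence receive a disproportionate share of attention regardless of whether they are semantically important. Because softmax forces attention weights to sum to one, the model needs somewhere to put attention it does not otherwise need, and the initial tokens, which are visible to every later position, become that stabilizing anchor. The fix is surprisingly simple. By keeping the key-value entries of a handful of initial tokens (the paper finds four are enough) alongside the recent window, the model recovers quality close to that of the much slower sliding-window-with-recomputation baseline while still discarding the bulk of old tokens. With this, StreamingLLM handled sequences of up to 4 million tokens and ran up to 22 times faster than that recomputation baseline in streaming settings.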
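
The eviction policy described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the class name SinkKVCache, its parameters num_sinks and window, and the use of integers in place of real key-value tensors are all assumptions made for the example.

```python
from collections import deque


class SinkKVCache:
    """Toy cache (hypothetical): pins the first `num_sinks` entries
    and keeps a rolling window of the most recent `window` entries."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry dropped when full

    def append(self, kv):
        # The first few entries become the permanent attention sinks.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        # What attention would see: sinks followed by the recent window.
        return self.sinks + list(self.recent)


cache = SinkKVCache(num_sinks=4, window=8)
for position in range(100):   # integers stand in for per-token KV tensors
    cache.append(position)

kept = cache.entries()
print(kept)  # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

However long the stream runs, the cache in this sketch holds at most num_sinks + window entries, which is why memory stays flat. A plain sliding window corresponds to num_sinks=0, the configuration the article describes as collapsing once the first tokens are gone.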

For a business, StreamingLLM is part of why a chat assistant can stay coherent across very long sessions without ballooning memory cost, a practical enabler for always-on conversational systems.
