ALiBi: Train Short, Test Long with Attention with Linear Biases

ALiBi (Attention with Linear Biases), published by Ofir Press, Noah A. Smith, and Mike Lewis in August 2021, tackles a recurring problem: transformers trained on sequences of one length usually degrade badly when asked to process longer ones at inference. Most models inject position information by adding learned or sinusoidal vectors to token embeddings, an approach that does not extrapolate well past the training length.

ALiBi takes a simpler route. It adds no positional embeddings at all. Instead, it subtracts a fixed, non-learned penalty from each query-key attention score before the softmax. The penalty grows linearly with the distance between the two tokens, and each attention head uses its own slope, so a token pays progressively less attention to far-away tokens. Because this bias is defined by relative distance rather than by absolute positions seen during training, the pattern keeps working on longer sequences. The authors showed that a 1.3-billion-parameter model trained on 1024-token inputs could process 2048-token inputs with the same perplexity as a sinusoidal-position model trained directly on 2048-token inputs, while training 11 percent faster and using 11 percent less memory.
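The linear penalty described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (alibi_slopes, alibi_bias, attention_weights) are hypothetical, and the geometric slope schedule follows the paper's description for head counts that are powers of two, which is an assumption of this sketch rather than something stated in the text above.

```python
import math

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumption: num_heads is a power of two.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, slope):
    """Add the distance bias to raw query-key scores, then softmax each row."""
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([s + b for s, b in zip(scores[i], bias[i])])
        for i in range(n)
    ]

# With identical raw scores, the bias alone makes nearer tokens weigh more.
uniform = [[0.0] * 4 for _ in range(4)]
last_row = attention_weights(uniform, slope=0.5)[-1]
```

Nothing in alibi_bias depends on a trained maximum length, so the same function can be called with a seq_len larger than anything seen in training. That property is what the extrapolation argument rests on.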

ALiBi was one of the influential early answers to the length-extrapolation question, alongside rotary embeddings, and it shaped how later long-context models handle position. For a business, it is part of the foundation that lets modern models read documents longer than the sequences they were trained on.
