Longformer: The Long-Document Transformer

Longformer, published by Iz Beltagy, Matthew E. Peters, and Arman Cohan of the Allen Institute for AI in April 2020, was one of the first transformers designed specifically for long documents. Standard self-attention compares every token to every other token, so its cost grows with the square of the sequence length, which makes processing thousands of tokens prohibitively expensive.
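To make the quadratic growth concrete, here is a minimal sketch that counts the query-key scores full self-attention computes for one head in one layer. The function name `full_attention_scores` is a hypothetical helper for illustration, not part of any library.

```python
def full_attention_scores(seq_len: int) -> int:
    """Count query-key scores for full self-attention (one head, one layer).

    Every token is compared with every token, itself included,
    so the count is seq_len squared.
    """
    return seq_len * seq_len


# Illustrative lengths only: each 8x increase in length costs 64x more scores.
for n in (512, 4096):
    print(n, full_attention_scores(n))
# 512 262144
# 4096 16777216
```

Going from 512 to 4,096 tokens multiplies the sequence length by 8 but the number of scores by 64, which is why simply feeding a standard transformer longer inputs quickly becomes impractical.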

Longformer replaces full attention with a sparse pattern. Most tokens use local windowed attention, looking only at a fixed band of nearby tokens, while a small number of task-chosen positions use global attention: they attend to every token in the sequence, and every token attends to them. For a fixed window size and a small number of global positions, this combination scales linearly with length rather than quadratically, and it works as a drop-in replacement for ordinary self-attention. Pretrained Longformer outperformed RoBERTa on long-document tasks and set state-of-the-art results on benchmarks such as WikiHop and TriviaQA.
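The pattern described above can be sketched as a boolean mask in plain Python. This is an illustration under stated assumptions, not Longformer's actual implementation: the names `longformer_style_mask` and `count_pairs` are hypothetical, the window is taken to be symmetric around each token, and the dense matrix is built only so the pattern can be inspected (an efficient implementation would avoid materializing it).

```python
def longformer_style_mask(seq_len, window, global_positions):
    """Build a seq_len x seq_len boolean matrix (hypothetical helper).

    mask[i][j] is True when token i may attend to token j.
    `window` is the total local window size, split evenly on both sides.
    """
    half = window // 2
    global_set = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Local windowed attention: only a fixed band of nearby tokens.
            local = abs(i - j) <= half
            # Global attention is symmetric: a global token attends to all
            # tokens, and all tokens attend to it.
            is_global = i in global_set or j in global_set
            mask[i][j] = local or is_global
    return mask


def count_pairs(mask):
    """Count the attended (query, key) pairs in a mask."""
    return sum(sum(row) for row in mask)


# Tiny example: 8 tokens, window of 2, token 0 marked global
# (a choice made here for illustration, e.g. a classification token).
mask = longformer_style_mask(8, 2, [0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_pairs(mask), "of", 8 * 8, "pairs attended")
```

With the window and the set of global positions held fixed, doubling `seq_len` roughly doubles the value returned by `count_pairs`, whereas the full-attention count would quadruple.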

Along with BigBird, Longformer established sparse attention as a practical way to extend transformers to long inputs years before million-token context windows became common.

For a business, Longformer is part of the lineage that made it feasible to run transformers over entire contracts, reports, or transcripts in one pass, instead of chopping them into small pieces and losing the connections between them.

