Big Bird: Transformers for Longer Sequences

Big Bird, published by Manzil Zaheer, Guru Guruganesh, and colleagues at Google in July 2020, is a transformer built to handle far longer sequences than models like BERT. The obstacle is the quadratic cost of full attention: every token attends to every other token, so compute and memory grow with the square of the sequence length. Big Bird reduces this to linear growth with a carefully chosen sparse attention pattern, which the authors report handles sequences roughly eight times longer than was previously practical on the same hardware.

Big Bird’s attention combines three patterns: local window attention, so each token sees its neighbors; random attention, so each token connects to a few randomly chosen positions; and global attention, through a small set of special tokens that attend to the entire sequence and are attended to by every other token. The authors also proved that this sparse attention remains a universal approximator of sequence functions and is Turing complete, so the efficiency gain does not sacrifice expressive power. They demonstrated improvements on question answering and summarization, and also applied the model to genomic sequences.
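To make the three patterns concrete, here is a minimal token-level sketch in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name, parameter names, and default sizes are hypothetical, and practical implementations operate on blocks of tokens rather than building a dense mask like this one.

```python
import numpy as np

def bigbird_style_mask(seq_len, window=3, num_random=2, num_global=2, seed=0):
    """Return a boolean mask where mask[i, j] is True if query i may attend to key j.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local window: each token sees itself and `window` neighbors on each side.
    idx = np.arange(seq_len)
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window

    # Random attention: each query connects to a few randomly chosen keys.
    for i in range(seq_len):
        cols = rng.choice(seq_len, size=num_random, replace=False)
        mask[i, cols] = True

    # Global attention: the first `num_global` tokens attend to everything
    # and are attended to by every token.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

seq_len = 512
mask = bigbird_style_mask(seq_len)

# Compare the number of attended pairs with full attention.
sparse_pairs = int(mask.sum())
full_pairs = seq_len * seq_len
print(f"sparse pairs: {sparse_pairs}, full pairs: {full_pairs}")
```

In this sketch each non-global row allows at most `2 * window + 1 + num_random + num_global` keys regardless of `seq_len`, so the number of attended pairs grows linearly with sequence length rather than quadratically.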

Together with Longformer, Big Bird helped legitimize sparse attention as the route to long context before dense long-context models took over.

For a business, Big Bird is part of why models can now reason over long inputs such as full documents or even DNA sequences, opening applications that short-context models simply could not address.
