“Rethinking Attention with Performers,” submitted to arXiv on September 30, 2020 by Krzysztof Choromanski and colleagues from Google, the University of Cambridge, DeepMind, and the Alan Turing Institute, including Lukasz Kaiser and Adrian Weller, introduced the Performer, a Transformer variant whose attention cost in time and memory grows linearly rather than quadratically with sequence length. Standard self-attention compares every token to every other token, so its cost grows with the square of the sequence length, which becomes the main bottleneck for long inputs.
The Performer’s core mechanism, called FAVOR+ (Fast Attention Via positive Orthogonal Random features), approximates the softmax attention matrix using a random feature map. Each query and key is mapped to a vector of positive random features whose dot product estimates the corresponding softmax kernel value. Because attention then becomes a product of three matrices, FAVOR+ can multiply the key features by the values first and apply the query features afterwards, so the full token-by-token attention matrix is never materialized and both time and memory scale linearly with sequence length. Crucially, the authors prove that the estimate is unbiased or nearly unbiased, with uniform convergence and low variance, rather than relying on heuristics or on priors such as sparsity or low rank of the attention matrix.
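The reordering described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes bidirectional (non-causal) attention, and it draws independent Gaussian projections instead of the orthogonal ones that FAVOR+ uses to reduce variance. The function names (`softmax_attention`, `positive_features`, `random_feature_attention`, `random_matrix`) are hypothetical and chosen only for this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, scale, rng):
    """Helper for the demo: entries uniform in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax_attention(q, k, v):
    """Exact attention. Each query is scored against every key: O(L^2) work."""
    d = len(q[0])
    out = []
    for qi in q:
        weights = [math.exp(dot(qi, kj) / math.sqrt(d)) for kj in k]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def positive_features(x, omega):
    """phi(x)_i = exp(omega_i . x - |x|^2 / 2) / sqrt(m), positive by construction.

    For omega_i ~ N(0, I), E[phi(x) . phi(y)] = exp(x . y).
    """
    half_sq_norm = 0.5 * dot(x, x)
    scale = 1.0 / math.sqrt(len(omega))
    return [scale * math.exp(dot(w, x) - half_sq_norm) for w in omega]


def random_feature_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention without building the L x L matrix."""
    rng = random.Random(seed)
    d, d_v = len(q[0]), len(v[0])
    # Assumption: i.i.d. Gaussian rows; FAVOR+ additionally orthogonalizes them.
    omega = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
    s = d ** -0.25  # splits the usual 1/sqrt(d) scaling between q and k
    q_feat = [positive_features([s * a for a in row], omega) for row in q]
    k_feat = [positive_features([s * a for a in row], omega) for row in k]

    # Summarize keys and values once: kv is m x d_v, k_sum has length m.
    # Neither size depends on the sequence length L.
    kv = [[0.0] * d_v for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for kf, vj in zip(k_feat, v):
        for i, f in enumerate(kf):
            k_sum[i] += f
            for c in range(d_v):
                kv[i][c] += f * vj[c]

    # Each query reads only the fixed-size summaries: O(m * d_v) per token.
    out = []
    for qf in q_feat:
        denom = dot(qf, k_sum)  # approximates the softmax normalizer
        out.append([sum(f * kv[i][c] for i, f in enumerate(qf)) / denom
                    for c in range(d_v)])
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    q = random_matrix(6, 4, 0.5, rng)
    k = random_matrix(6, 4, 0.5, rng)
    v = random_matrix(6, 3, 1.0, rng)
    exact = softmax_attention(q, k, v)
    approx = random_feature_attention(q, k, v, num_features=2000)
    err = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
    print("max abs error:", err)
```

Because every feature is positive, each approximate output row remains a convex combination of the value rows, just as in exact softmax attention. Positivity is the property that the "+" in FAVOR+ refers to. The approximation error shrinks as `num_features` grows, at the price of larger key and value summaries.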
They demonstrated competitive performance across diverse tasks, including image pixel prediction, language modeling, and protein sequence analysis, showing the method generalizes beyond text.
Performers were among the most theoretically grounded entries in the wave of efficient Transformer variants. For practitioners facing long sequences, they offered a principled drop-in alternative to full attention that closely approximates its output while cutting the quadratic cost that otherwise makes long-context modeling impractical.