Transformer-XL, from “Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context” by Zihang Dai and colleagues (posted to arXiv on January 9, 2019, and published at ACL 2019), addressed a basic limitation of the original transformer when used for language modeling: it processes text in fixed-length chunks and cannot carry information across chunk boundaries. That caps how far back the model can look at the chunk length and cuts long documents at arbitrary points, a problem the paper calls context fragmentation.
The paper introduces two ideas. A segment-level recurrence mechanism caches the hidden states computed for the previous segment and reuses them as extra context for the current one, letting information flow from one chunk to the next without disturbing the order of the sequence. A new relative positional encoding scheme makes that reuse work correctly: with absolute positions, a token at a given offset in the cached segment and a token at the same offset in the current segment would receive identical encodings, so the model could not tell them apart. Encoding the distance between tokens removes that ambiguity. The authors report the model “learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers,” sets state-of-the-art perplexity on several benchmarks, and is “up to 1,800+ times faster than vanilla Transformers during evaluation.”
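To make the two ideas concrete, here is a minimal single-layer, single-head sketch in plain Python. It is not the paper's implementation: the function names (attend_with_memory, run_segments) and the rel_bias parameter are hypothetical, there are no learned projections, and a simple scalar penalty on distance stands in for the paper's relative positional encoding. What it does show is the core mechanics: keys and values come from the cached states plus the current segment, and positional information enters only through the distance between query and key.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(segment, memory, rel_bias):
    """Causal attention for one segment over [memory + segment].

    segment, memory: lists of vectors (lists of floats) of equal dimension.
    rel_bias: hypothetical function mapping a distance (query index minus
    key index, always >= 0) to a score offset. This is an assumption of the
    sketch, a stand-in for the paper's relative positional encoding.
    """
    context = memory + segment          # cached states come first
    mem_len = len(memory)
    dim = len(segment[0])
    outputs = []
    for i, query in enumerate(segment):
        q_pos = mem_len + i             # position of the query inside context
        visible = context[: q_pos + 1]  # causal: no looking ahead
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            + rel_bias(q_pos - j)       # depends only on distance
            for j, key in enumerate(visible)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs


def run_segments(segments, mem_len, rel_bias):
    """Process segments in order, carrying a cache of recent states."""
    memory = []
    all_outputs = []
    for segment in segments:
        all_outputs.append(attend_with_memory(segment, memory, rel_bias))
        # Keep the most recent mem_len input states for the next segment.
        # In the paper the cached states are held fixed (no gradient flows
        # into them); this sketch does no training, so that detail is absent.
        memory = (memory + segment)[-mem_len:] if mem_len > 0 else []
    return all_outputs


# Two toy segments of two 2-dimensional "hidden states" each.
segments = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, -0.5]]]
distance_penalty = lambda d: -0.1 * d   # illustrative choice only

with_memory = run_segments(segments, mem_len=2, rel_bias=distance_penalty)
without_memory = run_segments(segments, mem_len=0, rel_bias=distance_penalty)
```

With mem_len=0 the sketch behaves like the fixed-length baseline: the first token of the second segment can attend only to itself. With mem_len=2 the same token also attends to the two cached states, so the outputs for the second segment differ between the two runs while the outputs for the first segment are identical.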
Transformer-XL matters because the struggle to handle long context has shaped much of the field since. Its recurrence-and-relative-position recipe directly fed into XLNet and informed the long-running effort to give language models longer, cheaper memory.