XLNet: Generalized Autoregressive Pretraining

XLNet, from “XLNet: Generalized Autoregressive Pretraining for Language Understanding” by Zhilin Yang and colleagues (submitted June 19, 2019), combines the strengths of two competing pretraining styles. Autoregressive models like GPT predict each word from the words before it, so they read context in one direction only. BERT instead replaces some input words with mask tokens and reconstructs them from the context on both sides. The authors point out that by “relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy,” since the mask tokens never appear at fine-tuning time.

XLNet’s answer is permutation language modeling. It “enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order,” meaning it predicts words in many different orders so that, across orders, each word sees context on both sides without any artificial mask token. Only the order of prediction is permuted; the words keep their original positions in the sentence. It also folds in ideas from Transformer-XL, “the state-of-the-art autoregressive model,” to handle longer context. Under comparable experiment settings, the authors report that XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
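The permutation idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `visible_context` and `sample_order`, the example sentence, and the fixed order are all assumptions made for the example, and the actual model enforces the same constraint through attention masks rather than explicit lists.

```python
import random


def visible_context(num_tokens, order):
    """Map each position to the positions it may condition on.

    `order` is a factorization order: a permutation of positions giving
    the sequence in which tokens are predicted. A position may only see
    positions that come earlier in that order, wherever they sit in the
    sentence.
    """
    # rank[pos] = step at which `pos` is predicted
    rank = {pos: step for step, pos in enumerate(order)}
    return {
        pos: [j for j in range(num_tokens) if rank[j] < rank[pos]]
        for pos in range(num_tokens)
    }


def sample_order(num_tokens, seed=None):
    """Draw one random factorization order (illustrative helper)."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    return order


tokens = ["New", "York", "is", "a", "city"]
order = [2, 4, 0, 3, 1]  # hypothetical order: "is", "city", "New", "a", "York"
context = visible_context(len(tokens), order)

for pos in order:
    seen = [tokens[j] for j in context[pos]]
    print(f"predict {tokens[pos]!r} given {seen}")
```

Under this particular order, “York” is predicted last and conditions on “New” to its left and on “is”, “a”, and “city” to its right, with no mask token in the input. Averaging over many sampled orders is what lets every position eventually see context from both directions.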

XLNet matters as a clear illustration that there is more than one way to teach a model bidirectional understanding, and that rethinking the training objective can close the gaps left by masking: the ignored dependencies between masked positions and the mismatch between pretraining and fine-tuning.

Last verified June 7, 2026