ELECTRA, introduced in “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators” by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning (submitted March 23, 2020; published at ICLR 2020), proposed a more compute-efficient way to pretrain language models. BERT learns by masking out a small fraction of the input tokens (typically 15%) and predicting them, so only the masked positions contribute to the training loss and most of each example provides no direct learning signal.
ELECTRA introduces “replaced token detection.” A small generator network, itself a masked language model, swaps some tokens for plausible alternatives, and the main model, the discriminator, learns to predict for every token whether it is original or replaced. Because “the task is defined over all input tokens rather than just the small subset that was masked out,” the model gets a training signal from every position in each example. The payoff is striking efficiency: the authors trained a model on a single GPU for four days that “outperforms GPT (trained using 30x more compute)” on GLUE, and at scale ELECTRA performs comparably to RoBERTa and XLNet “while using less than 1/4 of their compute.”
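To make the labeling step concrete, here is a minimal sketch of how one replaced-token-detection example could be built. It is an illustration under stated assumptions, not the authors' code: the function name `make_rtd_example`, the toy vocabulary, and the uniform random sampler standing in for the generator are all hypothetical, whereas the real generator is a trained masked language model. One detail does follow the paper: if the sampled token happens to equal the original, that position is labeled as original.

```python
import random


def make_rtd_example(tokens, vocab, mask_rate=0.15, seed=0):
    """Build one toy replaced-token-detection example.

    Returns (corrupted, labels). labels[i] is 1 if position i now holds a
    token that differs from the original, and 0 otherwise.
    """
    rng = random.Random(seed)
    # Choose which positions the (stand-in) generator will rewrite.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)

    corrupted = list(tokens)
    for i in positions:
        # ASSUMPTION: a uniform draw from a toy vocabulary replaces the
        # trained generator network used in the actual method.
        corrupted[i] = rng.choice(vocab)

    # Every position gets a label, so every position contributes to the
    # discriminator's loss. A sampled token equal to the original counts
    # as "original" (label 0).
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


if __name__ == "__main__":
    tokens = "the chef cooked the meal".split()
    vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
    corrupted, labels = make_rtd_example(tokens, vocab)
    print(corrupted)
    print(labels)
    # Masked language modeling would supervise only the selected positions;
    # replaced token detection supervises all len(tokens) positions.
    print("supervised positions:", len(labels))
```

The comparison in the last lines is the point of the design: for the same input, the discriminator receives a label at every position, while a masked language model is trained only on the few positions that were masked.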
ELECTRA matters because it showed that the design of the pretraining task, not just the model size or data, can dramatically lower the cost of building strong language models. That is good news for anyone who cannot afford a giant compute budget.