BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

“BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” was submitted to arXiv in October 2018 by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova of Google. BERT, short for Bidirectional Encoder Representations from Transformers, set a new standard for how language models are built and used.

BERT’s key idea was to read text in both directions at once. Most earlier language models processed words left to right, so when interpreting a word they could see only what came before it. BERT instead conditions on the context on both sides of a word simultaneously. It learns this with a self-supervised task called masked language modeling: hide a fraction of the tokens in the input (about 15% in the original paper) and train the model to fill in the blanks using the surrounding context. Because the training text supplies its own answers, BERT could learn from vast amounts of unlabeled text.
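The masking step can be illustrated with a minimal sketch. It assumes whitespace-split words stand in for tokens (BERT itself uses WordPiece subword tokens), and the function name `mask_tokens`, its parameters, and the toy vocabulary are hypothetical names chosen for illustration. The 15% selection rate and the 80/10/10 split (mask token, random token, unchanged) follow the procedure described in the BERT paper; everything else is a simplification, not the authors' implementation.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at each selected position and None
    elsewhere, so a training loss would be computed only where labels[i]
    is not None. This is an illustrative helper, not a library function.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = tok  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80% of selected positions: mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: a random vocabulary token
        # remaining 10%: leave the token unchanged
    return corrupted, labels


sentence = "the cat sat on the mat because it was warm".split()
vocab = sorted(set(sentence))  # toy vocabulary, an assumption for the demo
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

The model sees only `corrupted` and is trained to recover the non-`None` entries of `labels`. Since a prediction at a masked position can draw on words to its left and right, this objective is what lets the encoder use context from both directions.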

The practical payoff was the pre-train-then-fine-tune recipe. A single BERT model is pre-trained once, at great expense, on a huge text corpus. Anyone can then take that model and fine-tune it cheaply on a specific task, such as question answering, sentiment classification, or named entity recognition, with relatively little labeled data. When it appeared, BERT reported state-of-the-art results on eleven natural language processing tasks, including the GLUE and SQuAD benchmarks, and by late 2019 Google had announced that BERT was helping Google Search interpret queries.

Two caveats concern lineage and scope. BERT is an encoder built on the Transformer architecture, introduced in 2017, so it stood on that architecture rather than inventing it. And BERT is designed to understand and classify text, not to generate long passages; that role went to the GPT line of decoder models. BERT’s enduring contribution was demonstrating that large-scale pre-training followed by fine-tuning was the way forward for natural language processing.

Last verified June 6, 2026