REALM: Retrieval-Augmented Language Model Pre-Training

REALM (Retrieval-Augmented Language Model Pre-Training), published by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang of Google in February 2020, was one of the first systems to bake document retrieval directly into how a language model is trained. Standard language models store everything they know implicitly inside their weights, so covering more facts requires an ever-larger network. REALM instead pairs the model with a separate retriever that can pull documents from a large corpus such as Wikipedia during pre-training, fine-tuning, and inference.

The paper’s central technical achievement was showing how to pre-train the retriever without labeled data. It used the ordinary masked-language-modeling objective (predicting hidden words) as a learning signal and backpropagated through a retrieval step that ranked millions of documents, so the retriever gradually learned to fetch passages that actually help fill in the blanks. This made the knowledge component explicit and modular rather than buried in parameters.
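
The retrieve-then-predict idea described above can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's implementation: it assumes relevance is an inner product between a query vector and document vectors, turns the top-scoring documents' scores into a retrieval distribution with a softmax, and then averages the per-document prediction probabilities under that distribution. The function names, the tiny hand-written vectors, the per-document probabilities, and the choice to renormalize over only the top-k documents are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def marginal_likelihood(query_vec, doc_vecs, p_answer_given_doc, top_k=2):
    """Approximate p(y|x) as the sum over top-k docs z of p(z|x) * p(y|z,x).

    Assumption: p(z|x) is renormalized over the top-k documents only,
    which is a simplification made for this sketch.
    """
    scores = [dot(query_vec, d) for d in doc_vecs]
    # Indices of the k documents with the highest relevance scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    p_z = softmax([scores[i] for i in top])
    likelihood = sum(p * p_answer_given_doc[i] for p, i in zip(p_z, top))
    return likelihood, dict(zip(top, p_z))


# Hypothetical toy data: one query, three documents.
query = [1.0, 0.0]
docs = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical probability that the model fills the blank correctly
# when conditioned on each document.
p_answer = [0.9, 0.1, 0.5]

likelihood, weights = marginal_likelihood(query, docs, p_answer, top_k=2)
print(likelihood, weights)
```

Because the retrieval weights enter the final likelihood directly, a training signal that raises the likelihood also raises the score of documents that made the prediction easier, which is the intuition behind learning the retriever from the masked-word objective alone.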

On open-domain question answering, REALM outperformed prior methods by 4 to 16 percentage points of absolute accuracy across three benchmarks (Natural Questions-Open, WebQuestions, and CuratedTrec), while also offering interpretability: you can see which documents the model consulted.

For a business reader, REALM is the research root of the now-standard idea that an AI system can be small and current if it is allowed to look information up, rather than large and frozen because it had to memorize everything in advance.
