RETRO (Retrieval-Enhanced Transformer), introduced by Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, and colleagues at DeepMind in December 2021, conditions a language model’s predictions on text chunks retrieved from an enormous external corpus. As the model generates, it looks up passages that are similar to the recent context and attends to them through a chunked cross-attention mechanism. The architecture combines a frozen BERT-based retriever, which finds the passages, with a trainable encoder, which processes them for the language model.
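The lookup step described above can be illustrated with a toy sketch. This is not DeepMind’s implementation: the bag-of-words `embed` function is a stand-in for the frozen BERT embeddings, the 4-token chunk size is a toy value (the paper uses 64-token chunks), and every function name here is hypothetical. The sketch covers only retrieval, not the cross-attention that consumes the retrieved text.

```python
import math
from collections import Counter

CHUNK_SIZE = 4  # toy value, assumed for illustration; the paper uses 64-token chunks


def split_into_chunks(tokens, size=CHUNK_SIZE):
    """Split a token list into consecutive fixed-size chunks (the last may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def embed(chunk):
    """Stand-in for a frozen BERT embedding: a sparse bag-of-words count vector."""
    return Counter(chunk)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_database(corpus_tokens, size=CHUNK_SIZE):
    """Index each corpus chunk by its embedding, stored with the chunk that follows it."""
    chunks = split_into_chunks(corpus_tokens, size)
    continuations = chunks[1:] + [[]]  # the final chunk has no continuation
    return [(embed(c), c, nxt) for c, nxt in zip(chunks, continuations)]


def retrieve(query_chunk, database, k=2):
    """Return the k (chunk, continuation) pairs most similar to the query chunk."""
    q = embed(query_chunk)
    ranked = sorted(database, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [(chunk, continuation) for _, chunk, continuation in ranked[:k]]
```

With a database built from the sentence "the cat sat on the mat and the dog slept by the door all day", calling `retrieve("a cat sat on".split(), db, k=1)` returns the chunk `the cat sat on` together with its continuation `the mat and the`. Storing the continuation alongside each key is what gives the model text about what tends to come next, not just a match for what it has already seen. A real system would replace the linear scan with an approximate nearest-neighbor index to make a trillion-token database searchable.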
The headline result was efficiency. RETRO’s retrieval database held 2 trillion tokens, and with this external memory a 7.5-billion-parameter RETRO model reached performance comparable to GPT-3 and Jurassic-1 on the Pile benchmark while using roughly 25 times fewer parameters, or about 4 percent of theirs. The paper also showed that an existing pre-trained transformer could be retrofitted with retrieval cheaply, and that, after fine-tuning, the gains carried over to knowledge-intensive tasks like question answering.
RETRO reframed the scaling debate. Instead of growing a model until it has memorized the internet, you can keep the model modest and give it fast access to a vast text store. For businesses weighing the cost of running large models, RETRO is early evidence that retrieval can stand in for part of the raw parameter count, which can lower both training and serving expense.