Language Models are Few-Shot Learners (GPT-3)

“Language Models are Few-Shot Learners” was submitted to arXiv in May 2020 by Tom Brown and a large team at OpenAI. It introduced GPT-3, a Transformer language model with 175 billion parameters - more than ten times larger than any dense language model before it - and it reshaped expectations for what scale alone could buy.

The paper’s central finding was in-context learning. Earlier models such as BERT were adapted to a new task by fine-tuning - updating their weights on task-specific examples. GPT-2 had shown glimpses of zero-shot behavior, but GPT-3 often needed no weight updates at all. You could simply describe a task in plain language and show it a handful of examples inside the prompt - a “few-shot” demonstration - and the model would infer the pattern and continue it. Translation, arithmetic, question answering, unscrambling words, even writing simple code could be coaxed out of one fixed model by changing only the text you fed it.
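
To make the idea of a few-shot prompt concrete, here is a minimal sketch of how one might be assembled: a task description, a few worked examples, and a final unanswered query for the model to complete. The function name `build_few_shot_prompt`, the label format, and the example pairs are illustrative assumptions, not the paper's exact prompt templates, and no model is actually called.

```python
def build_few_shot_prompt(task_description, examples, query,
                          input_label="Input", output_label="Output"):
    """Assemble a few-shot prompt as plain text.

    Hypothetical helper for illustration: the layout (description,
    K demonstrations, then an open-ended query) mirrors the general
    idea of in-context learning, not any specific template.
    """
    lines = [task_description, ""]
    # Each demonstration shows the model one solved instance of the task.
    for source, target in examples:
        lines.append(f"{input_label}: {source}")
        lines.append(f"{output_label}: {target}")
        lines.append("")
    # The final item is left unanswered; the model's continuation is the answer.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
    input_label="English",
    output_label="French",
)
print(prompt)
```

Switching tasks means changing only the description and the examples; the model's weights are never touched, which is the contrast with fine-tuning drawn above.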

What was new was less the architecture, which was essentially a scaled-up GPT-2, than the demonstration that capabilities emerge from size and data. The work built directly on scaling-laws research showing that performance improves predictably as model size, data, and compute grow. GPT-3 made that abstract finding visceral: a single general model could do many things it was never specifically trained to do. The paper also helped establish prompt engineering as a practical skill and set the stage for ChatGPT about two and a half years later.
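
The phrase “improves predictably” can be pictured as a power law: loss falls as a fixed power of parameter count, so every tenfold increase in size buys the same proportional reduction. The sketch below assumes that functional form. The constants `n_c` and `alpha` are illustrative values of the kind reported in the scaling-laws literature, not figures from this paper, and the function names are hypothetical.

```python
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Assumed power-law form: loss = (n_c / n_params) ** alpha.

    n_c and alpha are illustrative placeholders, not fitted here.
    """
    return (n_c / n_params) ** alpha


def loss_ratio(scale_factor, alpha=0.076):
    """Multiplicative change in loss when parameters grow by scale_factor."""
    return scale_factor ** (-alpha)


# Parameter counts: GPT-2 (1.5 billion) and GPT-3 (175 billion).
for name, n in [("GPT-2", 1.5e9), ("GPT-3", 1.75e11)]:
    print(f"{name}: predicted loss {power_law_loss(n):.3f}")

# Under this form, the gain from a 10x scale-up does not depend on the starting size.
print(f"loss multiplier per 10x parameters: {loss_ratio(10):.3f}")
```

The design point is that the ratio depends only on the scale factor, which is what makes the gains from a larger model forecastable before training it.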

The authors were candid about the limits. GPT-3 still made things up, lost coherence over long passages, and showed biases absorbed from its training data; few-shot prompting was powerful but unreliable. The “few-shot learning” framing was also debated - the model was not truly learning new skills at inference so much as surfacing patterns already latent in its training. The paper itself flagged these concerns alongside the broader societal risks of such a capable text generator.

Last verified June 6, 2026