Scaling Laws for Neural Language Models (Kaplan et al.)

“Scaling Laws for Neural Language Models” was submitted to arXiv on January 23, 2020 by Jared Kaplan, Sam McCandlish, and colleagues at OpenAI, including Tom B. Brown and Dario Amodei. It is the paper that put a quantitative recipe behind the intuition that bigger language models are better, and it directly shaped the decision to build GPT-3 later that year.

The central finding was that a language model’s test loss falls as a smooth power law in each of three quantities: the number of model parameters, the size of the training dataset, and the amount of training compute, provided the other two are not the bottleneck. Some of these trends span more than seven orders of magnitude. Architectural details like network width and depth mattered surprisingly little within the ranges tested; what dominated was sheer scale. The authors derived equations predicting how loss depends on each factor and how to divide a fixed compute budget between making the model bigger and training it longer.
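
To make the shape of such a law concrete, here is a minimal Python sketch of the parameter-count relationship, written as L(N) = (N_c / N)^alpha_N. The constants are approximate values of the kind the paper reports for its parameter fit and are used here only as illustrative assumptions; the function names are hypothetical.

```python
# Illustrative constants (assumptions): roughly the values fitted for the
# parameter-count law. Treat them as placeholders, not authoritative figures.
ALPHA_N = 0.076   # power-law exponent for parameter count
N_C = 8.8e13      # scale constant, in parameters


def loss_from_params(n_params: float) -> float:
    """Predicted test loss for a model with n_params parameters,
    assuming data and compute are not the limiting factor."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return (N_C / n_params) ** ALPHA_N


def loss_ratio(scale_factor: float) -> float:
    """Factor by which predicted loss changes when the parameter
    count is multiplied by scale_factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor ** (-ALPHA_N)


if __name__ == "__main__":
    # Each row is ten times larger than the previous one.
    for n in (1e6, 1e7, 1e8, 1e9):
        print(f"{n:.0e} parameters -> predicted loss {loss_from_params(n):.3f}")
    print(f"Loss multiplier per 10x parameters: {loss_ratio(10):.3f}")
```

Under these assumed constants, each tenfold increase in parameters multiplies the predicted loss by about 0.84, whatever the starting size. That constant ratio is what makes the curve a straight line on a log-log plot and what makes extrapolation to larger models possible.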

Their compute-optimal advice was striking: for a fixed budget you should train very large models on relatively modest amounts of data and stop well before convergence, because large models are more sample-efficient. This conclusion was influential but later qualified. In 2022 DeepMind’s Chinchilla paper (Hoffmann et al.) re-ran the analysis with a corrected treatment of the learning-rate schedule and concluded that the Kaplan recipe trained models on too little data: parameters and training tokens should instead scale roughly in proportion. The pair of papers is now usually read together, with Chinchilla the standard reference for compute-optimal training.
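
The difference between the two recipes can be sketched as simple arithmetic. The snippet below assumes that training compute is proportional to parameters times tokens (a common approximation, not stated in this article), that the Kaplan fit has the optimal parameter count growing roughly as compute to the 0.73 power (an approximate figure from that paper), and that Chinchilla's proportional scaling corresponds to an exponent of 0.5. The function and constant names are hypothetical.

```python
from typing import Tuple

# Assumed exponents for how optimal parameter count grows with compute.
KAPLAN_PARAM_EXPONENT = 0.73      # approximate value from the Kaplan fit
CHINCHILLA_PARAM_EXPONENT = 0.5   # parameters and tokens scale in proportion


def scale_allocation(compute_multiplier: float,
                     param_exponent: float) -> Tuple[float, float]:
    """Return (parameter multiplier, token multiplier) for a given increase
    in compute, assuming compute is proportional to parameters * tokens."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    param_mult = compute_multiplier ** param_exponent
    # Whatever compute growth is not spent on parameters goes to tokens.
    token_mult = compute_multiplier / param_mult
    return param_mult, token_mult


if __name__ == "__main__":
    for name, exponent in (("Kaplan", KAPLAN_PARAM_EXPONENT),
                           ("Chinchilla", CHINCHILLA_PARAM_EXPONENT)):
        params, tokens = scale_allocation(10.0, exponent)
        print(f"{name}: 10x compute -> {params:.2f}x parameters, "
              f"{tokens:.2f}x tokens")
```

With these assumed exponents, a tenfold compute increase goes mostly into model size under the Kaplan recipe (about 5.4x parameters and 1.9x tokens) but is split evenly under Chinchilla (about 3.2x each).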

Why business readers should care: scaling laws turned model building from guesswork into forecasting. They let labs predict the capability gain from a larger training run before spending the money, which is part of why frontier AI became a capital-intensive race rather than an open research lottery.
