In January 2020, Jared Kaplan, Sam McCandlish, and colleagues at OpenAI published “Scaling Laws for Neural Language Models.” The paper studied how a language model’s performance depends on three ingredients: the size of the model, the amount of training data, and the compute used to train it.
Their central finding was that performance, measured as cross-entropy loss, improves smoothly and predictably: loss falls as a power law in each of the three factors, provided the other two are not the bottleneck, with some trends spanning more than seven orders of magnitude. The paper also found that architectural details such as the exact width and depth mattered little within a wide range. A key practical conclusion was that larger models are significantly more sample-efficient, so for a fixed compute budget the optimal strategy is to train very large models on a relatively modest amount of data and stop well before full convergence.
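As a rough guide to what "power law" means here, the relationships can be written in the following general shape. The notation is a sketch recalled from the paper and should be checked against it: N is the number of model parameters, D the dataset size in tokens, and C_min the compute budget when it is spent optimally. The constants N_c, D_c, C_c^min and the exponents are fitted values, left symbolic here.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

Taking logarithms of any one of these gives a straight line, for example log L = α_N log N_c − α_N log N, which is why the trends appear as straight lines on a log-log plot.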
This turned scaling from guesswork into engineering. Teams could now estimate, in advance, how much better a model would get if they spent more on size, data, or compute.
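The kind of advance estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: it assumes a single power law of the form loss = (x_c / x) ** alpha, fits it to a handful of small runs by least squares in log-log space, and extrapolates to a larger model. The function names, the constants TRUE_ALPHA and TRUE_XC, and the "measured" losses are all hypothetical; the data is synthetic and noise-free.

```python
import numpy as np


def predict_loss(x, alpha, x_c):
    """Power-law loss curve: loss = (x_c / x) ** alpha."""
    return (x_c / np.asarray(x, dtype=float)) ** alpha


def fit_power_law(x, loss):
    """Fit loss = (x_c / x) ** alpha via a straight-line fit in log-log space.

    log(loss) = alpha * log(x_c) - alpha * log(x), so the fitted slope is
    -alpha and the intercept is alpha * log(x_c).
    """
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    alpha = -slope
    x_c = np.exp(intercept / alpha)
    return alpha, x_c


# Hypothetical constants used only to generate synthetic "small run" results.
TRUE_ALPHA = 0.08
TRUE_XC = 1e13

# Pretend these are parameter counts and final losses of four small models.
small_sizes = np.array([1e6, 1e7, 1e8, 1e9])
small_losses = predict_loss(small_sizes, TRUE_ALPHA, TRUE_XC)

# Fit on the small runs, then extrapolate two orders of magnitude further.
alpha_hat, x_c_hat = fit_power_law(small_sizes, small_losses)
target_size = 1e11
predicted = float(predict_loss(target_size, alpha_hat, x_c_hat))

print(f"fitted alpha = {alpha_hat:.4f}")
print(f"predicted loss at {target_size:.0e} parameters = {predicted:.4f}")
```

With real training runs the measured losses are noisy and the other two factors can become the bottleneck, so an extrapolation like this is only as trustworthy as the assumption that the same power law continues to hold.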
These scaling laws became the planning tool behind the largest models that followed, including GPT-3 just months later. They are a major reason the industry confidently invested in ever-larger training runs.