Scaling Laws

A scaling law is an empirical relationship showing that a model gets predictably better as you give it more of three ingredients: parameters (model size), training data, and compute (the amount of calculation spent training it). The striking part is the predictability. Rather than improving in fits and starts, performance follows a smooth mathematical curve, which lets researchers forecast how good a much larger model will be before they spend the money to build it.

The foundational primary source is the 2020 paper “Scaling Laws for Neural Language Models” by Jared Kaplan and colleagues at OpenAI. They found that a model’s loss (its error) “scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.” A power law means that each doubling of a resource cuts the loss by roughly the same percentage, so the gain from the next doubling can be forecast from the earlier ones. That regularity is the empirical justification for the era of building ever-larger models.
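To make the "same percentage per doubling" idea concrete, here is a minimal Python sketch of a pure power law in model size. The function name power_law_loss and the constants n_c and alpha are illustrative assumptions chosen for the sketch, not fitted figures to rely on; the point is only the shape of the curve.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss predicted by a pure power law in model size: L(N) = (n_c / N) ** alpha.

    n_c and alpha are illustrative assumptions, not values taken from this article.
    """
    return (n_c / n_params) ** alpha


# Doubling the parameter count multiplies the loss by the same factor
# (2 ** -alpha) no matter how large the model already is.
for n in (1e8, 2e8, 4e8, 8e8):
    loss = power_law_loss(n)
    ratio = power_law_loss(2 * n) / loss
    print(f"{n:.0e} params -> loss {loss:.3f}; doubling multiplies loss by {ratio:.4f}")
```

With an exponent this small, each doubling trims the loss by about 5 percent. The improvement per doubling is modest but steady, which is why forecasts can be extended across many orders of magnitude.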

A crucial refinement came in 2022 with DeepMind’s “Training Compute-Optimal Large Language Models,” the paper that introduced Chinchilla. Its authors argued that the largest models of the day were badly undertrained: for a fixed compute budget, model size and the amount of training data should be scaled in equal proportion, so that “for every doubling of model size the number of training tokens should also be doubled.” Their 70-billion-parameter Chinchilla, trained on about four times as much data as Gopher for the same compute budget, outperformed much larger models such as the 280-billion-parameter Gopher and the 175-billion-parameter GPT-3. This reframed the goal from “biggest model” to “best-balanced use of compute.”
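The "scale both together" rule can be sketched as a small budget calculator. Two assumptions here are not stated in the text above and are labeled in the code: the common approximation that training compute is about 6 x parameters x tokens, and a rule-of-thumb ratio of 20 training tokens per parameter. The function name compute_optimal_split is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: training compute C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumption: rule-of-thumb ratio, not a figure from this article


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget into (parameters, tokens) at a fixed tokens-per-parameter ratio."""
    # From C = 6 * N * D and D = ratio * N, it follows that N = sqrt(C / (6 * ratio)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Each budget is 4x the previous one, so parameters and tokens each double.
for budget in (1e21, 4e21, 1.6e22):
    n, d = compute_optimal_split(budget)
    print(f"{budget:.1e} FLOPs -> {n:.2e} parameters, {d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the model size and the training data, rather than spending all of the extra compute on a bigger model.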

Why business readers should care: scaling laws are why the AI industry invests so heavily in compute and data. The gains have been forecastable from measured trends rather than speculative. They also explain why “bigger” is not automatically “better”: the Chinchilla result shows that how you balance model size against training data can matter more than raw parameter count, which affects both the cost and the quality of the models you buy.
