Are Emergent Abilities of Large Language Models a Mirage?

“Are Emergent Abilities of Large Language Models a Mirage?” was submitted to arXiv on April 28, 2023 by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo of Stanford. It is a direct rebuttal to the claim, made most prominently in the 2022 paper “Emergent Abilities of Large Language Models” by Wei et al., that large language models gain certain abilities suddenly and unpredictably as they scale. It went on to win an Outstanding Paper award at NeurIPS 2023.

The argument is about measurement, not about the models themselves. The authors show that the apparently sharp jumps in capability reported as “emergent” tend to appear only under nonlinear or discontinuous scoring metrics. Exact-match accuracy on a multi-step task is the standard example: it gives zero credit unless every step is correct. Under such a metric, steady underlying improvement is nearly invisible until the model gets almost every step right, at which point the score leaps. Swap in a smoother, more continuous metric, such as partial credit or token-level error, and the same models show gradual, predictable improvement with scale, with no discontinuity at all.
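The effect is easy to reproduce with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the per-token accuracies and the answer length are made-up values, and it assumes each token is scored independently, so that the chance of an all-correct answer is the per-token accuracy raised to the number of tokens.

```python
def exact_match_accuracy(per_token_accuracy: float, num_tokens: int) -> float:
    """Probability that every one of num_tokens tokens is correct.

    Assumes tokens are right or wrong independently (a simplification).
    """
    return per_token_accuracy ** num_tokens


# Hypothetical per-token accuracies for successively larger models.
# These numbers are illustrative, not measurements from any real model.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
NUM_TOKENS = 10  # assumed length of the required answer

rows = [(p, exact_match_accuracy(p, NUM_TOKENS)) for p in per_token]

for p, em in rows:
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {em:.3f}")
```

With these assumed inputs, per-token accuracy rises in even steps, yet exact-match accuracy stays below 3 percent until per-token accuracy passes 0.70 and exceeds 90 percent at 0.99. The same underlying progress looks gradual on one metric and abrupt on the other.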

They back this up by analyzing the GPT-3 family and by a meta-analysis of BIG-bench tasks, finding that alleged emergent abilities largely evaporate when the metric changes or when the statistics are done more carefully. The deflationary reading is that “emergence” was in significant part a property of how researchers chose to score their benchmarks.

Why business readers should care: the debate is not academic hair-splitting. If capabilities truly appear without warning, planning around AI risk and adoption is much harder; if the jumps are mostly measurement artifacts, capability growth is more forecastable. The honest current position is that the choice of metric matters enormously, and that claims of sudden new abilities deserve scrutiny of how they were measured.

Last verified June 7, 2026