Deep Double Descent: Where Bigger Models and More Data Hurt

“Deep Double Descent: Where Bigger Models and More Data Hurt” was submitted to arXiv on December 4, 2019 by Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever of Harvard and OpenAI. It gave a careful empirical account of a phenomenon that contradicts the textbook story of overfitting.

Classical statistics teaches the bias-variance tradeoff: as a model grows more complex, test error falls, bottoms out at some sweet spot, then rises again as the model starts memorizing noise, tracing a U-shaped curve. Modern deep learning seemed to ignore this, with enormous models generalizing well despite having more parameters than training examples. Double descent reconciles the two. As model size increases, test error follows the classic U up to the “interpolation threshold,” the point where the model is just barely big enough to fit the training data exactly. Test error peaks around that threshold, which makes it the worst place to be. But keep growing the model past it and test error descends a second time, often to a better minimum than the classical sweet spot. The curve has two descents, not one.
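
The peak at the interpolation threshold and the descent beyond it can be reproduced in a toy linear setting. The sketch below is not the paper's setup, which studies deep networks. It assumes Gaussian inputs, a linear target whose signal is spread evenly over 200 weak features, and a minimum-norm least-squares fit that only sees the first p features. The function name `median_test_risk` and all constants (40 training samples, the widths tried, the noise level) are illustrative assumptions. Because every feature is weak in this toy, the under-parameterized branch mostly shows the climb toward the peak rather than a clear sweet spot.

```python
import numpy as np

def median_test_risk(n_train=40, n_features=200, noise_std=0.5,
                     widths=(5, 20, 40, 80, 200), n_trials=50, seed=0):
    """Median test risk of least squares restricted to the first p features.

    Toy assumptions: x ~ N(0, I), y = x @ beta + noise, and a unit-norm beta
    with equal weight on every feature.
    """
    rng = np.random.default_rng(seed)
    beta = np.full(n_features, 1.0 / np.sqrt(n_features))
    risks = {p: [] for p in widths}
    for _ in range(n_trials):
        X = rng.standard_normal((n_train, n_features))
        y = X @ beta + noise_std * rng.standard_normal(n_train)
        for p in widths:
            # For p > n_train, lstsq returns the minimum-norm interpolating solution.
            coef, *_ = np.linalg.lstsq(X[:, :p], y, rcond=None)
            beta_hat = np.zeros(n_features)
            beta_hat[:p] = coef
            # With isotropic Gaussian test inputs, the expected squared test
            # error is ||beta - beta_hat||^2 plus the noise variance.
            risks[p].append(np.sum((beta - beta_hat) ** 2) + noise_std ** 2)
    # The median is used because the risk at p == n_train is heavy-tailed.
    return {p: float(np.median(r)) for p, r in risks.items()}

if __name__ == "__main__":
    for p, risk in median_test_risk().items():
        print(f"p = {p:3d}  median test risk = {risk:.2f}")
```

With 40 training samples the interpolation threshold sits at p = 40. Running the script should show the risk at p = 40 far above its neighbors, and the risk at p = 200 below every smaller width tried.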

The authors showed that the same pattern appears not just as a function of model size but also over training time (epoch-wise double descent) and, strikingly, with respect to data: in certain regimes, adding more training data near the threshold can actually hurt test performance. They introduced a notion of “effective model complexity” to unify these axes, defined roughly as the largest number of training samples that a given training procedure can fit to near-zero training error.
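
Effective model complexity can be made concrete with a small search: for a fixed training procedure, find the largest sample count n at which average training error stays at or below a tolerance. The sketch below does this for a deliberately simple procedure, least squares on p Gaussian features with pure-noise labels, where the answer should simply be p. The function names, the tolerance `eps`, and the data distribution are all illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def mean_train_error(n, p, rng, n_trials=20):
    """Average training MSE of least squares with p features on n samples."""
    errors = []
    for _ in range(n_trials):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # labels are pure noise in this toy distribution
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        errors.append(np.mean((X @ coef - y) ** 2))
    return float(np.mean(errors))

def effective_model_complexity(p, eps=1e-8, max_n=60, seed=0):
    """Largest n <= max_n at which the procedure fits the data to error <= eps."""
    rng = np.random.default_rng(seed)
    emc = 0
    for n in range(1, max_n + 1):
        if mean_train_error(n, p, rng) <= eps:
            emc = n
    return emc

if __name__ == "__main__":
    for p in (5, 15, 30):
        print(f"p = {p:2d}  estimated EMC = {effective_model_complexity(p)}")
```

In the paper's framing, the critical regime is where this quantity is close to the actual number of training samples. Anything that raises it (a larger model, longer training) or changes the sample count can therefore move a setup toward or away from the peak.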

Double descent helped explain why the field’s instinct to build ever-larger models was not the disaster classical theory predicted: you want to be well past the interpolation threshold, in the second descent, not balanced at the old sweet spot. It sits alongside scaling laws and grokking as part of the modern, still-incomplete understanding of when and why overparameterized networks generalize.

