AI Models Collapse When Trained on Recursively Generated Data

“AI models collapse when trained on recursively generated data,” by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal, appeared in Nature (volume 631, issue 8022, pages 755-759) in July 2024. An earlier preprint circulated in 2023 under the title “The Curse of Recursion: Training on Generated Data Makes Models Forget.” The paper studies what happens when generative models are trained on the output of previous generative models, an increasingly realistic scenario as AI-generated text and images fill the web that future models scrape.

The central finding is a degenerative process the authors call model collapse: “indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear.” Rare events, unusual phrasings, and minority patterns are sampled less often with each successive model generation, until later models converge on a narrow, low-variety approximation of the data that no longer reflects the true distribution. The authors show the effect is not specific to language models (it also appears in variational autoencoders and Gaussian mixture models), which suggests it is a general property of learning from one’s own approximations.
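
To make the mechanism concrete, here is a minimal toy sketch in Python. It is not the authors' experimental setup. It assumes the simplest possible "model", a single Gaussian fitted by sample mean and standard deviation, and trains each generation only on samples drawn from the previous generation's fit. The function name simulate_collapse and its parameters (generations, sample_size, seed) are illustrative choices, not taken from the paper.

```python
import random
import statistics


def simulate_collapse(generations=1000, sample_size=10, seed=0):
    """Refit a 1-D Gaussian to its own samples, generation after generation.

    Toy assumption: the "model" is just a (mean, std) pair. Generation 0 is
    the true distribution N(0, 1); every later generation sees only data
    sampled from the generation before it.
    Returns a list of (mean, std) tuples, one per generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "real" data distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Generate synthetic training data from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model on that synthetic data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 100, 1000):
        mean, std = history[gen]
        print(f"generation {gen:4d}: mean={mean:+.4f} std={std:.3e}")
```

In this toy setting the fitted standard deviation tends to shrink toward zero over many generations: sampling noise in each refit compounds, and because no fresh real data re-enters the loop, the lost spread is never recovered. Larger sample sizes slow the drift but do not remove it.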

The practical implication the paper stresses is the rising value of genuinely human-produced data. As the open web becomes saturated with synthetic content, the provenance of training data, meaning whether it traces back to real human activity or to earlier model output, becomes a determinant of model quality.

Why business readers should care: model collapse is a concrete reason that “just scrape more of the internet” stops working once the internet is full of AI output. It puts a premium on access to authentic, human-generated data and on knowing where training data actually comes from, which turns data provenance into a competitive asset.
