The Data Wall

The “data wall” is the idea that large language models are on track to run out of the thing they are most hungry for: high-quality human-written text. Modern models improve partly by training on ever more data, but the supply of public human text is finite. The data wall is the point at which demand catches up to supply.

Epoch AI researchers Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn put numbers on it in an analysis first released in 2022 and substantially updated on June 6, 2024. They estimate “the effective stock of quality and repetition adjusted human-generated public text for AI training at around 300 trillion tokens,” and project that models will fully use this stock “between 2026 and 2032, or even earlier if intensely overtrained.” Aggressive overtraining, meaning training a model on more tokens than is compute-optimal for its size, could pull the date forward to 2025-2028.
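The arithmetic behind a projection like this can be sketched in a few lines. The example below is a minimal illustration, not Epoch AI's model: it assumes the largest training dataset grows by a constant factor each year and asks when it would reach the 300 trillion token stock quoted above. The starting dataset size, starting year, and growth factors are placeholder assumptions chosen only to show the calculation, and the function name `years_until_exhausted` is hypothetical.

```python
import math


def years_until_exhausted(stock_tokens, current_tokens, annual_growth):
    """Years until a dataset growing `annual_growth`x per year reaches `stock_tokens`.

    Solves current_tokens * annual_growth**t = stock_tokens for t.
    """
    if current_tokens >= stock_tokens:
        return 0.0  # the stock is already fully used
    if annual_growth <= 1.0:
        return math.inf  # a dataset that does not grow never reaches the stock
    return math.log(stock_tokens / current_tokens) / math.log(annual_growth)


STOCK_TOKENS = 300e12  # effective stock estimate quoted in the text
START_YEAR = 2024      # assumption: illustrative starting year
START_TOKENS = 15e12   # assumption: illustrative size of today's largest dataset

# Assumption: candidate yearly growth factors, not figures from the analysis.
for growth in (2.0, 3.0, 4.0):
    years = years_until_exhausted(STOCK_TOKENS, START_TOKENS, growth)
    print(f"{growth:.1f}x per year -> stock reached around {START_YEAR + years:.1f}")
```

Because the crossing time depends on the logarithm of the ratio between stock and current usage, even a large error in the stock estimate shifts the date by only a year or two, while the assumed growth factor moves it much more. That sensitivity is one reason the projected window is stated as a range rather than a single year.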

The data wall reframes the future of AI scaling. If simply adding more human text stops being an option, progress has to come from elsewhere: synthetic data, better data efficiency, transfer from data-rich domains like code and video, or new architectures. Each path has risks; synthetic data, for instance, raises concerns about model collapse. For a business reader, the data wall explains why so much current research and dealmaking is about data access, licensing, and synthetic generation: the easy era of free, abundant training text is ending.
