“Deduplicating Training Data Makes Language Models Better” was submitted to arXiv on July 14, 2021 by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. The paper documents a problem hiding in plain sight: standard language-modeling datasets are riddled with near-duplicate examples and long repeated substrings. As a striking example, the authors found that C4, a widely used corpus, contained a single 61-word English sentence repeated more than 60,000 times.
These duplicates have real consequences. The authors show that over 1 percent of the unprompted output of models trained on such data is copied verbatim from the training set. After building tools to deduplicate the corpora, they trained models that emitted memorized text roughly ten times less often and reached the same or better accuracy in fewer training steps. Deduplication also reduced train-test overlap, which affected more than 4 percent of the validation sets of standard datasets and, left unaddressed, made evaluation results misleading.
The finding helped establish deduplication as a routine and important step in modern data pipelines, and it directly shaped later web-scale datasets that rely heavily on aggressive deduplication. For a business reader, the takeaway is that data quality, not just data quantity, drives model behavior, and that carelessly assembled corpora quietly turn into both privacy and accuracy problems.