The Data Provenance Initiative

“The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing and Attribution in AI” was posted to arXiv on October 25, 2023 by Shayne Longpre and 16 co-authors from a mix of legal and machine learning backgrounds. The project set out to do something the field had largely skipped: systematically trace where popular training datasets actually come from, who made them, and what license terms govern their use.

The team audited more than 1,800 text datasets, building tools and standards to follow each one from source through creators, license conditions, and downstream use. A central finding was pervasive license misattribution: datasets were being shared and reused under terms that did not match their true licensing. On widely used dataset hosting sites, the paper reports license omission rates above 70% and error rates above 50%, which undermines the assumption that training data was lawfully and clearly licensed. The landscape analysis also showed sharp divides between commercially open and closed datasets, with closed collections dominating categories like lower-resource languages, creative tasks, and newer or synthetic data.

The work matters because so much of the AI training stack rests on shaky paperwork. If the provenance and licensing of the data are wrong or unknown, every model built on it inherits that uncertainty. For organizations adopting AI, the Data Provenance Initiative is a direct argument for demanding clear data lineage, the same way they would for any other supply chain that carries legal exposure.

Last verified June 7, 2026