DataComp

DataComp was introduced in “DataComp: In search of the next generation of multimodal datasets,” posted to arXiv on April 27, 2023 by Samir Gadre and a large group of co-authors. It inverts the usual machine learning competition. Normally researchers fix the dataset and compete on model architecture and training; DataComp fixes the training code and the model, and asks participants to compete on the data: which examples to keep, how to filter them, and what sources to add.

The benchmark centers on a candidate pool of 12.8 billion image-text pairs drawn from Common Crawl. Participants design filtering or curation strategies, then evaluate the resulting dataset by training a CLIP model with the standardized code and testing it on 38 downstream tasks. The benchmark runs at multiple scales, with candidate pools ranging from 12.8 million up to 12.8 billion samples, so small teams can experiment cheaply. As a baseline, the authors’ own simple filtering produced DataComp-1B, a 1.4-billion-pair subset; a CLIP model trained on it reached 79.2 percent zero-shot accuracy on ImageNet.
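To make the workflow concrete, here is a minimal Python sketch of what a filtering strategy amounts to: pick a subset of a candidate pool, then hand that subset to the fixed training pipeline. It is illustrative only and is not the official DataComp tooling. The `Candidate` record, its `score` field (standing in for some precomputed image-text similarity), and the `filter_top_fraction` helper are all hypothetical names chosen for this example. The pool sizes reflect the scales described above, and the tier labels are given as commonly reported.

```python
from dataclasses import dataclass

# Candidate pool sizes at each benchmark scale (number of image-text pairs).
# Tier labels are given as commonly reported; treat them as an assumption.
POOL_SIZES = {
    "small": 12_800_000,
    "medium": 128_000_000,
    "large": 1_280_000_000,
    "xlarge": 12_800_000_000,
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one image-text pair in the pool."""
    uid: str      # hypothetical identifier
    caption: str
    score: float  # hypothetical precomputed image-text similarity


def filter_top_fraction(pool, keep_fraction):
    """Keep the highest-scoring fraction of the pool.

    This is one simple curation rule among many; a participant could
    swap in any other selection logic here.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(pool) * keep_fraction))
    ranked = sorted(pool, key=lambda c: c.score, reverse=True)
    return ranked[:n_keep]


# Toy pool of four pairs with made-up scores.
toy_pool = [
    Candidate("a", "a photo of a dog", 0.1),
    Candidate("b", "golden retriever on grass", 0.9),
    Candidate("c", "click here to buy", 0.5),
    Candidate("d", "IMG_0042.jpg", 0.3),
]

subset = filter_top_fraction(toy_pool, keep_fraction=0.5)
kept_uids = [c.uid for c in subset]
```

In the benchmark itself, only the selection step changes between submissions; training and evaluation stay fixed, so any difference in downstream accuracy can be attributed to the data.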

DataComp helped formalize the data-centric view that better datasets, not just bigger models, drive progress, and it gave the field a shared way to measure that claim. For a business reader, it underscores a shift in where competitive advantage in AI is moving: from model tweaks toward disciplined, measurable data curation.
