Dolma: An Open Corpus of Three Trillion Tokens

“Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research” was submitted to arXiv on January 31, 2024 by Luca Soldaini and 35 co-authors at the Allen Institute for AI (AI2). Dolma is the pretraining dataset behind AI2’s OLMo language models, and it was released to fill a specific gap: while many models are open-weight, the data that shaped them is almost always secret. As the paper notes, “commercial models rarely detail their data.”

Dolma is a three-trillion-token English corpus drawn from a mix of web content, scientific papers, code, public-domain books, social media, and encyclopedic material. The authors did not just publish the data; they documented its design principles and construction in detail and open-sourced the curation toolkit used to produce it, so that others can audit the choices and reproduce or modify the pipeline.

The point of Dolma is reproducible science. If researchers cannot see the training data, they cannot study how data choices affect model behavior, bias, or capability. By making the corpus and the tools that built it fully open, AI2 turned pretraining data from a trade secret into a shared research object. For organizations weighing open versus closed models, Dolma is a concrete example of what genuine data transparency looks like, and a reminder of how few large models offer it.
