RedPajama Reproduces the LLaMA Training Data

On April 17, 2023, Together announced RedPajama, an effort to build fully open large language models that began by reproducing the training dataset behind Meta’s LLaMA. The LLaMA paper described its data recipe, but Meta did not release the dataset itself, so RedPajama set out to recreate “the LLaMA training dataset of over 1.2 trillion tokens” from public sources. The project was a collaboration among Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute.

The released dataset mirrors LLaMA’s described mixture across seven slices: CommonCrawl (878 billion tokens), C4 (175 billion), GitHub (59 billion), Books (26 billion), arXiv (28 billion), Wikipedia (24 billion), and StackExchange (20 billion). Both the full corpus and a smaller sample were published on Hugging Face, with the preparation scripts released under Apache 2.0, so the recipe was open as well as the data.
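
As a quick check on the figures above, the following minimal Python sketch sums the seven slices and computes each slice's share of the mixture. The token counts are the ones quoted in this article; the names `SLICES_B` and `mixture_shares` are illustrative only and are not part of any RedPajama tooling.

```python
# Token counts per slice, in billions, as quoted in the article.
SLICES_B = {
    "CommonCrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "arXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def mixture_shares(slices):
    """Return each slice's fraction of the total token count."""
    total = sum(slices.values())
    return {name: count / total for name, count in slices.items()}


total_b = sum(SLICES_B.values())
shares = mixture_shares(SLICES_B)

# Print slices from largest to smallest share, then the total.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{SLICES_B[name]:>5}B  {share:6.1%}")
print(f"{'Total':<14}{total_b:>5}B")
```

The slices sum to 1,210 billion tokens, consistent with the "over 1.2 trillion" headline figure. Web-crawl data (CommonCrawl plus C4) accounts for roughly 87 percent of the mixture.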

RedPajama mattered because it removed a key barrier to open AI: even when model weights are leaked or released, the training data usually is not, leaving outsiders unable to study or rebuild a model from scratch. By reconstructing a frontier-scale training set in the open, RedPajama gave the research community a transparent foundation and seeded later open datasets and models. For a business reader, it is part of the broader story of open alternatives narrowing the gap with closed, proprietary AI.
