The Pile is an 825 GiB dataset built from 22 sources

fact

The Pile, released by EleutherAI, is an 825 GiB English text corpus, introduced in the paper “The Pile: An 825 GiB Dataset of Diverse Text for Language Modeling.” The paper states that it is “constructed from 22 diverse high-quality subsets,” some existing and some newly constructed, combining sources such as web text, books, code, and academic papers into a single documented training dataset.
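The size is quoted in GiB (binary gibibytes, 2^30 bytes each), which is easy to confuse with decimal GB (10^9 bytes). As a minimal sketch of the arithmetic, assuming the rounded headline figure of 825 GiB, the conversion looks like this. The helper names are illustrative and do not come from the paper.

```python
# Unit sizes in bytes: a gibibyte is binary, a gigabyte is decimal.
GIB = 2**30
GB = 10**9


def gib_to_bytes(gib):
    """Convert a size in GiB to a whole number of bytes."""
    return int(gib * GIB)


def gib_to_gb(gib):
    """Convert a size in GiB to decimal gigabytes."""
    return gib * GIB / GB


# Assumption: the rounded headline figure of 825 GiB.
pile_gib = 825
print(gib_to_bytes(pile_gib))         # 885837004800
print(round(gib_to_gb(pile_gib), 1))  # 885.8
```

So 825 GiB is roughly 886 GB in decimal units, a difference of about 7 percent from the binary figure.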

Sources

PRIMARY https://arxiv.org/abs/2101.00027

Last verified June 6, 2026
