The Pile, released by EleutherAI, is introduced in the paper “The Pile: An 825 GiB Dataset of Diverse Text for Language Modeling” as an 825 GiB English text corpus. According to the paper, it is “constructed from 22 diverse high-quality subsets,” some existing and some newly constructed, and it combines sources such as web text, books, code, and academic papers into a single documented training dataset.