When OpenAI trained GPT-2 it used a corpus it called WebText, scraped from web pages linked from Reddit posts, but it never released the data or the code that built it. To let outside researchers study and reproduce GPT-2, Aaron Gokaslan and Vanya Cohen, then master’s students at Brown University, built an open replica they named OpenWebText. The project page describes it as “an open source effort to reproduce OpenAI’s WebText dataset.”
The construction mirrored the recipe OpenAI had described. The team extracted post URLs from a public Reddit submissions dataset, deduplicated them, filtered them for HTML content, downloaded the pages in parallel, and extracted article text with the newspaper Python package. They dropped non-English pages with Facebook’s FastText, removed near-duplicate documents whose similarity exceeded 0.5 (detected with locality-sensitive hashing), and discarded documents shorter than 128 tokens. The result is “38GB of text data (40GB using SI units) from 8,013,769 documents.”
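To make the last two filtering steps concrete, here is a minimal sketch in Python. It is not the project's code. It assumes simple word tokens in place of whatever tokenizer the team used, assumes word 5-gram shingles, and swaps in exact pairwise Jaccard similarity for locality-sensitive hashing, which approximates the same comparison at a scale where checking every pair is impractical. The function names are hypothetical; only the 128-token minimum and the 0.5 threshold come from the description above.

```python
import re

MIN_TOKENS = 128            # from the text: shorter documents are discarded
SIMILARITY_THRESHOLD = 0.5  # from the text: near-duplicate cutoff
SHINGLE_SIZE = 5            # assumption: word 5-gram shingles


def tokenize(text):
    """Split text into lowercase word tokens (assumed stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, n=SHINGLE_SIZE):
    """Return the set of overlapping n-token windows in a document."""
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_corpus(docs, min_tokens=MIN_TOKENS, threshold=SIMILARITY_THRESHOLD):
    """Keep documents that are long enough and not near-duplicates.

    Uses exact pairwise comparison, which is quadratic in the number of
    documents; LSH exists to avoid exactly this cost on a real corpus.
    """
    kept, kept_shingles = [], []
    for text in docs:
        tokens = tokenize(text)
        if len(tokens) < min_tokens:
            continue  # too short
        doc_shingles = shingles(tokens)
        if any(jaccard(doc_shingles, seen) > threshold for seen in kept_shingles):
            continue  # near-duplicate of a document already kept
        kept.append(text)
        kept_shingles.append(doc_shingles)
    return kept
```

Given a long page, a copy of it with a few words appended, an unrelated long page, and a one-line stub, `filter_corpus` keeps only the first and third: the copy exceeds the similarity threshold and the stub falls under the token minimum.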
OpenWebText became a widely reused training and benchmarking corpus precisely because it was open where the original was not. Its existence is a small lesson in the economics of AI transparency: when a lab withholds its data, the research community will often reconstruct an approximation, which then takes on a life of its own. For a business reader, it is a reminder that “proprietary training data” is rarely as defensible a moat as it sounds.