The RefinedWeb Dataset for Falcon LLM

“The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only” was submitted to arXiv on June 1, 2023 by Guilherme Penedo and colleagues at the Technology Innovation Institute (TII). It challenged the common assumption that high-quality language models need curated sources (books, academic papers, and hand-selected corpora) mixed into their training data. The authors showed that properly filtered and deduplicated Common Crawl web data, used alone, could train models that significantly outperformed models trained on The Pile.

To make the case, the team extracted roughly five trillion tokens from Common Crawl through an aggressive filtering and deduplication pipeline, and trained 1.3-billion- and 7.5-billion-parameter models on the resulting data. RefinedWeb became the backbone of TII’s Falcon models, and the team publicly released a 600-billion-token extract of the dataset. The paper’s broader lesson, that the filtering pipeline can matter more than the source mix, shaped later large-scale web dataset efforts.
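
As a rough illustration of what a filter-then-deduplicate pass looks like, the Python sketch below applies two toy quality heuristics and then drops exact duplicates by hashing normalized text. It is not the paper's pipeline: the function names, the thresholds, and the heuristics themselves are assumptions chosen for brevity, and a web-scale system would need considerably more than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_quality_filter(text: str, min_words: int = 5,
                          max_symbol_ratio: float = 0.3) -> bool:
    """Toy heuristic (assumed, not from the paper): require a minimum word
    count and reject text dominated by non-alphanumeric characters."""
    words = text.split()
    if not words or len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    symbols = sum(1 for c in non_space if not c.isalnum())
    return symbols / len(non_space) <= max_symbol_ratio


def filter_and_deduplicate(docs):
    """Keep documents that pass the filter, dropping exact duplicates
    (after normalization) and preserving first-seen order."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_quality_filter(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Filtering before hashing is a deliberate ordering in this sketch: low-quality documents are discarded cheaply before any deduplication state is stored. Exact hashing only catches documents that are identical after normalization; catching near-duplicates would require a fuzzy method, which is beyond this example.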

The result fed an ongoing debate about training data: if scrubbed web text suffices, the value of curated and licensed corpora becomes a matter of legal and quality margins rather than necessity, with direct implications for copyright disputes over training data.
