C4 (Colossal Clean Crawled Corpus)

C4, the Colossal Clean Crawled Corpus, was created by Google researchers for the 2019 T5 paper “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel and colleagues. It is a cleaned-up subset of Common Crawl, a freely available archive of the open web released in roughly monthly snapshots. Raw Common Crawl is enormous but noisy, full of menus, boilerplate, error pages, and gibberish; C4 was the team’s attempt to turn it into something usable for training language models.

The cleaning relied on a stack of simple heuristics: keep only lines ending in terminal punctuation, drop pages with too few sentences, discard pages containing words from a blocklist, remove obvious code and placeholder text, deduplicate repeated passages, and keep only pages automatically detected as English. The resulting standard English variant is roughly 305 GB of text. Several variants were also released: a version without the blocklist filter, a “no-clean” version of about 2.3 TB, a news-like subset, and a 108-language multilingual edition (mC4).
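The heuristics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build C4: the function names, the sentence threshold, the one-word blocklist, and the code and placeholder markers are all hypothetical stand-ins. Language detection is omitted, and deduplication is simplified to exact line matching.

```python
import re
from typing import List, Optional

# Every value below is an illustrative assumption, not a setting taken from C4.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_SENTENCES = 3                        # hypothetical threshold
BLOCKLIST = {"badword"}                  # hypothetical stand-in for the real blocklist
PLACEHOLDER_MARKERS = ("lorem ipsum",)   # placeholder text
CODE_MARKERS = ("{",)                    # crude signal that a page contains code


def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the page should be dropped."""
    lowered = text.lower()
    if any(marker in lowered for marker in PLACEHOLDER_MARKERS):
        return None
    if any(marker in text for marker in CODE_MARKERS):
        return None
    # Drop the whole page if any blocklisted word appears.
    if set(re.findall(r"[a-z']+", lowered)) & BLOCKLIST:
        return None

    # Keep only lines that end in terminal punctuation.
    kept = [line.strip() for line in text.splitlines()
            if line.strip().endswith(TERMINAL_PUNCTUATION)]
    cleaned = "\n".join(kept)

    # Rough sentence count: runs of sentence-ending punctuation.
    if len(re.findall(r"[.!?]+", cleaned)) < MIN_SENTENCES:
        return None
    return cleaned


def deduplicate(pages: List[str]) -> List[str]:
    """Keep only the first occurrence of each line across all pages."""
    seen = set()
    result = []
    for page in pages:
        unique = []
        for line in page.splitlines():
            if line not in seen:
                seen.add(line)
                unique.append(line)
        if unique:  # drop pages left empty after deduplication
            result.append("\n".join(unique))
    return result
```

In this sketch each filter is a cheap string test, which is what makes a recipe of this kind practical at web scale. It also shows why blocklist filtering is blunt: a single matching word drops the whole page regardless of context.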

Because Google released C4 publicly, it became one of the most widely reused open pre-training datasets, a common ingredient or baseline alongside The Pile. Its filtering recipe also became a reference point, both for imitation and for critique, since blocklist filtering was later shown to remove disproportionate amounts of text about and by marginalized groups. C4 sits in a lineage that runs from Common Crawl through The Pile to FineWeb: the ongoing effort to convert the raw web into clean, documented training data.
