Common Crawl has collected petabytes of web data since 2008

Common Crawl’s overview page describes its corpus as containing “petabytes of data, regularly collected since 2008.” The nonprofit hosts this web archive on Amazon Web Services and makes it freely available to download or analyze in the cloud. This corpus is one of the most common starting points for the text used to train large language models.

Last verified June 6, 2026