Common Crawl has collected petabytes of web data since 2008

fact

Common Crawl’s overview page describes its corpus as containing “petabytes of data, regularly collected since 2008.” The nonprofit hosts this web archive on Amazon Web Services and makes it freely available to download or analyze in the cloud. This corpus is one of the most common starting points for the text used to train large language models.

Sources

PRIMARY https://commoncrawl.org/overview

Last verified June 6, 2026
