LAION-5B contains 5.85 billion image-text pairs

The LAION-5B paper describes the dataset as “consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.” The remaining roughly 3.5 billion pairs are not all non-English text. About 2.26 billion are in more than 100 other languages, and about 1.27 billion have captions whose language could not be reliably detected. The pairs were scraped from web pages in Common Crawl. "CLIP-filtered" means a pair was kept only if CLIP, an image-text model, embedded the image and its caption close enough together that their cosine similarity cleared a threshold. Subsets of LAION-5B were used to train open image models, including Stable Diffusion.
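The filtering criterion can be illustrated as a cosine-similarity threshold over paired embeddings. This is a minimal sketch, not LAION's actual pipeline. The three-dimensional toy vectors stand in for real CLIP image and text embeddings, which have hundreds of dimensions. The helper names `cosine_similarity` and `clip_filter`, the sample captions, and the 0.28 default threshold are all hypothetical choices made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_filter(pairs, threshold=0.28):
    """Keep (image_emb, text_emb, caption) triples whose embeddings are similar enough.

    Hypothetical helper: the embeddings are assumed to come from a CLIP image
    encoder and text encoder, and the 0.28 default is an illustrative assumption.
    """
    return [(img, txt, caption) for img, txt, caption in pairs
            if cosine_similarity(img, txt) >= threshold]

# Toy stand-ins for CLIP embeddings (real ones are much higher-dimensional).
pairs = [
    ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], "a red apple"),              # well matched
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], "click here to subscribe"),  # unrelated
]

kept = clip_filter(pairs)
print([caption for _, _, caption in kept])  # → ['a red apple']
```

Only the first pair survives because its vectors point in nearly the same direction (similarity about 0.99), while the second pair's vectors are orthogonal (similarity 0).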

Sources

Last verified June 6, 2026