LAION-5B was a hugely influential open dataset of roughly 5.85 billion image-text pairs scraped from the web, used to train popular image generators including the Stable Diffusion models. On December 20, 2023, the Stanford Internet Observatory published a report by David Thiel, “Identifying and Eliminating CSAM in Generative ML Training Data and Models,” that examined the dataset for child sexual abuse material (CSAM).
Using a combination of PhotoDNA perceptual hash matching, cryptographic hash matching, nearest-neighbor queries, and machine-learning classifiers, the study “detected many hundreds of instances of known CSAM in the training set, as well as many new candidates that were subsequently verified by outside parties.” The report recommended ways to handle existing copies of the data, build future datasets more safely, and modify models already trained on the contaminated corpus. In response, LAION took its datasets offline pending review and later republished a cleaned version.
This belongs in the cautionary file because it is a stark failure of the web-scraping pipeline. A dataset assembled by indiscriminately crawling the open internet had absorbed illegal and deeply harmful material that nobody had vetted, and only an outside audit caught it, after widely used models had already been trained on the data. For a business reader, the episode is a blunt warning: “scraped from the public web” is not a guarantee of safety or legality, and models inherit whatever their training data contains, including content no one would knowingly include.