Conceptual Captions

Conceptual Captions was announced by Google AI on September 5, 2018, in a post by Piyush Sharma and Radu Soricut, and is described in a paper published at ACL 2018. It is a dataset of roughly 3.3 million image-caption pairs. Its distinguishing feature is how it was built: rather than paying annotators to describe images by hand, the team automatically harvested captions from the alt-text HTML attributes that web authors attach to images.

Turning raw web alt-text into usable captions took heavy filtering. Google screened for image quality, removed undesirable content, kept only descriptive captions, and used image classifiers to check that the text actually matched the picture. It also generalized proper names into concepts (for example, replacing a specific celebrity name with “actor”) so that models could learn from the captions more easily. The pipeline started from about one billion English web pages containing over five billion candidate images, and it rejected 99.94 percent of those candidates to reach the final set.
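Two steps in that paragraph are easy to make concrete: the name-to-concept substitution and the arithmetic behind the rejection rate. The following is a minimal Python sketch, not Google's actual pipeline. The `HYPERNYMS` table, the function names `generalize_names` and `surviving_candidates`, and the sample caption are all hypothetical and chosen only for illustration; a hand-written dictionary stands in for whatever entity-recognition tooling the real system used.

```python
import re

# Hypothetical lookup table (an assumption for this sketch): a real
# pipeline would resolve names automatically rather than by hand.
HYPERNYMS = {
    "Harrison Ford": "actor",
    "Los Angeles": "city",
}


def generalize_names(caption, hypernyms=HYPERNYMS):
    """Replace each known proper name in the caption with its generic concept."""
    # Process longer names first so overlapping names are handled predictably.
    for name in sorted(hypernyms, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        caption = re.sub(pattern, hypernyms[name], caption)
    return caption


def surviving_candidates(candidates, rejection_rate):
    """Return how many candidates remain after rejecting the given fraction."""
    return candidates * (1.0 - rejection_rate)


# Toy caption invented for this example.
print(generalize_names("Harrison Ford arrives at a premiere in Los Angeles"))

# Five billion candidates with 99.94 percent rejected.
print(round(surviving_candidates(5e9, 0.9994)))
```

Run on exactly five billion candidates, the second calculation leaves roughly three million survivors. Because the pipeline started from "over" five billion candidates, that is consistent with the final figure of about 3.3 million pairs.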

Conceptual Captions contained an order of magnitude more captioned images than the hand-curated MS-COCO caption set and offered far more variety because it drew from across the open web. It became a standard pretraining corpus for vision-language models and helped establish web alt-text as a scalable source of image-text supervision, the same basic idea later taken to billions of pairs by datasets like LAION. For a business reader, it shows how well-designed automated filtering can replace expensive manual labeling at scale.
