OpenAI releases CLIP

On January 5, 2021, OpenAI released CLIP (Contrastive Language-Image Pre-training) alongside the original DALL-E. The accompanying paper, “Learning Transferable Visual Models From Natural Language Supervision” by Alec Radford and colleagues, was submitted to arXiv on February 26, 2021. CLIP was trained on “a dataset of 400 million (image, text) pairs collected from the internet,” learning to predict which caption goes with which image.

Rather than training on a fixed set of labeled categories the way earlier vision systems did, CLIP learned visual concepts from the words people naturally use to describe pictures. This let it perform zero-shot transfer across more than 30 computer-vision datasets without task-specific training, matching a strong supervised baseline on ImageNet without using any of that benchmark’s 1.28 million training examples.

CLIP mattered because it provided the shared image-and-text representation that much of modern multimodal AI is built on. DALL-E 2’s image generator was guided by CLIP, and the broader idea of aligning pictures and language in one space underpins the vision capabilities of today’s frontier models. The multimodality concept entry in this library treats CLIP as a foundational primary; this entry records the milestone-shaped event of its release.

Sources

Last verified June 6, 2026