Visual Genome

Visual Genome was introduced in “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” submitted to arXiv on February 23, 2016 by Ranjay Krishna and colleagues, with Michael Bernstein and Fei-Fei Li as senior authors. Where earlier datasets like ImageNet asked a model to name what is in a picture, Visual Genome was designed to support reasoning about how the contents relate to one another, so that a model can answer questions like “what vehicle is the person riding?” rather than just label objects.

The dataset contains 108,077 images, each carrying unusually dense annotations: on average roughly 21 objects, 18 attributes, and 18 pairwise relationships per image, all gathered through crowdsourcing. Crucially, the annotations are canonicalized to WordNet synsets, so a relationship such as “pulling(horse, carriage)” is grounded in a shared vocabulary. The authors described it as the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs of its time.
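To make the structure of these annotations concrete, here is a minimal sketch of how one image's objects, attributes, and relationships could be held in memory as a scene graph. It is an illustration only and does not reproduce the dataset's actual file format. The class names, field names, and the objects_of helper are hypothetical; the attribute values are invented for the example; and the synset strings are placeholders written in WordNet's name.pos.nn style, not values checked against the dataset.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    name: str               # free-text label, as an annotator might write it
    synset: str             # placeholder WordNet-style id (not verified)
    attributes: tuple = ()  # e.g. ("brown",); invented for this example


@dataclass(frozen=True)
class Relationship:
    predicate: str
    subject: SceneObject
    obj: SceneObject

    def __str__(self):
        # Render in the predicate(subject, object) notation used above.
        return f"{self.predicate}({self.subject.name}, {self.obj.name})"


@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def objects_of(self, subject_name, predicate):
        """Names of objects that `subject_name` stands in `predicate` to."""
        return [
            r.obj.name
            for r in self.relationships
            if r.subject.name == subject_name and r.predicate == predicate
        ]


# Hypothetical annotations for a single image.
man = SceneObject("man", "man.n.01")
horse = SceneObject("horse", "horse.n.01", ("brown",))
carriage = SceneObject("carriage", "carriage.n.02")

graph = SceneGraph(
    objects=[man, horse, carriage],
    relationships=[
        Relationship("riding", man, carriage),
        Relationship("pulling", horse, carriage),
    ],
)

# "What vehicle is the person riding?" becomes a lookup over relationships.
print(graph.objects_of("man", "riding"))
print(graph.relationships[1])
```

In this sketch a question about the scene reduces to filtering relationship triples, which is the kind of reasoning that object labels alone cannot support. Storing a synset next to each free-text name is what would let differently worded annotations be matched against one another.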

Visual Genome became foundational for scene-graph generation, visual question answering, and image captioning, and its structured relationship data fed many later vision-language systems. For a general reader, it marks an important shift in what we ask machines to do with images: not just recognize the parts, but understand how they fit together into a scene.

Last verified June 7, 2026