VQA: Visual Question Answering

The VQA paper, posted in May 2015 by Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, and Devi Parikh, defined the task of free-form, open-ended Visual Question Answering: given an image and a natural-language question about it, produce a natural-language answer. It set the agenda for vision-language research for years.

VQA deliberately demanded more than captioning. A caption describes the obvious foreground; a question can probe anything - the color of a small object, the number of people, what might happen next, or commonsense about the scene. Answering well therefore required detailed image understanding plus reasoning, and often knowledge beyond the pixels. The authors paired the task with a large dataset - roughly 0.25 million images, 0.76 million questions, and about 10 million answers (multiple humans answered each question) - and designed it so answers were short enough to evaluate automatically, with a multiple-choice variant alongside the open-ended one.
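To make "short enough to evaluate automatically" concrete, here is a minimal Python sketch of the consensus accuracy rule described in the VQA paper: a predicted answer earns full credit if at least three of the human annotators gave the same answer, and partial credit otherwise. The function name vqa_accuracy and the lowercase-only normalization are illustrative assumptions; the paper's evaluation also strips punctuation and articles and converts number words to digits, which this sketch omits.

```python
from collections import Counter


def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Score one predicted answer against the human answers for a question.

    Implements min(number of humans who gave that answer / 3, 1).
    Normalization here is only strip + lowercase (a simplification).
    """
    def normalize(text: str) -> str:
        return text.strip().lower()

    # Count how many annotators gave each (normalized) answer.
    counts = Counter(normalize(answer) for answer in human_answers)
    matches = counts[normalize(predicted)]  # Counter returns 0 if absent
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    # Hypothetical annotations for "What color is the umbrella?"
    answers = ["red"] * 5 + ["dark red"] * 3 + ["maroon"] * 2
    for guess in ("Red", "maroon", "blue"):
        print(guess, vqa_accuracy(guess, answers))
```

Because credit depends on exact string agreement with annotators, the rule only works when answers are a word or two long, which is why the dataset favors short answers.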

The task framing endured: VQAv2, a 2017 revision that rebalanced the dataset to reduce language bias, became a standard benchmark, and VQA accuracy remained a headline metric for systems such as Flamingo and BLIP-2 years later.

Why business readers should care: VQA crystallized the goal of machines that answer arbitrary questions about images - the capability behind visual assistants for blind users, document and chart understanding, and the image side of multimodal chatbots. It also set an early example of building a measurable benchmark to drive a whole research direction.

Last verified June 7, 2026