Show and Tell: A Neural Image Caption Generator

Show and Tell, published in November 2014 by Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan at Google, was an early end-to-end neural model for generating natural-language captions of images. It framed captioning as a translation problem: instead of translating French into English, it translated an image into a sentence.

The architecture borrowed directly from neural machine translation. A convolutional neural network reads the image and produces a fixed-length feature vector; that vector is handed to an LSTM recurrent network, which generates the caption one word at a time. The whole pipeline trains end to end to maximize the likelihood of the correct caption. This was a clean break from earlier captioning systems that stitched together hand-built detectors and sentence templates.
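The flow described above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not the paper's implementation: `encode_image`, `next_word_probs`, and the four-word vocabulary are hypothetical stand-ins for the CNN, the LSTM, and a real vocabulary, and the probabilities are hand-written rather than learned. It shows only the control flow (encode the image once, then emit one word at a time) and the quantity that training pushes up, the log-likelihood of the correct caption.

```python
import math

START, END = "<start>", "<end>"


def encode_image(pixels):
    """Stand-in for the CNN encoder: image in, fixed-length feature vector out.

    Hypothetical: a real encoder is a trained convolutional network; here the
    "features" are just the mean and maximum pixel value.
    """
    flat = [p for row in pixels for p in row]
    return (sum(flat) / len(flat), max(flat))


def next_word_probs(features, words_so_far):
    """Stand-in for one LSTM step: P(next word | image, words so far).

    Hypothetical hand-written table keyed on the last word only; a real
    decoder learns these probabilities and carries a recurrent state.
    """
    mean_brightness = features[0]
    if mean_brightness > 0.5:
        after_a = {"dog": 0.8, "cat": 0.2}
    else:
        after_a = {"dog": 0.2, "cat": 0.8}
    table = {
        START: {"a": 0.9, "dog": 0.05, "cat": 0.05},
        "a": after_a,
        "dog": {END: 1.0},
        "cat": {END: 1.0},
    }
    # Unknown words simply end the sentence in this toy model.
    return table.get(words_so_far[-1], {END: 1.0})


def generate_caption(pixels, max_len=10):
    """Greedy decoding: encode once, then pick the most likely word each step."""
    features = encode_image(pixels)
    words = [START]
    while len(words) - 1 < max_len:
        probs = next_word_probs(features, words)
        word = max(probs, key=probs.get)
        if word == END:
            break
        words.append(word)
    return words[1:]  # drop the start token


def caption_log_likelihood(pixels, caption):
    """Sum of log P(word_t | image, earlier words), including the end token.

    Training adjusts the model so this number rises for correct captions.
    """
    features = encode_image(pixels)
    words = [START]
    total = 0.0
    for target in list(caption) + [END]:
        probs = next_word_probs(features, words)
        total += math.log(probs.get(target, 1e-12))  # tiny floor avoids log(0)
        words.append(target)
    return total


bright_image = [[0.9, 0.8], [0.7, 0.6]]
print(generate_caption(bright_image))  # ['a', 'dog']
print(caption_log_likelihood(bright_image, ["a", "dog"]))
```

Greedy decoding is used here only for brevity; keeping several candidate sentences alive at each step (beam search) is a common alternative. In a real system both stand-in functions would be neural networks, and the log-likelihood would be maximized by gradient descent over many image-caption pairs.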

The results were a large jump over earlier systems. On the Pascal benchmark, the BLEU-1 score rose from a prior best of 25 to 59 (human performance is around 69); on Flickr30k, BLEU-1 went from 56 to 66; and on COCO the model reached a BLEU-4 of 27.7, state of the art at the time.

Why business readers should care: Show and Tell is the ancestor of automatic alt-text, photo search by content, and the image-understanding half of today’s multimodal assistants. It established the encoder-decoder pattern, a vision model feeding a language model, which, with attention and transformers later bolted on, became the standard way machines describe what they see.

Last verified June 7, 2026