Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Show, Attend and Tell, published in February 2015 by Kelvin Xu, Jimmy Ba, Ryan Kiros, and co-authors including Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, and Yoshua Bengio, extended neural image captioning with a visual attention mechanism. Where the earlier Show and Tell compressed the whole image into a single vector, this model let the caption generator look at different regions of the image as it produced each word.

The paper introduced two flavors of attention: a soft version, fully differentiable and trained by standard backpropagation, and a hard version, which samples a single image location at each step and is trained stochastically by maximizing a variational lower bound. Both let the model learn, with no supervision of where to look, to fix its gaze on the salient object while emitting the corresponding word. A memorable contribution was visualizing this gaze: the attention maps show the model focusing on a frisbee as it writes the word "frisbee", giving an early, interpretable window into what a network was attending to.
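The difference between the two flavors can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not the paper's implementation: the region features and decoder state are made-up toy values, a plain dot product stands in for the small learned scoring network, and the function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(regions, hidden):
    """Score each image region against the decoder state, then normalize.

    Assumption: a dot product replaces the learned scoring network.
    """
    scores = [sum(r * h for r, h in zip(region, hidden)) for region in regions]
    return softmax(scores)


def soft_context(regions, weights):
    """Soft attention: a weighted average over all regions (differentiable)."""
    dim = len(regions[0])
    return [sum(w * region[k] for w, region in zip(weights, regions))
            for k in range(dim)]


def hard_context(regions, weights, rng):
    """Hard attention: sample one region index according to the weights."""
    index = rng.choices(range(len(regions)), weights=weights, k=1)[0]
    return index, regions[index]


# Toy data (made up): 4 image regions, each a 3-dimensional feature vector.
regions = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
hidden = [0.0, 3.0, 0.0]  # toy decoder state that happens to favor region 1

weights = attention_weights(regions, hidden)
soft = soft_context(regions, weights)
index, hard = hard_context(regions, weights, random.Random(0))

print("attention weights:", [round(w, 3) for w in weights])
print("soft context:", [round(v, 3) for v in soft])
print("hard context: region", index, hard)
```

The soft context blends every region, so gradients flow through the weights and ordinary backpropagation applies. The hard context commits to one sampled region, which is why that variant needs a sampling-based training objective. The weights themselves are what the attention maps visualize.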

The approach achieved what were then state-of-the-art results on three standard captioning benchmarks: Flickr8k, Flickr30k, and MS COCO. Coming just months after attention was introduced for machine translation, it showed that the same idea carried over from text to images.

Why business readers should care: visual attention is an early, intuitive demonstration of the attention mechanism at the core of today's transformer models. Its attention visualizations were also among the earliest practical interpretability tools, letting people see where a vision model was looking as it produced each word of its output.

Last verified June 7, 2026