An Image is Worth 16x16 Words (Vision Transformer)

“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” was submitted to arXiv in October 2020 by Alexey Dosovitskiy and colleagues at Google Research. It showed that the Transformer architecture, which had taken over natural language processing, could be applied almost unchanged to images and rival the convolutional networks that had dominated vision since AlexNet.

The idea is disarmingly simple. The Vision Transformer, or ViT, chops an image into a grid of fixed-size patches (16 by 16 pixels in the headline version), flattens each patch, linearly projects it to an embedding, and feeds the resulting sequence to a standard Transformer encoder just as a language model would feed it a sequence of word tokens. There are no convolutions and almost none of the built-in assumptions about locality and translation equivariance that CNNs rely on. The network learns relationships among patches through self-attention.
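The patch-splitting step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `patchify` and the 224 by 224 input size are assumptions chosen for the example, and the learned linear projection, class token, and position embeddings that the full model applies afterward are left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch,
    ordered left to right, then top to bottom.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size

    # Split both spatial axes: (rows, P, cols, P, C).
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes together: (rows, cols, P, P, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into a single vector.
    return grid.reshape(rows * cols, patch_size * patch_size * channels)


# Assumed example input: a 224 x 224 RGB image.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

With 16 by 16 patches, a 224 by 224 RGB image becomes 14 x 14 = 196 tokens, each a vector of 16 x 16 x 3 = 768 raw pixel values, which is the sequence the Transformer then processes.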

The catch is data. Trained only on mid-sized datasets like ImageNet, ViT lagged behind comparable CNNs, because it lacks the convolution's built-in spatial priors and must learn those regularities from the data itself. But pretrained on very large datasets (the authors used Google's internal JFT-300M, with roughly 300 million images), ViT matched or beat the best convolutional models while using substantially less compute to pretrain. That result helped unify vision and language under one architecture and set the stage for the multimodal models and image foundation models that followed.

Last verified June 7, 2026