“Masked Autoencoders Are Scalable Vision Learners” was submitted to arXiv in November 2021 by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick at Facebook AI Research, now Meta AI. It brought to vision the kind of self-supervised pretraining that masked language modeling had brought to text.
The idea is to learn from images without labels by hiding most of each image and asking the network to fill in what is missing. MAE splits each image into patches, masks a large random subset of them (about 75 percent), and trains a vision transformer to reconstruct the missing pixels. Two design choices make it efficient. First, the encoder, the part that does the heavy lifting, processes only the visible patches, roughly a quarter of the image, so pretraining is fast. Second, a lightweight decoder takes the encoder’s output plus placeholder (mask) tokens for the hidden positions and reconstructs the full image. Because so much is hidden, the network cannot solve the task by copying nearby pixels and has to learn genuine structure about objects and scenes. After pretraining, the decoder is no longer needed and the encoder transfers well to downstream tasks: the authors reported 87.8 percent top-1 accuracy on ImageNet-1K with a ViT-Huge model, while speeding up pretraining by a factor of three or more.
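The masking step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code. The function names (`patchify`, `random_masking`, `assemble_decoder_input`, `masked_mse`), the 224×224 image, the 16-pixel patch size, and the all-zero mask token are assumptions chosen for the example, and the encoder is replaced by an identity stand-in so that only the data flow is shown. Scoring the reconstruction on the hidden patches only is one reading of "reconstruct the missing pixels".

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into rows of flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def random_masking(patches, mask_ratio, rng):
    """Keep a random subset of patches and remember how to undo the shuffle."""
    n = patches.shape[0]
    num_keep = int(n * (1 - mask_ratio))
    ids_shuffle = np.argsort(rng.random(n))   # a random permutation of patch indices
    ids_restore = np.argsort(ids_shuffle)     # its inverse permutation
    ids_keep = ids_shuffle[:num_keep]
    mask = np.ones(n, dtype=bool)             # True means "hidden from the encoder"
    mask[ids_keep] = False
    return patches[ids_keep], mask, ids_restore

def assemble_decoder_input(encoded_visible, mask_token, ids_restore):
    """Append one placeholder per hidden patch, then restore the original order."""
    num_masked = ids_restore.shape[0] - encoded_visible.shape[0]
    fill = np.tile(mask_token, (num_masked, 1))
    shuffled = np.concatenate([encoded_visible, fill], axis=0)
    return shuffled[ids_restore]

def masked_mse(pred, target, mask):
    """Mean squared error averaged over the hidden patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a real image
patches = patchify(image, 16)                 # 14 x 14 = 196 patches of 768 values
visible, mask, ids_restore = random_masking(patches, 0.75, rng)

# Identity stand-in for the encoder: a real model would run a vision
# transformer over `visible` only, which is where the speedup comes from.
encoded_visible = visible
mask_token = np.zeros(patches.shape[1])       # a learned vector in a real model
decoder_in = assemble_decoder_input(encoded_visible, mask_token, ids_restore)

# Treating the decoder input as a trivial "prediction" just exercises the loss.
loss = masked_mse(decoder_in, patches, mask)
print(visible.shape, int(mask.sum()), decoder_in.shape)
```

With these settings the encoder would see 49 of the 196 patches, while the decoder works on the full sequence of 196 tokens. Sorting random noise and keeping the inverse permutation is a simple way to sample a mask without replacement and still put the mask tokens back in their original positions.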
MAE became a standard way to pretrain vision transformers, especially when labeled data is scarce, and it reinforced the broader lesson that learning to reconstruct masked inputs is a powerful, general route to good representations across modalities.
For a general reader, MAE shows how AI systems can teach themselves about the visual world from raw, unlabeled images: the same kind of self-supervised learning that powers large language models, applied to seeing.