“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” was submitted to arXiv in March 2021 by Ze Liu and colleagues at Microsoft Research Asia, and won the best paper award (the Marr Prize) at ICCV 2021. It adapted the vision transformer into a general-purpose backbone for classification, detection, and segmentation, a role that convolutional networks had filled until then.
The original Vision Transformer treated an image as a flat sequence of patches and applied self-attention across all of them. That is costly, because attention scales with the square of the number of patches, and it yields features at only a single resolution. Swin made two changes. First, it computes self-attention within small, non-overlapping local windows rather than across the whole image, which keeps the cost linear in the number of patches for a fixed window size. Second, it builds a hierarchy by merging neighboring patches as the network deepens, producing feature maps at multiple resolutions like a CNN, which is what detection and segmentation systems need. The clever part is the shifted-window scheme: alternating layers shift the window boundaries so that information flows between neighboring windows and, over successive layers, across the whole image, a connection that attention confined to fixed windows would never make.
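As a rough illustration of the two ideas above, the following Python sketch counts query-key pairs under global and windowed attention and shows how shifting the window grid regroups patches. The function names, the 56x56 patch grid, and the 7x7 window are illustrative choices, not taken from the text above. The sketch counts attention pairs only, ignores the channel dimension, and leaves out the attention mask that the paper's cyclic-shift implementation applies to patches that wrap around the image border.

```python
def global_attention_pairs(h, w):
    """Query-key pairs when every patch attends to every patch."""
    n = h * w
    return n * n


def windowed_attention_pairs(h, w, m):
    """Query-key pairs when patches attend only inside m x m windows."""
    assert h % m == 0 and w % m == 0, "grid must divide evenly into windows"
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2  # equals h * w * m * m


def window_ids(h, w, m, shift=0):
    """Window index of each patch after cyclically shifting the grid.

    Simplified: wrapped-around patches are grouped without any masking.
    """
    cols = w // m
    return [
        [(((r - shift) % h) // m) * cols + ((c - shift) % w) // m
         for c in range(w)]
        for r in range(h)
    ]


# Cost on an illustrative 56x56 patch grid with 7x7 windows.
print(global_attention_pairs(56, 56))       # 9834496
print(windowed_attention_pairs(56, 56, 7))  # 153664

# Small 8x8 grid with 4x4 windows: regular layer vs. shifted layer.
regular = window_ids(8, 8, 4)
shifted = window_ids(8, 8, 4, shift=2)
print(regular[3][3], regular[4][4])  # 0 3 (different windows)
print(shifted[3][3], shifted[4][4])  # 0 0 (same window after the shift)
```

In this toy count, doubling the grid side to 112x112 quadruples the windowed cost but multiplies the global cost by sixteen. Patches (3, 3) and (4, 4), which sit in different windows on the regular grid, share a window once the grid is shifted by half a window, which is how alternating layers let information cross window boundaries.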
Swin became a leading general-purpose vision backbone, reporting state-of-the-art results at publication on COCO object detection and ADE20K semantic segmentation alongside strong ImageNet classification accuracy, and it served as the foil that the later ConvNeXt paper measured itself against. It showed that transformers could match CNNs not just on classification but across the full range of vision tasks.
For a general reader, Swin is the architecture that made transformers practical for everyday computer vision, adapting a design born in language modeling to the dense, multi-scale demands of images.