“U-Net: Convolutional Networks for Biomedical Image Segmentation,” posted to arXiv in May 2015 by Olaf Ronneberger, Philipp Fischer, and Thomas Brox at the University of Freiburg, tackled a different vision problem from classification or detection: segmentation, which means labeling every single pixel of an image. In biomedical imaging this means tracing the exact outline of a cell or tissue, not just saying it is present.
The architecture is named for its U shape. A contracting path of convolutions and downsampling captures what is in the image at increasingly coarse scales, and a symmetric expanding path of upsampling and convolutions rebuilds a full-resolution output. The key trick is skip connections that copy fine-grained feature maps from each level of the contracting side directly across to the matching level of the expanding side, where they are concatenated with the upsampled features, so the network can place its labels precisely while still using high-level context. Combined with heavy data augmentation, U-Net could be trained from only a few dozen annotated images, and it won the ISBI cell-tracking challenge in 2015 by a wide margin.
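To make the U-shaped data flow concrete, here is a minimal sketch in plain Python. It is a toy under loud assumptions and not the paper's network: it works on a 1-D signal, has no learned convolutions, and uses average pooling and nearest-neighbour upsampling as stand-ins for the real layers. The helper names (`downsample`, `upsample`, `toy_unet`) are hypothetical. It only shows why the skip connections matter: the coarse path alone loses track of where an edge sits, while the copied fine-resolution channel keeps it.

```python
# Toy 1-D illustration of the U-shaped data flow. Assumptions: no learned
# weights, average pooling for downsampling, value repetition for upsampling.
# A "feature map" is a list of channels; each channel is a list of numbers.

def downsample(channels):
    """Halve the resolution by averaging neighbouring pairs (window 2, stride 2)."""
    return [[(ch[i] + ch[i + 1]) / 2 for i in range(0, len(ch) - 1, 2)]
            for ch in channels]

def upsample(channels):
    """Double the resolution by repeating each value (nearest-neighbour)."""
    return [[v for v in ch for _ in range(2)] for ch in channels]

def toy_unet(signal, depth=2):
    """Send a signal down and back up the 'U', concatenating a skip at each level."""
    assert len(signal) % (2 ** depth) == 0, "length must be divisible by 2**depth"
    x = [list(signal)]
    skips = []
    # Contracting path: save the current features, then coarsen them.
    for _ in range(depth):
        skips.append(x)
        x = downsample(x)
    # Expanding path: restore resolution, then append the saved features
    # from the matching level as extra channels (the skip connection).
    for skip in reversed(skips):
        x = upsample(x) + skip
    return x

signal = [0, 0, 0, 1, 1, 1, 1, 1]  # a sharp edge: the first 1 is at index 3
out = toy_unet(signal, depth=2)

for channel in out:
    print(channel)
```

In this run the first output channel, which passed through the bottom of the U, places the jump to 1 at index 4, while the last channel, copied across by the top-level skip connection, still has it at index 3. In a real U-Net, learned convolutions after each concatenation would fuse these channels into a single prediction per pixel.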
U-Net became one of the most reused architectures in all of deep learning. It is a standard choice for medical image segmentation, satellite and microscopy analysis, and many other scientific imaging tasks. Less obviously, the U-Net shape became the denoising backbone of many diffusion image generators, including Stable Diffusion, where a U-Net is trained to predict the noise in an image so that it can be removed step by step. A network designed to outline cells turned out to be the engine behind many of today’s text-to-image models.