“Image-to-Image Translation with Conditional Adversarial Networks,” posted to arXiv on November 21, 2016 by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, introduced the system widely known as pix2pix. The goal was to translate an image from one representation to another, for example turning a segmentation map into a photo, an edge drawing into a realistic object, or a daytime scene into night.
What made pix2pix notable was its generality. Earlier approaches typically needed a hand-engineered loss function tailored to each task. pix2pix instead used a conditional GAN, in which a generator learns to produce the output image while a discriminator learns the loss function that judges whether the output looks real and matches the input. Because the adversarial loss is learned rather than hand-designed, the same architecture could be applied to many different translation problems without designing a new loss for each one. The model required paired training data, meaning matched before-and-after examples.
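The two competing objectives can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: images are flat lists of pixel values, the discriminator's outputs are passed in as plain probabilities, and the function names are hypothetical. The L1 reconstruction term and its weight of 100 are taken from the pix2pix paper rather than from the summary above.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped away from 0 and 1."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def l1(a, b):
    """Mean absolute difference between two equal-length pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def discriminator_loss(d_real, d_fake):
    """The discriminator should score (input, target) pairs as real (1)
    and (input, generated) pairs as fake (0)."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake, generated, target, lam=100.0):
    """The generator wants its output scored as real, and is also pulled
    toward the paired target by a weighted L1 term (lam is an assumption
    here, set to the value reported in the paper)."""
    return bce(d_fake, 1.0) + lam * l1(generated, target)

# Toy values: a 3-pixel "image" and hand-picked discriminator scores.
target = [0.2, 0.8, 0.5]
generated = [0.25, 0.7, 0.5]
g_loss = generator_loss(d_fake=0.5, generated=generated, target=target)
d_loss = discriminator_loss(d_real=0.9, d_fake=0.5)
```

In a real system the probabilities would come from a discriminator network that sees the input image alongside either the true target or the generated output. That conditioning on the input is what makes the GAN conditional, and the L1 term is where the paired training data is used directly.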
The pix2pix paper became one of the most influential and widely reproduced works in image generation, spawning interactive demos and creative tools and setting up its direct successor, CycleGAN, which removed the need for paired data. For a general reader, pix2pix is a clear illustration of conditional generation: instead of producing a random image, the model produces a specific image that corresponds to the input it was given. That pattern recurs throughout modern text-to-image and controllable generation systems.