High-Resolution Image Synthesis with Latent Diffusion Models

“High-Resolution Image Synthesis with Latent Diffusion Models,” submitted to arXiv on December 20, 2021 by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, tackled the central cost problem of diffusion image generation. Earlier diffusion models such as DDPM ran the iterative denoising process directly on pixels, so each of the many denoising steps had to process the full-resolution image, which made high-resolution training and sampling enormously expensive. The paper’s idea was to first train an autoencoder that compresses an image into a much smaller latent representation, then run the diffusion process entirely in that latent space and decode back to pixels only at the end.
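To get a feel for how much smaller the latent representation can be, the sketch below counts the values the diffusion process has to handle before and after compression. The downsampling factor of 8 per side and the 4 latent channels are illustrative assumptions (they correspond to the configuration commonly associated with Stable Diffusion, not to a figure stated above), and the function names are hypothetical.

```python
def latent_shape(height, width, f, latent_channels):
    """Shape of the latent grid when each image side is downsampled by a factor f."""
    if height % f or width % f:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // f, width // f, latent_channels)


def compression_ratio(height, width, f, latent_channels, image_channels=3):
    """How many pixel values there are for every latent value."""
    h, w, c = latent_shape(height, width, f, latent_channels)
    return (height * width * image_channels) / (h * w * c)


# Assumed example: a 512x512 RGB image, f = 8, 4 latent channels.
shape = latent_shape(512, 512, f=8, latent_channels=4)
ratio = compression_ratio(512, 512, f=8, latent_channels=4)
print(shape)  # (64, 64, 4)
print(ratio)  # 48.0
```

Under these assumptions each denoising step works on 48 times fewer values than a pixel-space model would, which is where most of the cost saving comes from.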

This “latent diffusion” split the work into a perceptual compression stage and a generative stage, hitting what the authors called a near-optimal point between reducing complexity and preserving detail. It cut training and inference cost dramatically while keeping image quality high. The architecture also added cross-attention layers, which let the model condition on text prompts, bounding boxes, or other inputs, turning it into a general conditional image generator rather than just an unconditional sampler.
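The conditioning mechanism can be illustrated with a minimal cross-attention sketch: queries come from the latent image features, while keys and values come from the encoded conditioning input (for example, text tokens). This is a single-head toy version in plain Python with random weights and made-up sizes, intended only to show the data flow. It is not the paper's implementation, and all names and dimensions here are assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Each latent position attends over the conditioning tokens."""
    q = matmul(latent_tokens, w_q)  # queries from the image latents
    k = matmul(cond_tokens, w_k)    # keys from the conditioning input
    v = matmul(cond_tokens, w_v)    # values from the conditioning input
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
latent_tokens = random_matrix(16, 8, rng)  # 16 latent positions, 8 features each
cond_tokens = random_matrix(5, 6, rng)     # 5 conditioning tokens, 6 features each
w_q = random_matrix(8, 4, rng)             # projections into a shared 4-dim space
w_k = random_matrix(6, 4, rng)
w_v = random_matrix(6, 4, rng)

out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
print(len(out), len(out[0]))          # 16 4
print(len(weights), len(weights[0]))  # 16 5
```

Because the conditioning input only enters through the keys and values, the same mechanism works for any input that can be encoded as a sequence of vectors, which is what makes the generator general-purpose.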

The method reached state-of-the-art or competitive results across text-to-image generation, inpainting, super-resolution, and semantic synthesis. Its real significance is downstream: this is the architecture that was scaled up and released to the public in August 2022 as Stable Diffusion, a collaboration between the authors’ research group, Runway, and Stability AI, which backed the training. That open model put high-quality image generation on consumer hardware. The paper was published at CVPR 2022.

Why business readers should care: the move to latent space is why image generation became cheap enough to run on an ordinary graphics card and free enough to spawn an entire ecosystem of tools, fine-tunes, and plugins. The cost structure of a technology often decides who gets to use it.

