Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)

“Adding Conditional Control to Text-to-Image Diffusion Models,” submitted to arXiv on February 10, 2023 by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala, introduced ControlNet, an architecture that adds spatial control to a pretrained text-to-image diffusion model. Text prompts are good at describing what should appear but poor at specifying exactly where things go. ControlNet lets a user supply an extra conditioning image, such as a human pose skeleton, an edge map, a depth map, or a segmentation mask, and have the generated image follow that structure.

The key engineering idea is to lock the weights of the large, production-ready diffusion model and clone its encoder into a trainable branch, which is connected back to the locked model through zero-initialized convolution layers called zero convolutions. Because those layers start at zero, the control branch initially contributes nothing and the combined model behaves exactly like the original. During fine-tuning, the branch gradually learns to inject the spatial condition while the original weights stay untouched. This prevents the noise of early training from corrupting a model that took enormous resources to build, and it keeps training robust across dataset sizes, from under 50,000 images to over a million.
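
The zero-convolution behavior can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single pixel with two channels, uses one linear layer as a stand-in for an encoder block, and every name in it (conv1x1, zero_conv, controlled_block, and so on) is hypothetical. It only shows two properties described above: at initialization the output matches the locked model exactly, and the zero-initialized layer still receives a nonzero gradient, so it can start learning.

```python
import copy


def conv1x1(layer, x):
    """Apply a 1x1 convolution at a single pixel: weight @ x + bias."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def zero_conv(channels):
    """A 1x1 convolution whose weight and bias both start at zero."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels


def add(a, b):
    """Element-wise sum of two feature vectors."""
    return [p + q for p, q in zip(a, b)]


def control_features(trainable, zero_in, x, cond):
    """Output of the trainable copy after the condition is injected into its input."""
    return conv1x1(trainable, add(x, conv1x1(zero_in, cond)))


def controlled_block(frozen, trainable, zero_in, zero_out, x, cond):
    """Locked output plus the control branch, gated by the output zero convolution."""
    h = control_features(trainable, zero_in, x, cond)
    return add(conv1x1(frozen, x), conv1x1(zero_out, h))


# Toy "pretrained" block with two channels; the values are arbitrary (assumption).
frozen = ([[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2])
trainable = copy.deepcopy(frozen)      # trainable clone of the locked weights
zero_in, zero_out = zero_conv(2), zero_conv(2)

x, cond = [1.0, 2.0], [3.0, -1.0]
before = controlled_block(frozen, trainable, zero_in, zero_out, x, cond)
print(before == conv1x1(frozen, x))    # control branch adds nothing at initialization

# For a toy loss L = sum(y), dL/dW_out[i][j] = h[j]: nonzero even though W_out is zero.
h = control_features(trainable, zero_in, x, cond)
grad_w_out = [list(h) for _ in range(2)]

# One plain gradient-descent step on the output zero convolution's weight only.
lr = 0.1
weight, bias = zero_out
zero_out = ([[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(weight, grad_w_out)], bias)
after = controlled_block(frozen, trainable, zero_in, zero_out, x, cond)
print(after != before)                 # the branch now changes the output
```

The locked weights in frozen are never modified. Only the clone and the zero convolutions would be updated in training, which is why the original model's behavior is preserved at the start.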

ControlNet spread through the open Stable Diffusion community almost immediately because it turned an unpredictable prompt-only tool into something an artist or designer could actually direct. Holding a pose, matching a layout, or tracing a reference sketch became a single conditioning input. The paper received the Marr Prize, the best-paper award, at ICCV 2023.

Why business readers should care: ControlNet is a clean example of adding a steering wheel to a powerful but uncontrollable model without retraining it. The pattern of freezing an expensive base model and attaching a small trainable adapter shows up across modern AI deployment.

Last verified June 7, 2026