Classifier-Free Diffusion Guidance

“Classifier-Free Diffusion Guidance,” posted to arXiv on July 26, 2022 by Jonathan Ho and Tim Salimans of Google, introduced a technique that is now standard in nearly every text-to-image diffusion system. It addresses how to make a diffusion model adhere strongly to a condition, such as a text prompt or a class label, rather than producing something only loosely related to it.

Earlier methods used classifier guidance, which required training a separate classifier on noisy images and using its gradients to steer generation. That was awkward and fragile: it meant a second model to train, and an ordinary classifier trained on clean images could not simply be plugged in. The classifier-free approach removes the extra model entirely. During training, the prompt is randomly dropped some of the time, so a single diffusion model learns to generate both conditioned on the prompt and unconditionally. At generation time, the model is run with and without the prompt, and the two predictions are combined by pushing the result away from the unconditional prediction and toward the conditional one. A guidance scale dials up how far the output is pushed toward the prompt. Higher guidance yields images that match the prompt more faithfully, at some cost to diversity.
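The two halves of the method described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `maybe_drop_condition` and `guided_noise`, the empty-string null prompt, and the toy three-element noise predictions are all assumptions made for the example. The combination rule follows the weighting used in the paper, where a guidance weight `w` of 0 recovers the ordinary conditional prediction.

```python
import random


def maybe_drop_condition(cond, null_cond, p_uncond, rng):
    """Training side (hypothetical helper): with probability p_uncond,
    replace the condition with a null token so the same model also
    learns to generate unconditionally."""
    return null_cond if rng.random() < p_uncond else cond


def guided_noise(eps_cond, eps_uncond, w):
    """Sampling side (hypothetical helper): combine the conditional and
    unconditional noise predictions elementwise as
    (1 + w) * eps_cond - w * eps_uncond."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions standing in for two forward passes of one model.
eps_cond = [0.5, -1.0, 2.0]    # prediction with the prompt
eps_uncond = [0.0, -0.5, 1.0]  # prediction with the prompt dropped

no_guidance = guided_noise(eps_cond, eps_uncond, w=0.0)
strong = guided_noise(eps_cond, eps_uncond, w=2.0)

print(no_guidance)  # identical to eps_cond
print(strong)       # pushed further from eps_uncond, past eps_cond
```

Many tools express the same rule with a scale s = 1 + w, written as eps_uncond + s * (eps_cond - eps_uncond), so a setting of 1 in those tools corresponds to no extra guidance.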

This simple, elegant trick is one of the most important practical ingredients in modern generative AI. It is why turning up a guidance setting in tools built on Stable Diffusion, Imagen, or DALL-E makes images hew more tightly to the words you typed. For a general reader, classifier-free guidance is the knob, mostly hidden inside the product, that controls how literally an image generator obeys your prompt.
