GLIDE: Text-Guided Diffusion for Image Generation and Editing

“GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models,” posted to arXiv on December 20, 2021 by Alex Nichol and colleagues at OpenAI, was an important step on the path from the original DALL-E to DALL-E 2. It demonstrated that diffusion models, guided by text, could generate photorealistic images and also edit existing ones from written instructions.

The paper compared two ways of steering generation toward a text prompt: guidance from a separate CLIP model and the simpler classifier-free guidance, which needs no extra model. Human evaluators preferred classifier-free guidance over CLIP guidance for both photorealism and caption similarity. They also judged samples from the 3.5-billion-parameter GLIDE model to be more photorealistic and better matched to the caption than samples from the original 12-billion-parameter DALL-E. Finally, the authors showed that GLIDE could be fine-tuned for inpainting: a user marks a region of an image and describes what should appear there, which enables text-driven editing rather than only generation from scratch.
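As a rough illustration of classifier-free guidance, the sketch below combines two noise predictions from the same model, one made with the caption and one made without it, and extrapolates away from the unconditional one. The function name, the toy numbers, and the use of plain Python lists in place of image tensors are assumptions made for illustration; this is not the authors' implementation.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    scale = 1 returns the conditional prediction unchanged;
    scale > 1 pushes the result further toward the caption.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy stand-ins for the two model outputs (hypothetical values).
eps_uncond = [0.0, 1.0]   # prediction with an empty caption
eps_cond = [1.0, 1.0]     # prediction with the real caption

guided = classifier_free_guidance(eps_cond, eps_uncond, scale=3.0)
print(guided)
```

Only the component where the two predictions disagree is amplified, which is why a larger guidance scale tends to trade sample diversity for closer agreement with the caption.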

GLIDE established the recipe, text-conditioned diffusion plus classifier-free guidance, that the next wave of image generators would adopt, and OpenAI built DALL-E 2 on closely related ideas. For a general reader, GLIDE marks the moment OpenAI’s image work shifted from the token-based original DALL-E toward the diffusion approach that powers most of today’s text-to-image tools, and it previewed the now-common feature of editing photos by simply describing the change.
