Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2 / unCLIP)

“Hierarchical Text-Conditional Image Generation with CLIP Latents,” submitted to arXiv on April 13, 2022 by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen, is the paper behind DALL-E 2. The approach is nicknamed unCLIP because it runs CLIP, OpenAI’s image-text matching model, in reverse: instead of scoring how well an image matches text, it generates an image from a CLIP embedding.

The system has two stages. A prior turns a text caption into a CLIP image embedding, and a diffusion decoder turns that embedding back into a full image. The authors found that explicitly generating an image representation in this way improved image diversity with minimal loss in photorealism or caption fidelity. Because everything is anchored in CLIP’s joint embedding space, the model can also produce semantics-preserving variations of an existing image and perform language-guided edits without task-specific training.
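The two-stage data flow can be pictured with a toy sketch. Everything below is a hypothetical stand-in, not the paper's models: `toy_text_encoder`, `toy_prior`, and `toy_decoder` are invented names that use random vectors and noise in place of CLIP, the learned prior, and the diffusion decoder. The point is only the shape of the pipeline (caption, then text embedding, then image embedding, then image) and why variations fall out for free: the same image embedding can be decoded again with fresh noise.

```python
import numpy as np

EMBED_DIM = 8  # toy size (assumption); real CLIP embeddings are much larger


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def toy_text_encoder(caption: str) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder.

    Maps a caption to a deterministic unit vector; no learning involved.
    """
    seed = sum(ord(ch) for ch in caption)
    rng = np.random.default_rng(seed)
    return normalize(rng.standard_normal(EMBED_DIM))


def toy_prior(z_text: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical stand-in for the prior (stage 1).

    The real prior is a trained model; here we just perturb the text
    embedding slightly to get an "image embedding".
    """
    return normalize(z_text + 0.1 * rng.standard_normal(EMBED_DIM))


def toy_decoder(z_image: np.ndarray, rng: np.random.Generator, size: int = 4) -> np.ndarray:
    """Hypothetical stand-in for the diffusion decoder (stage 2).

    Produces a size x size "image" determined mostly by the embedding,
    plus a small amount of sampling noise.
    """
    base = np.outer(z_image[:size], z_image[size:2 * size])
    return base + 0.01 * rng.standard_normal((size, size))


def generate(caption: str, seed: int = 0) -> np.ndarray:
    """Caption -> text embedding -> image embedding -> image."""
    rng = np.random.default_rng(seed)
    z_text = toy_text_encoder(caption)
    z_image = toy_prior(z_text, rng)
    return toy_decoder(z_image, rng)


def variations(z_image: np.ndarray, n: int, seed: int = 0) -> list:
    """Decode one image embedding n times with different noise."""
    return [toy_decoder(z_image, np.random.default_rng(seed + i)) for i in range(n)]
```

In this sketch, calling `generate` with the same caption and seed reproduces the same output, while `variations` holds the embedding fixed and changes only the decoder's noise, which mirrors in miniature how an embedding-anchored design separates "what the image is about" from "which particular image gets drawn".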

DALL-E 2 was a visible leap over the original 2021 DALL-E in sharpness and coherence, and it arrived in the same window as Google’s Imagen and the latent-diffusion work that became Stable Diffusion, marking 2022 as the year text-to-image generation went mainstream. The decoder is a diffusion model. For the prior, the authors tested both autoregressive and diffusion versions and found the diffusion prior computationally more efficient and better at producing high-quality samples.

Why business readers should care: unCLIP showed that a strong representation model built for one purpose (matching images to captions) could be inverted into a generator. Reusing existing learned representations, rather than training everything from scratch, is a recurring shortcut in modern AI products.
