“Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding,” submitted to arXiv on May 23, 2022 by Chitwan Saharia and colleagues at Google Research, introduced Imagen, a text-to-image diffusion system. Its headline finding was about text encoders rather than image models. A large generic language model trained only on text, specifically T5, turned out to be a surprisingly effective encoder for image generation. Scaling that frozen text encoder improved image quality and prompt alignment more than scaling the diffusion model itself.
Imagen first generates a small 64×64 image conditioned on the T5 text embedding, then passes it through a cascade of two diffusion super-resolution models that upscale it to 256×256 and finally 1024×1024. The paper also introduced DrawBench, a structured benchmark of prompts designed to probe compositional and reasoning failures in text-to-image models, since standard image-quality metrics did not capture how well a model followed a complicated instruction.
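For readers who want to see the shape of that pipeline, here is a minimal Python sketch of the data flow only. It is not Imagen's code, which Google did not release. Every function name is hypothetical, the "models" are toy stand-ins (a seeded random generator and nearest-neighbour upsampling rather than diffusion), and the resolutions of 64, 256, and 1024 pixels follow the figures reported in the paper.

```python
import random
from typing import List

Image = List[List[float]]  # toy single-channel image: rows of pixel values


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for the frozen T5 encoder.

    The real encoder returns a sequence of learned embeddings; this toy
    version just derives a fixed-length vector deterministically from the prompt.
    """
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def base_model(text_emb: List[float], size: int = 64) -> Image:
    """Hypothetical stand-in for the base diffusion model (text -> small image)."""
    rng = random.Random(sum(text_emb))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def super_resolve(image: Image, text_emb: List[float], factor: int = 4) -> Image:
    """Hypothetical stand-in for one super-resolution stage.

    The real stage is a text-conditioned diffusion model; here text_emb is
    accepted but unused, and pixels are simply repeated `factor` times.
    """
    out: Image = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def generate(prompt: str) -> Image:
    """Run the cascade: encode text once, then 64 -> 256 -> 1024."""
    emb = encode_text(prompt)       # computed once, reused by every stage
    image = base_model(emb)         # 64 x 64
    image = super_resolve(image, emb)  # 256 x 256
    image = super_resolve(image, emb)  # 1024 x 1024
    return image


if __name__ == "__main__":
    result = generate("a corgi riding a bicycle")
    print(len(result), "x", len(result[0]))
```

The point of the sketch is structural: the text embedding is computed once by a frozen encoder and handed to every stage, so improving that encoder improves the conditioning signal for the whole cascade.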
The system reported a new state-of-the-art zero-shot FID score of 7.27 on the COCO dataset (lower is better) without ever training on COCO, and human raters preferred Imagen over DALL-E 2, latent diffusion, and VQ-GAN+CLIP in side-by-side comparisons. Google did not release Imagen publicly at the time, citing concerns about misuse and bias, which left the open ecosystem to coalesce around Stable Diffusion instead. The Imagen lineage later fed Google’s product image and video models.
Why business readers should care: Imagen reframed where the intelligence in an image generator lives. Much of the gain came from a better understanding of language, not better pixels, which is why prompt comprehension became a competitive battleground for these tools.