Scalable Diffusion Models with Transformers (DiT)

“Scalable Diffusion Models with Transformers,” posted to arXiv on December 19, 2022 by William Peebles and Saining Xie, introduced the Diffusion Transformer, or DiT. Until then, diffusion models almost universally used a convolutional U-Net as their backbone. DiT discarded that convention and instead ran a plain transformer over patches of a latent image representation, the same patch-based approach that the Vision Transformer had brought to image classification.
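
The patch step can be sketched in a few lines. This is an illustration rather than the authors' code: the function name `patchify` is hypothetical, and the 4-channel, 32-by-32 latent is assumed as the input shape for a 256-by-256 image. In the real model each flattened patch would then pass through a learned linear projection, which is omitted here.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (channels, height, width) latent into flattened square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, channels * patch_size * patch_size).
    """
    c, h, w = latent.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("latent height and width must be divisible by patch_size")
    gh, gw = h // p, w // p                  # patches per column, per row
    x = latent.reshape(c, gh, p, gw, p)      # carve each axis into (grid, within-patch)
    x = x.transpose(1, 3, 0, 2, 4)           # -> (gh, gw, c, p, p)
    return x.reshape(gh * gw, c * p * p)     # one row per patch, i.e. one token each

# Assumed latent shape: 4 channels, 32 x 32 spatial positions.
latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)  # (256, 16)
```

With a patch size of 2, the transformer sees the image as a sequence of 256 tokens, much as a language model sees a sentence as a sequence of word pieces.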

The paper’s central finding was about scaling. The authors showed that DiTs with more forward-pass compute, whether from greater transformer depth or width or from more input tokens, consistently achieved lower Fréchet Inception Distance (FID), a standard measure of generated-image quality on which lower is better. Their largest model, DiT-XL/2, reached an FID of 2.27 on the 256-by-256 ImageNet class-conditional benchmark, a state-of-the-art result at the time. In other words, image generation inherited the clean compute-quality relationship that had made large language models so predictable to improve.
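
The "more input tokens" lever comes from the patch size, which is the number after the slash in a name like DiT-XL/2. A rough sketch of the arithmetic, assuming a square 32-by-32 latent grid and a hypothetical helper named `num_tokens`:

```python
def num_tokens(latent_size, patch_size):
    """Number of patch tokens for a square latent grid (illustrative helper)."""
    if latent_size % patch_size:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2

# Assumed 32 x 32 latent grid; patch sizes 8, 4, and 2.
counts = {p: num_tokens(32, p) for p in (8, 4, 2)}
print(counts)  # {8: 16, 4: 64, 2: 256}
```

Halving the patch size quadruples the token count, so the same network spends more compute per image without gaining any parameters in its transformer blocks.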

DiT proved consequential well beyond its own benchmarks. The transformer-on-patches recipe became the architectural foundation for the next generation of generative video systems, including OpenAI’s Sora, which treats video as sequences of spacetime patches, and many of the commercial text-to-video models that followed. For a general reader, DiT marks the moment the scaling playbook that powered language models crossed over into visual generation, setting up the rapid progress in image and video synthesis that came after.

Sources

Last verified June 7, 2026