DreamFusion: Text-to-3D using 2D Diffusion

“DreamFusion: Text-to-3D using 2D Diffusion,” posted to arXiv on September 29, 2022 by Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall, sidestepped a problem that had held back text-to-3D generation: the lack of large datasets of 3D models paired with text. No 3D equivalent existed of the billions of captioned images used to train 2D systems.

DreamFusion’s insight was to borrow the knowledge already baked into a pretrained 2D text-to-image diffusion model and use it as a critic for 3D. The method optimizes a Neural Radiance Field, or NeRF, a learned 3D scene representation, so that images rendered from randomly sampled viewpoints look like plausible outputs of the diffusion model for the given text prompt. To make this work, the authors introduced a technique they called Score Distillation Sampling: a rendered image is corrupted with noise, the frozen diffusion model predicts that noise given the prompt, and the difference between the predicted and the actual noise is used as a gradient on the NeRF’s parameters, with no need to backpropagate through the diffusion model itself. The result is a viewable, relightable 3D object generated from text alone, without ever training on 3D data.
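As a rough illustration of the Score Distillation Sampling update, the sketch below applies a gradient of the form w(t) * (predicted noise - injected noise) * dx/dtheta to a toy problem. Everything in it is an assumption made for illustration and not the authors' implementation: the "renderer" is the identity function rather than a NeRF, the frozen text-conditioned diffusion model is replaced by the exact noise predictor for a Gaussian "image" distribution N(mu, s^2), and the weighting w(t) = sigma_t^2 and the noise schedule are just one simple choice. The names toy_noise_predictor, render, sds_gradient, and optimize are hypothetical.

```python
import numpy as np

def toy_noise_predictor(x_t, alpha, sigma, mu, s):
    """Exact noise prediction when clean 'images' follow N(mu, s^2 I).

    Assumption: stands in for the frozen, text-conditioned diffusion model.
    """
    return sigma * (x_t - alpha * mu) / (alpha**2 * s**2 + sigma**2)

def render(theta):
    """Hypothetical differentiable renderer; here simply the identity,
    so dx/dtheta is the identity as well (a real system would render a NeRF)."""
    return theta

def sds_gradient(theta, mu, s, rng, n_samples=64):
    """Monte Carlo estimate of the SDS gradient w(t) * (eps_hat - eps) * dx/dtheta."""
    x = render(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        t = rng.uniform(0.02, 0.98)                   # random noise level
        alpha, sigma = np.sqrt(1.0 - t), np.sqrt(t)   # toy variance-preserving schedule
        eps = rng.standard_normal(theta.shape)        # injected noise
        x_t = alpha * x + sigma * eps                 # noised rendering
        eps_hat = toy_noise_predictor(x_t, alpha, sigma, mu, s)
        w = sigma**2                                  # assumed weighting
        grad += w * (eps_hat - eps)                   # dx/dtheta = I for this renderer
    return grad / n_samples

def optimize(mu, s=0.5, steps=500, lr=0.1, seed=0):
    """Plain gradient descent on theta; the noise predictor is never differentiated."""
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(mu)
    for _ in range(steps):
        theta = theta - lr * sds_gradient(theta, mu, s, rng)
    return theta

target = np.array([1.0, -2.0, 0.5])   # mean of the toy 'image' distribution
theta = optimize(target)
```

In this toy setting the parameters drift toward the mean of the assumed image distribution, which loosely mirrors how the 3D representation is pulled toward renderings the diffusion model considers likely for the prompt. Note that the noise predictor is only ever evaluated, never differentiated.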

DreamFusion launched a wave of text-to-3D research and connected two previously separate threads, diffusion-based image generation and neural 3D representations. Score Distillation Sampling in particular became a widely reused tool. For a general reader, DreamFusion is a striking example of transfer: a model that only ever learned about flat images turned out to contain enough understanding of the visual world to help build three-dimensional objects.
