Make-A-Video: Text-to-Video Generation without Text-Video Data

The paper “Make-A-Video: Text-to-Video Generation without Text-Video Data,” posted to arXiv on September 29, 2022 by Uriel Singer and colleagues at Meta AI, introduced one of the first systems to show convincing text-to-video generation. It confronted a data shortage much like the one that hampered text-to-3D work: large, high-quality datasets pairing text descriptions with video clips were scarce.

The solution was to split the problem in two. Make-A-Video learned what the world looks like and how text maps to imagery from abundant text-image pairs, and it learned how things move from large collections of unlabeled video that needed no captions. The system added spatial-temporal modules to a text-to-image backbone and used a multi-stage pipeline, with separate components to interpolate between frames for smoother motion and to upscale the output for higher resolution. This let it produce short videos from a text prompt without ever training on matched text-video data.
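The staged structure described above can be pictured with a toy sketch. This is not the paper's implementation: every function name, frame size, and scaling factor here is a hypothetical placeholder. Linear blending and nearest-neighbor repetition stand in for the learned frame-interpolation and super-resolution networks, and the "generator" ignores its prompt entirely.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame: rows of pixel values
Video = List[Frame]        # a clip: frames in time order


def generate_keyframes(prompt: str, num_frames: int = 4, size: int = 2) -> Video:
    """Hypothetical stand-in for the text-conditioned generator.

    Returns a few tiny frames whose brightness rises over time.
    The prompt is accepted but ignored in this toy.
    """
    return [[[float(t)] * size for _ in range(size)] for t in range(num_frames)]


def interpolate_frames(video: Video, factor: int = 2) -> Video:
    """Insert (factor - 1) linearly blended frames between each neighboring pair."""
    out: Video = []
    for a, b in zip(video, video[1:]):
        for k in range(factor):
            w = k / factor  # blend weight: 0 reproduces frame a exactly
            out.append([[(1 - w) * pa + w * pb for pa, pb in zip(ra, rb)]
                        for ra, rb in zip(a, b)])
    out.append(video[-1])  # keep the final keyframe
    return out


def upscale(video: Video, factor: int = 2) -> Video:
    """Nearest-neighbor upscaling: repeat each pixel factor x factor times."""
    return [[[p for p in row for _ in range(factor)]
             for row in frame for _ in range(factor)]
            for frame in video]


def make_video(prompt: str) -> Video:
    """Run the stages in order: keyframes, then smoother motion, then more pixels."""
    keyframes = generate_keyframes(prompt)
    smooth = interpolate_frames(keyframes, factor=2)
    return upscale(smooth, factor=2)


clip = make_video("a dog running on a beach")
print(len(clip), len(clip[0]), len(clip[0][0]))  # frames, height, width
```

The point of the sketch is only the division of labor: one stage decides what the frames contain, a second raises the frame rate, and a third raises the resolution, so each stage can be built and trained separately.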

Make-A-Video, alongside contemporaneous efforts, helped open the text-to-video era that later produced systems such as Sora and commercial video generators. For a general reader, it illustrates a recurring and powerful strategy in generative AI: when you lack paired data for exactly the task you want, decompose the task so that each part can be learned from the plentiful data that does exist.

Last verified June 7, 2026