“Make-A-Video: Text-to-Video Generation without Text-Video Data,” posted to arXiv on September 29, 2022 by Uriel Singer and colleagues at Meta AI, was one of the first systems to show convincing text-to-video generation. It confronted the same data shortage that hampered text-to-3D work: high-quality datasets pairing text descriptions with video clips were scarce.
The solution was to split the problem in two. Make-A-Video learned what the world looks like and how text maps to imagery from abundant text-image pairs, and it learned how things move from large collections of unlabeled video that needed no captions. The system extended a text-to-image backbone with spatial-temporal modules (convolution and attention layers that also operate along the time axis) and used a multi-stage pipeline, with separate components to interpolate between frames for smoother motion and to upscale the output for higher resolution. This let it produce short videos from a text prompt without ever training on matched text-video data.
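One way to picture how temporal modules can be added to an image backbone without disturbing what it already knows is a factorized layer: a per-frame spatial operation followed by a one-dimensional convolution across time whose kernel starts as the identity. The toy sketch below assumes that design, in the spirit of the pseudo-3D layers the paper describes. Frames are flat lists of numbers, a simple blur stands in for a pretrained image layer, and every name (`spatial_layer`, `temporal_layer`, `pseudo_3d_layer`) is hypothetical rather than taken from any released code.

```python
from typing import List

Frame = List[float]   # toy "frame": a flat list of pixel values
Video = List[Frame]   # toy "video": a list of frames


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a pretrained image layer: a 3-tap blur within one frame."""
    n = len(frame)
    out = []
    for i in range(n):
        left = frame[max(i - 1, 0)]
        right = frame[min(i + 1, n - 1)]
        out.append((left + frame[i] + right) / 3.0)
    return out


def temporal_layer(video: Video, kernel: List[float]) -> Video:
    """1D convolution along the time axis, applied at each pixel position.

    Edge frames are repeated as padding so the output has as many frames
    as the input.
    """
    half = len(kernel) // 2
    last = len(video) - 1
    out = []
    for t in range(len(video)):
        frame = []
        for p in range(len(video[t])):
            acc = 0.0
            for k, weight in enumerate(kernel):
                src = min(max(t + k - half, 0), last)
                acc += weight * video[src][p]
            frame.append(acc)
        out.append(frame)
    return out


def pseudo_3d_layer(video: Video, kernel: List[float]) -> Video:
    """Spatial step on every frame, then a temporal step across frames."""
    per_frame = [spatial_layer(frame) for frame in video]
    return temporal_layer(per_frame, kernel)


# Identity kernel: the temporal step passes each frame through unchanged,
# so the video layer initially behaves like the image layer run per frame.
IDENTITY_KERNEL = [0.0, 1.0, 0.0]

# A kernel that training on unlabeled video might move toward (made-up values):
# it mixes each frame with its neighbours in time.
SMOOTHING_KERNEL = [0.25, 0.5, 0.25]

video = [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [0.0, 0.0, 0.0]]
at_init = pseudo_3d_layer(video, IDENTITY_KERNEL)
after_mixing = pseudo_3d_layer(video, SMOOTHING_KERNEL)
```

With the identity kernel, `at_init` equals the result of running `spatial_layer` on each frame independently, which is why a layer like this can be dropped into an image model without changing its behavior. Only once the temporal weights move away from the identity, as with `SMOOTHING_KERNEL`, does information flow between frames.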
Make-A-Video, alongside contemporaneous efforts, helped open the text-to-video era that later produced systems such as Sora and commercial video generators. For a general reader, it illustrates a recurring and powerful strategy in generative AI: when you lack paired data for exactly the task you want, decompose the task so that each part can be learned from the plentiful data that does exist.