Movie Gen: A Cast of Media Foundation Models

“Movie Gen: A Cast of Media Foundation Models,” posted to arXiv on October 17, 2024 by a team of 88 researchers at Meta led by Adam Polyak and Andrew Brown, described a suite of foundation models for generating and editing high-quality video with synchronized audio. It was Meta’s bid to match the frontier of generative video set by Sora and the leading commercial systems.

The largest video model is a transformer with roughly 30 billion parameters and can generate up to 16 seconds of 1080p HD video at 16 frames per second across different aspect ratios. Beyond plain text-to-video, the system handles instruction-based video editing and personalized generation that places a specific person from a reference image into a scene. It also includes a separate audio model that produces synchronized sound effects, ambient noise, and music to accompany the visuals. The paper emphasized engineering simplifications to the architecture, latent spaces, and training objectives that made this scale practical.
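To give a sense of the scale behind those figures, the short Python sketch below works out how many frames and raw pixels a maximum-length clip contains. The duration and frame rate come from the text above; the 1920x1080 landscape resolution is an assumption for "1080p", and the helper names are hypothetical. It says nothing about the paper's actual latent compression, only about why a compressed latent space is attractive at this size.

```python
# Figures from the text: a 16-second clip at 16 frames per second.
DURATION_S = 16
FPS = 16

# Assumption: "1080p" taken as 16:9 landscape, 1920x1080 pixels.
WIDTH, HEIGHT = 1920, 1080


def frame_count(duration_s: int, fps: int) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return duration_s * fps


def raw_pixel_count(frames: int, width: int, height: int) -> int:
    """Total pixels across all frames, before any compression."""
    return frames * width * height


frames = frame_count(DURATION_S, FPS)
pixels = raw_pixel_count(frames, WIDTH, HEIGHT)

print(frames)  # 256
print(pixels)  # 530841600
```

A single clip therefore spans 256 frames and roughly half a billion pixels, which illustrates why a model at this scale would operate on a compact latent representation rather than on raw frames.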

Movie Gen mattered as a detailed, openly published account of how a frontier video system is built, at a time when the leading models were often described only through marketing pages. For a general reader, it signaled that generative video was moving past silent clips toward complete short-form media, with audio and editing built in, which is what production use in advertising and entertainment requires.

Last verified June 7, 2026