Stable Video Diffusion: Scaling Latent Video Diffusion Models

“Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets,” posted to arXiv on November 25, 2023 by Andreas Blattmann and colleagues at Stability AI, extended the latent-diffusion approach behind Stable Diffusion from still images to video. The work focused less on a single architectural trick and more on the recipe of data curation and staged training needed to make video generation work reliably.

The authors laid out three distinct training stages: pretraining on images, pretraining on a large curated video dataset, and a final fine-tuning stage on a smaller set of high-quality video. They argued and demonstrated that careful filtering and ordering of the training data were decisive for quality. The resulting model could generate short video clips and, importantly, served as a strong general-purpose foundation that could be adapted to related tasks such as turning a single image into a video and producing multiple consistent views of an object for 3D applications.
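The staged recipe can be pictured with a minimal Python sketch. It is illustrative only and is not the authors' code: the names `Stage`, `keep_clip`, `run_pipeline`, and `fake_train`, the `score` field, and the 0.5 threshold are all hypothetical stand-ins. It shows just two ideas from the paragraph above: clips are filtered before training, and each stage starts from the weights the previous stage produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str  # label for the stage
    data: str  # kind of data the stage trains on


# The three stages, in the order described in the text.
STAGES = [
    Stage("image pretraining", "images"),
    Stage("video pretraining", "large curated video dataset"),
    Stage("high-quality fine-tuning", "smaller set of high-quality video"),
]


def keep_clip(clip: dict, min_score: float = 0.5) -> bool:
    """Hypothetical curation filter: keep a clip only if its score passes.

    The 'score' field and the threshold are assumptions for illustration.
    """
    return clip["score"] >= min_score


def run_pipeline(stages, train_stage):
    """Run the stages in order, passing each one the previous stage's weights."""
    weights = None  # nothing has been trained before the first stage
    for stage in stages:
        weights = train_stage(stage, weights)
    return weights


def fake_train(stage, weights):
    """Stand-in for real training: records which stages have run so far."""
    history = [] if weights is None else list(weights)
    return history + [stage.name]


# Toy usage: filter two made-up clips, then run the three stages in order.
clips = [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}]
curated = [c for c in clips if keep_clip(c)]
final = run_pipeline(STAGES, fake_train)
```

In a real system `fake_train` would be replaced by an actual training loop, and the filter would combine several quality signals rather than a single score.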

Released with open weights, Stable Video Diffusion gave the open community its first widely usable video generation foundation, arriving shortly before high-profile closed systems like OpenAI’s Sora drew mainstream attention to AI video. For a general reader, it shows the same pattern playing out in video that earlier played out in images: an open, reproducible model that documents what it actually takes, especially in data, to make generative video work.
