Lumiere: A Space-Time Diffusion Model for Video Generation

“Lumiere: A Space-Time Diffusion Model for Video Generation,” posted to arXiv on January 23, 2024 by Omer Bar-Tal, Hila Chefer, and colleagues at Google Research and academic collaborators, proposed generating a video clip all at once rather than in stages. Most prior text-to-video systems generated a few temporally distant keyframes and then filled in the gaps with a separate temporal super-resolution step. That approach often produced flicker or inconsistent motion, because no single stage of the pipeline ever reasoned about the whole clip at once.

Lumiere’s key idea was a Space-Time U-Net architecture that generates the entire temporal duration of a video in a single pass. By downsampling and then upsampling the signal in both space and time, and doing most of its computation on that compact space-time representation, the model could commit to globally coherent motion from the start. It was built on top of a pre-trained text-to-image diffusion model, reusing that model’s knowledge of how the visual world looks, and it supported applications such as image-to-video, stylized generation, and video inpainting.
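
To make the idea of resampling in both space and time concrete, here is a minimal sketch that uses fixed average pooling and nearest-neighbor repetition on a tiny grayscale clip stored as nested lists indexed [time][row][column]. It is only an illustration: the function names and factors are hypothetical, and the actual model uses learned layers inside a U-Net, which this toy example does not attempt to reproduce.

```python
def downsample_space_time(video, t_factor=2, s_factor=2):
    """Average-pool a [time][row][col] clip over time and space together."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(0, T, t_factor):
        frame = []
        for y in range(0, H, s_factor):
            row = []
            for x in range(0, W, s_factor):
                # Collect one space-time block and replace it by its mean.
                block = [video[t + dt][y + dy][x + dx]
                         for dt in range(t_factor)
                         for dy in range(s_factor)
                         for dx in range(s_factor)]
                row.append(sum(block) / len(block))
            frame.append(row)
        out.append(frame)
    return out


def upsample_space_time(video, t_factor=2, s_factor=2):
    """Nearest-neighbor upsampling: repeat values along width, height, time."""
    out = []
    for frame in video:
        big = []
        for row in frame:
            wide = [v for v in row for _ in range(s_factor)]
            for _ in range(s_factor):
                big.append(list(wide))
        for _ in range(t_factor):
            out.append([list(r) for r in big])
    return out


def shape(video):
    """Return (frames, height, width) of a nested-list clip."""
    return (len(video), len(video[0]), len(video[0][0]))


# Toy clip: 4 frames of 4x4 pixels, pixel value = t + y + x.
clip = [[[float(t + y + x) for x in range(4)] for y in range(4)]
        for t in range(4)]
coarse = downsample_space_time(clip)
restored = upsample_space_time(coarse)
print(shape(clip), shape(coarse), shape(restored))
```

The coarse clip has half as many frames as well as half the height and width, so any processing applied at that level sees a compressed view of the whole clip's motion rather than one frame or one keyframe gap at a time.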

Lumiere arrived in the same wave as Sora and the first commercial video generators, and its single-pass, full-duration design influenced how researchers thought about temporal consistency. For a general reader, it illustrates a recurring lesson in generative video: getting smooth, believable motion is less about resolution than about whether the model considers the clip as a whole rather than stitching together pieces.

