Genie is Google DeepMind’s line of foundation world models: systems that generate interactive environments, rather than passive video, in which a person or an AI agent can act. The original Genie was described in “Genie: Generative Interactive Environments,” posted to arXiv on February 23, 2024, by Jake Bruce, Michael Dennis, and colleagues. It was an 11-billion-parameter model trained on large amounts of unlabeled internet video, with no action labels at all. The system combined a video tokenizer, a dynamics model, and a learned latent action model that discovered controllable actions on its own, letting it generate 2D playable worlds from an image, sketch, or text prompt.
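To make the division of labor between the latent action model and the dynamics model concrete, here is a deliberately tiny Python sketch. It is not Genie's implementation. Frames are stood in for by 2-element vectors rather than tokenized video, the four-entry codebook and the function names (`infer_latent_action`, `predict_next`, `rollout`) are hypothetical, and the codebook is fixed by hand rather than learned. It only illustrates the idea that a discrete action can be inferred from a pair of consecutive frames without any label, and that the same discrete actions can then drive generation.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical codebook (an assumption for illustration): each entry is the
# frame-to-frame change tied to one discrete latent action. In a real system
# this would be learned from video, not written by hand.
CODEBOOK: List[Vector] = [
    [1.0, 0.0],    # latent action 0
    [-1.0, 0.0],   # latent action 1
    [0.0, 1.0],    # latent action 2
    [0.0, -1.0],   # latent action 3
]


def infer_latent_action(prev: Sequence[float], nxt: Sequence[float]) -> int:
    """Toy latent action model: pick the codebook entry closest to the
    observed change between two consecutive frames (no action label used)."""
    delta = [b - a for a, b in zip(prev, nxt)]

    def sq_dist(code: Sequence[float]) -> float:
        return sum((d - c) ** 2 for d, c in zip(delta, code))

    return min(range(len(CODEBOOK)), key=lambda i: sq_dist(CODEBOOK[i]))


def predict_next(prev: Sequence[float], action: int) -> Vector:
    """Toy dynamics model: next frame from the previous frame plus an action."""
    return [p + c for p, c in zip(prev, CODEBOOK[action])]


def rollout(start: Sequence[float], actions: Sequence[int]) -> List[Vector]:
    """Generate a sequence of frames from a start frame and chosen actions."""
    frames: List[Vector] = [list(start)]
    for action in actions:
        frames.append(predict_next(frames[-1], action))
    return frames


if __name__ == "__main__":
    # Training-time view: recover an action from unlabeled consecutive frames.
    print(infer_latent_action([0.0, 0.0], [0.9, 0.1]))
    # Play-time view: a user supplies action indices and the model rolls forward.
    print(rollout([0.0, 0.0], [0, 2]))
    print(rollout([0.0, 0.0], [1, 3]))
```

In this toy, the same start frame rolled out under different action sequences ends in different states, which is the sense in which the generated world is controllable rather than a fixed clip.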
Genie 2, announced on December 4, 2024, scaled the idea to action-controllable 3D environments generated from a single image. DeepMind reported several emergent capabilities: realistic physics such as gravity, water, and smoke; character animation and object interaction; long-horizon memory that recalls parts of a world after they leave view; and counterfactual generation, in which the same starting frame can branch into different outcomes based on the user’s actions.
The motivation behind Genie is partly practical: training embodied agents needs diverse environments, and hand-building them is slow and expensive. A model that conjures endless playable worlds offers a path to generating that training ground automatically. For a general reader, Genie shows generative AI moving beyond making images and clips toward making interactive spaces, a foundation for game creation, robot training, and the broader pursuit of AI that understands how the physical world responds to action.