“Diffusion for World Modeling: Visual Details Matter in Atari,” posted to arXiv on May 20, 2024 by Eloi Alonso, Adam Jelley, Vincent Micheli, and colleagues, introduced DIAMOND, short for DIffusion As a Model Of eNvironment Dreams. A world model is a learned simulator of an environment that an agent can train inside, “dreaming” experience instead of interacting with the real game or system. Most prior world models compressed each frame into a small discrete code, and the authors argued that this throws away visual details that matter for learning.
DIAMOND instead used a diffusion model as the world simulator, preserving the visual fidelity of frames. Agents trained entirely within this diffusion world model reached a mean human-normalized score of 1.46 on the competitive Atari 100k benchmark, a new best at the time for agents trained solely inside a learned model rather than the actual game. The paper was accepted as a spotlight at NeurIPS 2024, and the authors released both code and playable world models.
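To make the 1.46 figure concrete: a human-normalized score rescales a raw game score so that 0 corresponds to a random policy and 1 to a human reference player, and the benchmark result is the mean of that value across its 26 games. The sketch below shows the arithmetic only. The game names and scores are made up for illustration and are not taken from the paper.

```python
from statistics import mean


def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """Rescale a raw score so that random play maps to 0 and human play to 1."""
    return (agent - random_score) / (human - random_score)


# Hypothetical scores for three imaginary games (NOT real Atari results).
games = {
    "game_a": {"agent": 60.0, "random": 10.0, "human": 30.0},   # well above human
    "game_b": {"agent": 5.0, "random": 0.0, "human": 10.0},     # halfway to human
    "game_c": {"agent": 100.0, "random": 0.0, "human": 100.0},  # exactly human level
}

per_game = {
    name: human_normalized_score(s["agent"], s["random"], s["human"])
    for name, s in games.items()
}

# The headline number is the mean over all games in the benchmark.
mean_hns = mean(per_game.values())
superhuman_games = sum(1 for v in per_game.values() if v > 1.0)

print(per_game)
print(f"mean HNS = {mean_hns:.2f}, games above human level = {superhuman_games}")
```

A mean above 1 does not imply the agent beats humans on every game; as in this toy example, one strong game can pull the average up while others stay below human level.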
DIAMOND connected two threads that were converging in 2024: the diffusion models powering generative video and the world models used to train embodied agents. For a general reader, it shows why the same generative machinery behind AI video is increasingly seen as a way to build training grounds for robots and game agents, where letting an AI rehearse inside a learned simulation can be far cheaper and safer than real-world trials.
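The idea of rehearsing inside a learned simulation can be sketched in a few lines. This is a toy illustration under loud assumptions: `toy_world_model` is a hypothetical hand-written stand-in for a learned model (DIAMOND's actual simulator is a diffusion model that generates image frames), and `dream_rollout` is an illustrative name, not a function from the released code. The point is only that the agent's experience comes from the model's predictions, with no call to the real environment.

```python
from typing import Callable, List, Tuple

Transition = Tuple[int, int, float]  # (observation, action, reward)


def toy_world_model(obs: int, action: int) -> Tuple[int, float]:
    """Hypothetical stand-in for a learned simulator.

    Predicts the next observation and a reward from the current
    observation and action. A real world model would be a trained
    network; this one is a fixed rule so the example stays runnable.
    """
    next_obs = obs + action
    reward = 1.0 if next_obs == 3 else 0.0
    return next_obs, reward


def dream_rollout(
    model: Callable[[int, int], Tuple[int, float]],
    policy: Callable[[int], int],
    start_obs: int,
    horizon: int,
) -> List[Transition]:
    """Collect imagined experience by stepping the model, not the real game."""
    obs = start_obs
    trajectory: List[Transition] = []
    for _ in range(horizon):
        action = policy(obs)
        next_obs, reward = model(obs, action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


def always_step_forward(obs: int) -> int:
    """Trivial policy for the demo: always choose action 1."""
    return 1


trajectory = dream_rollout(toy_world_model, always_step_forward, start_obs=0, horizon=3)
total_reward = sum(reward for _, _, reward in trajectory)
print(trajectory, total_reward)
```

In a full system, trajectories like this one would feed a reinforcement learning update, and the policy would improve without any further interaction with the real environment.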