Cosmos 3: Omnimodal World Models for Physical AI

On June 1, 2026, a large team led by NVIDIA released Cosmos 3, an omnimodal world model for physical AI, described in a paper with more than 290 authors. A world model is a system that learns an internal simulation of how the world behaves so it can predict what happens next and plan actions; physical AI refers to AI that perceives and acts in the real physical world, such as robots and autonomous machines.

Cosmos 3 unifies language, image, video, audio, and action within a single mixture-of-transformers architecture. The design consolidates what have usually been separate systems (vision-language models, video generators, world simulators, and policy models that decide actions) into one framework that supports flexible combinations of inputs and outputs. The authors report state-of-the-art results, including top rankings among open-source text-to-image and image-to-video models and the best result for a policy model in the RoboArena evaluation.
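
To make "flexible combinations of inputs and outputs" concrete, here is a minimal sketch that treats each formerly separate system as one input/output configuration over the five modalities named above. It is an illustration only and not the Cosmos 3 API: the names Modality, Task, TASKS, and tasks_producing are hypothetical, and the modality assigned to each system is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    """The five modalities the article says Cosmos 3 unifies."""
    LANGUAGE = "language"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    ACTION = "action"


@dataclass(frozen=True)
class Task:
    """One input/output configuration of a single model (hypothetical type)."""
    name: str
    inputs: frozenset
    outputs: frozenset


M = Modality

# Assumption for illustration: plausible inputs and outputs for each of the
# four kinds of system the article lists. The real model may differ.
TASKS = [
    Task("vision-language model", frozenset({M.IMAGE, M.VIDEO, M.LANGUAGE}), frozenset({M.LANGUAGE})),
    Task("video generator", frozenset({M.LANGUAGE, M.IMAGE}), frozenset({M.VIDEO})),
    Task("world simulator", frozenset({M.VIDEO, M.ACTION}), frozenset({M.VIDEO})),
    Task("policy model", frozenset({M.VIDEO, M.LANGUAGE}), frozenset({M.ACTION})),
]


def tasks_producing(tasks, modality):
    """Return the names of tasks whose outputs include the given modality."""
    return [task.name for task in tasks if modality in task.outputs]


if __name__ == "__main__":
    for modality in Modality:
        print(modality.value, "<-", tasks_producing(TASKS, modality))
```

Under these assumptions, the four systems differ only in which modalities go in and which come out, which is the sense in which one framework can stand in for all of them.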

Notably, the release is open. The team published code, model checkpoints, synthetic datasets, and evaluation benchmarks under the Linux Foundation OpenMDW-1.1 license, positioning omnimodal world models as general-purpose backbones for embodied agents rather than closed products.

For businesses watching robotics and automation, the release matters for two reasons. A single open model that perceives, generates, simulates, and acts lowers the barrier to building embodied AI systems, and it concentrates progress in one shared foundation rather than many narrow tools.

Last verified June 7, 2026