Cosmos 3: Omnimodal World Models for Physical AI

On June 1, 2026, a large team led by NVIDIA released Cosmos 3, an omnimodal world model for physical AI, described in a paper with more than 290 authors. A world model is a system that learns an internal simulation of how the world behaves, so that it can predict what happens next and plan actions. Physical AI refers to AI that perceives and acts in the real physical world, such as robots and autonomous machines.

Cosmos 3 unifies language, image, video, audio, and action within a single mixture-of-transformers architecture. The design consolidates what have usually been separate systems (vision-language models, video generators, world simulators, and policy models that decide actions) into one framework that supports flexible combinations of inputs and outputs. The authors report state-of-the-art results, including top rankings among open-source text-to-image and image-to-video models and a best policy-model result on the RoboArena evaluation.

Notably, the release is open. The team published code, model checkpoints, synthetic datasets, and evaluation benchmarks under the Linux Foundation OpenMDW-1.1 license, positioning omnimodal world models as general-purpose backbones for embodied agents rather than closed products.

For businesses watching robotics and automation, the release matters for two reasons. A single open model that perceives, generates, simulates, and acts lowers the barrier to building embodied AI systems, and it concentrates progress in one shared foundation rather than in many narrow tools.
