Spatial intelligence is the ability to perceive, understand, reason about, and interact with physical and virtual spaces in three dimensions: knowing not just what an object is, but where it is, how it relates to everything around it, and how it will move or respond when acted upon. In a November 10, 2025 essay titled “From Words to Worlds: Spatial Intelligence is AI’s Next Frontier,” the computer-vision researcher Fei-Fei Li argued that this capability, which humans use intuitively to park a car or catch a ball, is the scaffolding on which much of human cognition is built. She contends that it is AI’s next frontier beyond language.
Li’s central critique is that large language models, for all their fluency, remain ungrounded in physical reality; she describes them as eloquent but operating in the dark. The technical path she proposes toward spatial intelligence is the world model, which she defines by three properties. It must be generative, creating worlds that are geometrically and physically consistent; multimodal, accepting inputs such as images, text, gestures, and actions; and interactive, predicting how a world changes in response to actions or goals.
The concept ties together several threads in modern AI, including 3D scene generation, generative video, robotics, and embodied agents, around a shared claim about what intelligence requires. For a general reader, spatial intelligence explains why the field is pushing past chatbots: applications that demand a grounded understanding of space, from household robots to scientific simulation to immersive design, need machines that model the world rather than merely describe it.