VSI-Bench

VSI-Bench, the Visual-Spatial Intelligence Benchmark, was introduced in a December 2024 paper titled “Thinking in Space,” whose authors include Li Fei-Fei and Saining Xie. It contains more than 5,000 question-and-answer pairs built from video walkthroughs of real indoor environments. The benchmark tests whether a multimodal model can do what humans do almost effortlessly: watch a video of a space, build a mental model of where things are, and then answer questions about the distances, sizes, counts, and relative positions of objects.

The findings show a clear gap. Models performed competitively on some question types but stayed well below human level overall, and the authors identified spatial reasoning, rather than basic perception, as the central bottleneck. Notably, chain-of-thought prompting, which helps on many text and math tasks, did not improve spatial performance here. What did help was having the model build an explicit cognitive map as it worked. This improved its judgments about spatial distance, which suggests that an internal representation of space is a key missing piece.
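
To make the cognitive-map idea concrete, here is a toy sketch in Python. It is not the paper's implementation: the grid coordinates, object names, and helper functions are all hypothetical, chosen only to show how an explicit map of object positions turns a relative-distance question into a simple lookup and comparison.

```python
import math

# Hypothetical cognitive map: object name -> (row, col) cell on a coarse grid.
# The coordinates are illustrative assumptions, not data from VSI-Bench.
cognitive_map = {
    "sofa": (2, 3),
    "table": (4, 4),
    "lamp": (8, 1),
    "door": (9, 9),
}


def grid_distance(a, b):
    """Euclidean distance between two grid cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def closest_object(cmap, target, candidates):
    """Return the candidate whose grid cell is nearest to the target's cell."""
    return min(candidates, key=lambda name: grid_distance(cmap[name], cmap[target]))


# "Which of the table, the lamp, or the door is closest to the sofa?"
answer = closest_object(cognitive_map, "sofa", ["table", "lamp", "door"])
print(answer)
```

Once positions are written down explicitly, the distance comparison itself is trivial. The hard part for a model is producing an accurate map from the video in the first place.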

Spatial intelligence underlies robotics, navigation, augmented reality, and any application where an AI must act in the physical world rather than just process text. VSI-Bench matters because it isolates that capability and shows that today’s strong language-and-vision models still fall short of human spatial understanding.
