VSI-Bench
A visual-spatial benchmark testing whether multimodal models can understand and recall spaces from video.
How AI is measured: benchmarks, each tied to the paper or site that defined it.
A benchmark that separates honesty from accuracy, measuring whether language models lie under pressure rather than just whether they are correct.
The 2025 successor to ARC-AGI, a harder reasoning benchmark on which frontier AI systems initially scored in the single digits.
An OpenAI benchmark of 1,266 questions that force a web agent to dig persistently for hard-to-find facts.
A Meta hallucination benchmark built on a clear taxonomy, with tasks that regenerate to resist leakage.
An OpenAI benchmark that scores AI on real, economically valuable work across 44 occupations in nine GDP sectors.