Benchmarks

How AI is measured: each benchmark tied to the paper or site that defined it.

102 entries, all primary-sourced
benchmark December 18, 2024

VSI-Bench

A visual-spatial benchmark testing whether multimodal models can understand and recall spaces from video.

benchmark March 5, 2025

MASK Benchmark (Honesty)

A benchmark that separates honesty from accuracy, measuring whether language models lie under pressure rather than just whether they are correct.

benchmark March 24, 2025

ARC-AGI-2

The 2025 successor to ARC-AGI, a harder reasoning benchmark on which frontier AI systems initially scored in the single digits.

benchmark April 16, 2025

BrowseComp

OpenAI benchmark of 1,266 questions that force a web agent to dig persistently for hard-to-find facts.

benchmark April 25, 2025

HalluLens

A Meta hallucination benchmark built on a clear taxonomy, with tasks whose test sets are regenerated to resist data leakage.

benchmark October 5, 2025

GDPval

OpenAI benchmark scoring AI on real economically valuable work across 44 occupations in nine GDP sectors.