Benchmarks

How AI is measured: benchmarks, each tied to the paper or site that defined it.

102 entries, all primary-sourced

benchmark June 21, 2022

PlanBench

A planning benchmark from the automated-planning community that exposes how poorly LLMs generate valid plans.

benchmark July 6, 2022

FLORES-200 (low-resource translation benchmark)

FLORES-200 is Meta's human-translated benchmark covering 200+ languages and 40,000+ translation directions, built to evaluate low-resource MT.

benchmark September 14, 2022

WebLI

Google's web-scale multilingual image-text dataset, around 10 billion images with text in over 100 languages, built to train the PaLI vision-language model.

benchmark October 6, 2022

MGSM

A multilingual grade-school math benchmark of 250 problems translated into ten diverse languages.

benchmark October 13, 2022

MTEB: Massive Text Embedding Benchmark

A broad benchmark and leaderboard that tests text embedding models across 8 task types, 58 datasets, and 112 languages.

benchmark October 17, 2022

BIG-Bench Hard (BBH)

The 23 hardest BIG-Bench tasks where models trailed humans, used to show chain-of-thought prompting unlocks hidden ability.

benchmark November 17, 2022

HELM (Holistic Evaluation of Language Models)

A Stanford framework that scored language models on seven metrics across many scenarios, not just accuracy, for transparent comparison.

benchmark November 18, 2022

DS-1000

A benchmark of 1,000 real data-science coding problems across seven Python libraries, drawn from StackOverflow.

benchmark 2023

Open LLM Leaderboard

Hugging Face's public ranking that scored open models on the same reproducible benchmarks, drawing over two million visitors.

benchmark April 13, 2023

AGIEval

A benchmark that grades foundation models on real human exams like the SAT, LSAT, and college entrance tests.

benchmark April 14, 2023

API-Bank

A 2023 benchmark that runs real API calls to test whether LLMs can plan, retrieve, and correctly invoke tools.

benchmark April 27, 2023

DataComp

A 2023 benchmark that fixes the training code and has participants compete on the data instead, using a pool of 12.8 billion image-text pairs to test dataset curation.

benchmark May 2, 2023

EvalPlus (HumanEval+ and MBPP+)

Augmented versions of HumanEval and MBPP with far more tests, exposing incorrect code that passed the original tests and changing model rankings.

benchmark May 23, 2023

FActScore

An evaluation that breaks long text into atomic facts and scores the share supported by a reliable source.
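
The scoring step described above can be sketched as follows. This is a minimal illustration, assuming the atomic facts have already been extracted and each one has already been labeled as supported or not. The `AtomicFact` type and `fact_score` function are hypothetical names, and the sample facts and labels are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class AtomicFact:
    text: str
    supported: bool  # assumed to come from a separate verification step


def fact_score(facts: list[AtomicFact]) -> float:
    """Return the share of atomic facts marked as supported."""
    if not facts:
        raise ValueError("need at least one atomic fact to score")
    return sum(f.supported for f in facts) / len(facts)


# Illustrative, made-up labels for one generated passage.
facts = [
    AtomicFact("Marie Curie was born in Warsaw.", True),
    AtomicFact("Marie Curie won two Nobel Prizes.", True),
    AtomicFact("Marie Curie was born in 1900.", False),
    AtomicFact("Marie Curie studied radioactivity.", True),
]
print(fact_score(facts))  # 3 of 4 facts supported
```

The hard parts in practice are the decomposition and the verification against a source; the sketch only covers the final aggregation.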

benchmark June 5, 2023

RepoBench

A benchmark for repository-level code completion that tests retrieval and prediction across multiple files.

benchmark June 9, 2023

Mind2Web

A 2023 benchmark of 2,000+ real tasks across 137 websites for testing agents that follow instructions on any site.

benchmark June 9, 2023

MT-Bench

A set of multi-turn questions graded by a strong model acting as judge, used to score chat assistants on open-ended replies.

benchmark July 12, 2023

MultiMedQA

MultiMedQA, introduced with Med-PaLM in 2023, bundles seven medical question-answering sets to test how well LLMs encode clinical knowledge.

benchmark July 25, 2023

WebArena

A 2023 CMU benchmark of realistic, self-hosted websites that exposed how poorly LLM agents perform real web tasks.

benchmark August 7, 2023

AgentBench

A 2023 benchmark spanning eight environments that measured how well LLMs act as agents, exposing a wide commercial-vs-open gap.

benchmark October 3, 2023

MathVista

A benchmark of 6,141 problems that test whether models can do math reasoning over images, charts, and diagrams.

benchmark November 8, 2023

Needle In A Haystack

Greg Kamradt's simple test that plants a fact in long text and checks if a model can retrieve it.
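
A rough sketch of the idea, under simplifying assumptions: depth is measured in characters rather than tokens, the needle is snapped to a sentence boundary, and retrieval is judged by a plain substring match. The function names, the filler text, and the needle are all hypothetical, and the model call is replaced by a hard-coded stand-in.

```python
def plant_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end),
    snapping back to the nearest preceding sentence boundary."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    cut = int(len(haystack) * depth)
    boundary = haystack.rfind(". ", 0, cut)
    pos = boundary + 2 if boundary != -1 else 0
    return haystack[:pos] + needle + " " + haystack[pos:]


def retrieved(answer: str, expected: str) -> bool:
    """Loose check: does the model's answer contain the expected string?"""
    return expected.lower() in answer.lower()


filler = "The sky was clear that day. " * 200
needle = "The secret code is 4417."
context = plant_needle(filler, needle, depth=0.5)
prompt = context + "\n\nWhat is the secret code?"

# `answer` would come from the model under test; a stand-in is used here.
answer = "The secret code is 4417."
print(retrieved(answer, "4417"))
```

Sweeping `depth` and the length of `filler` gives a grid of retrieval results by position and context size.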

benchmark November 14, 2023

IFEval (Instruction-Following Eval)

A benchmark of about 500 prompts with verifiable instructions, scored automatically without human or model judges.
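
To illustrate what "verifiable" means here, the sketch below pairs a response with a few programmatic checks. The specific checks and names (`min_words`, `keyword_count`, `no_commas`, `follows_all`) are illustrative assumptions, not the benchmark's actual instruction set or code.

```python
import re
from typing import Callable

Check = Callable[[str], bool]


def min_words(n: int) -> Check:
    """Instruction: 'answer in at least n words'."""
    return lambda response: len(response.split()) >= n


def keyword_count(keyword: str, at_least: int) -> Check:
    """Instruction: 'mention <keyword> at least k times' (case-insensitive)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return lambda response: len(pattern.findall(response)) >= at_least


def no_commas() -> Check:
    """Instruction: 'do not use any commas'."""
    return lambda response: "," not in response


def follows_all(response: str, checks: list[Check]) -> bool:
    """A prompt counts as followed only if every attached check passes."""
    return all(check(response) for check in checks)


checks = [min_words(5), keyword_count("benchmark", 2), no_commas()]
good = "This benchmark is a small benchmark for testing"
bad = "Short, sorry"
print(follows_all(good, checks), follows_all(bad, checks))
```

Because each check is a deterministic function of the response text, scoring needs neither a human rater nor a judge model.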

benchmark November 21, 2023

GAIA (General AI Assistants benchmark)

A 2023 Meta and Hugging Face benchmark of 466 real-world assistant tasks where humans scored 92% and GPT-4 with plugins only 15%.

benchmark November 27, 2023

MMMU (Massive Multi-discipline Multimodal Understanding)

A 2023 benchmark of 11,500 college-level multimodal questions across 30 subjects, where GPT-4V scored only 56%.

benchmark November 30, 2023

Ego-Exo4D

A 1,286-hour dataset pairing first-person and third-person video of skilled activities like sports, music, and dance.

benchmark January 25, 2024

WebVoyager

A 2024 benchmark and agent that uses screenshots and multimodal models to complete tasks on 15 real websites.

benchmark February 6, 2024

HarmBench

A standardized framework for automated red teaming, comparing attacks that try to make models produce harmful content.

benchmark March 5, 2024

WMDP (Weapons of Mass Destruction Proxy)

A benchmark of 3,668 questions that serves as a proxy for hazardous knowledge in biosecurity, cybersecurity, and chemical security, paired with an unlearning method.

benchmark March 12, 2024

LiveCodeBench

A code benchmark that keeps collecting new contest problems over time so models cannot have memorized the answers.
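
One way such a time-stamped pool can be used, sketched under assumptions: each problem carries a release date, and a model is scored only on problems released after its training cutoff. The `Problem` type, the `unseen_problems` helper, and the sample titles and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    title: str
    released: date


def unseen_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up pool for illustration.
pool = [
    Problem("two-pointer warmup", date(2023, 5, 1)),
    Problem("interval scheduling", date(2023, 11, 12)),
    Problem("tree rerooting", date(2024, 2, 3)),
]
fresh = unseen_problems(pool, training_cutoff=date(2023, 9, 30))
print([p.title for p in fresh])
```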

benchmark April 9, 2024

RULER

A synthetic long-context benchmark that finds most models can effectively use far less context than they advertise.

benchmark April 11, 2024

OSWorld

A 2024 benchmark of 369 real computer tasks across Ubuntu, Windows, and macOS where humans scored 72% and the best agent only 12%.

benchmark April 21, 2024

FineWeb

Hugging Face's openly documented 15-trillion-token web dataset that set a new bar for transparent large-scale pre-training data.

benchmark June 3, 2024

MMLU-Pro

A harder version of MMLU with ten answer choices and reasoning-focused questions, where scores dropped 16 to 33 percent compared with MMLU.

benchmark June 17, 2024

tau-bench (Tool-Agent-User benchmark)

A 2024 Sierra benchmark that tests tool-using agents in simulated customer conversations and measures how consistently they succeed.
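
One way to quantify consistency is to ask how likely an agent is to succeed on all of k attempts at the same task. The sketch below estimates that from n recorded trials with c successes as C(c, k) / C(n, k), averaged over tasks. This formulation and the function names are assumptions for illustration and are not taken from the entry above; the trial counts are made up.

```python
from math import comb


def all_k_succeed(n: int, c: int, k: int) -> float:
    """Estimate the chance that k trials drawn from n (c successes) all pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def consistency(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n_trials, n_successes) pairs."""
    return sum(all_k_succeed(n, c, k) for n, c in results) / len(results)


# Three tasks, four trials each: always, sometimes, and never solved.
results = [(4, 4), (4, 2), (4, 0)]
print(consistency(results, 1))  # plain average success rate
print(consistency(results, 2))  # stricter: both of two attempts must pass
```

The gap between the two printed values shows how an agent that succeeds only some of the time is penalized as k grows.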

benchmark June 22, 2024

BigCodeBench

A code benchmark of 1,140 practical tasks that require calling many real libraries and following complex instructions.

benchmark June 27, 2024

LiveBench

A contamination-limited benchmark with fresh questions released monthly and objective scoring across six task categories.

benchmark July 24, 2024

Gymnasium

Gymnasium is the maintained successor to OpenAI Gym, providing the de facto standard API for RL environments.

benchmark August 13, 2024

SWE-bench Verified

A human-validated 500-task subset of SWE-bench, built with OpenAI, that became the headline measure of AI coding agents.

benchmark August 15, 2024

Cybench

A benchmark of 40 professional capture-the-flag tasks that measures how well AI agents can perform real cybersecurity work.

benchmark September 4, 2024

MMMU-Pro

A hardened version of MMMU that strips out questions text-only models can guess and adds a vision-only mode.

benchmark October 7, 2024

GSM-Symbolic

A template-based variant of GSM8K showing that models' math accuracy drops when numbers change or irrelevant clauses are added.
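
The template idea can be sketched as below: a question is stored with placeholders, each variant fills in fresh values, and the answer is recomputed from those values. The template, the distractor clause, and the `make_variant` helper are made up for illustration and are not drawn from the benchmark itself.

```python
import random

TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "{name} then gives away {c}. How many apples are left?"
)
# An irrelevant clause that should not change the answer.
DISTRACTOR = " Five of the apples are slightly smaller than the rest."


def make_variant(rng: random.Random, add_distractor: bool = False) -> tuple[str, int]:
    """Fill the template with fresh values and compute the matching answer."""
    name = rng.choice(["Ava", "Ben", "Chloe"])
    a, b = rng.randint(5, 20), rng.randint(1, 10)
    c = rng.randint(1, a + b)  # keep the answer non-negative
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    if add_distractor:
        # Insert the irrelevant clause just before the final question.
        stem, sep, ask = question.rpartition(". ")
        question = stem + sep.rstrip() + DISTRACTOR + " " + ask
    return question, a + b - c


rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng, add_distractor=True)
    print(answer, "|", question)
```

Comparing accuracy across many variants of the same template separates reasoning from recall of a specific memorized instance.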

benchmark October 11, 2024

AgentHarm

A benchmark of malicious agent tasks that tests whether tool-using LLM agents refuse harmful requests and resist jailbreaks.

benchmark November 7, 2024

FrontierMath

Epoch AI's benchmark of original research-level math problems, of which leading models initially solved less than two percent.

benchmark November 7, 2024

SimpleQA

An OpenAI benchmark of 4,326 short factual questions that measures whether a model knows what it knows.

benchmark November 22, 2024

RE-Bench (Research Engineering Benchmark)

A METR benchmark of seven open-ended ML research tasks that compares AI agents against human experts on AI R&D work.

benchmark December 4, 2024

Global-MMLU

A Cohere-led rebuild of MMLU across 42 languages that flags culturally biased and Western-centric questions.

benchmark December 18, 2024

TheAgentCompany

A 2024 CMU benchmark of 175 real workplace tasks in a simulated software company, where the best agent finished only about 30% autonomously.