PlanBench
A planning benchmark from the automated-planning community that exposes how poorly LLMs generate valid plans.
How AI is measured - each tied to the paper or site that defined it.
FLORES-200 is Meta's human-translated benchmark covering 200+ languages and 40,000+ translation directions, built to evaluate low-resource MT.
Google's web-scale multilingual image-text dataset, around 10 billion images with text in over 100 languages, built to train the PaLI vision-language model.
A multilingual grade-school math benchmark of 250 problems translated into ten diverse languages.
A broad benchmark and leaderboard that tests text embedding models across 8 task types, 58 datasets, and 112 languages.
The 23 hardest BIG-Bench tasks where models trailed humans, used to show chain-of-thought prompting unlocks hidden ability.
A Stanford framework that scored language models on seven metrics across many scenarios, not just accuracy, for transparent comparison.
A benchmark of 1,000 real data-science coding problems across seven Python libraries, drawn from StackOverflow.
Hugging Face's public ranking that scored open models on the same reproducible benchmarks, drawing over two million visitors.
A benchmark that grades foundation models on real human exams like the SAT, LSAT, and college entrance tests.
A 2023 benchmark that runs real API calls to test whether LLMs can plan, retrieve, and correctly invoke tools.
A 2023 benchmark that fixes the training code and competes on the data instead, using a pool of 12.8 billion image-text pairs to test dataset curation.
Augmented versions of HumanEval and MBPP with far more tests, exposing wrong code and changing model rankings.
An evaluation that breaks long text into atomic facts and scores the share supported by a reliable source.
A benchmark for repository-level code completion that tests retrieval and prediction across multiple files.
A 2023 benchmark of 2,000+ real tasks across 137 websites for testing agents that follow instructions on any site.
A set of multi-turn questions graded by a strong model acting as judge, used to score chat assistants on open-ended replies.
MultiMedQA, introduced with Med-PaLM in 2023, bundles seven medical question-answering sets to test how well LLMs encode clinical knowledge.
A 2023 CMU benchmark of realistic, self-hosted websites that exposed how poorly LLM agents perform real web tasks.
A 2023 benchmark spanning eight environments that measured how well LLMs act as agents, exposing a wide commercial-vs-open gap.
A benchmark of 6,141 problems that test whether models can do math reasoning over images, charts, and diagrams.
Greg Kamradt's simple test that plants a fact in long text and checks if a model can retrieve it.
A benchmark of about 500 prompts with verifiable instructions, scored automatically without human or model judges.
A 2023 Meta and Hugging Face benchmark of 466 real-world assistant tasks where humans scored 92% and GPT-4 with plugins only 15%.
A 2023 benchmark of 11,500 college-level multimodal questions across 30 subjects, where GPT-4V scored only 56%.
A 1,286-hour dataset pairing first-person and third-person video of skilled activities like sports, music, and dance.
A 2024 benchmark and agent that uses screenshots and multimodal models to complete tasks on 15 real websites.
A standardized framework for automated red teaming, comparing attacks that try to make models produce harmful content.
A benchmark of 3,668 questions that proxies hazardous biology, cyber, and chemistry knowledge in language models, paired with an unlearning method.
A code benchmark that keeps collecting new contest problems over time so models cannot have memorized the answers.
A synthetic long-context benchmark that finds most models hold far less context than they advertise.
A 2024 benchmark of 369 real computer tasks across Ubuntu, Windows, and macOS where humans scored 72% and the best agent only 12%.
Hugging Face's openly documented 15-trillion-token web dataset that set a new bar for transparent large-scale pre-training data.
A harder version of MMLU with ten answer choices and reasoning-focused questions, where scores dropped 16 to 33 percent.
A 2024 Sierra benchmark that tests tool-using agents in simulated customer conversations and measures how consistently they succeed.
A code benchmark of 1,140 practical tasks that require calling many real libraries and following complex instructions.
A contamination-limited benchmark with monthly fresh questions and objective scoring across six task categories.
Gymnasium is the maintained successor to OpenAI Gym, providing the de facto standard API for RL environments.
A human-validated 500-task subset of SWE-bench, built with OpenAI, that became the headline measure of AI coding agents.
A benchmark of 40 professional capture-the-flag tasks that measures how well AI agents can perform real cybersecurity work.
A hardened version of MMMU that strips out questions text-only models can guess and adds a vision-only mode.
A template-based variant of GSM8K showing model math accuracy drops when numbers change or irrelevant clauses are added.
A benchmark of malicious agent tasks that tests whether tool-using LLM agents refuse harmful requests and resist jailbreaks.
Epoch AI's benchmark of original research-level math problems, of which leading models initially solved less than two percent.
An OpenAI benchmark of 4,326 short factual questions that measures whether a model knows what it knows.
A METR benchmark of seven open-ended ML research tasks that compares AI agents against human experts on AI R&D work.
A Cohere-led rebuild of MMLU across 42 languages that flags culturally biased and Western-centric questions.
A 2024 CMU benchmark of 175 real workplace tasks in a simulated software company, where the best agent finished only about 30% autonomously.