BIG-bench (Beyond the Imitation Game)

BIG-bench, short for “Beyond the Imitation Game benchmark,” was introduced in the June 2022 paper “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.” It is a collaboratively built evaluation suite: at publication it consisted of 204 tasks contributed by 450 authors across 132 institutions. The tasks deliberately span domains believed to be at or beyond the edge of what language models could then do, including linguistics, mathematics, common-sense reasoning, biology, physics, social bias, and software development, and many were designed to be hard to game.

The benchmark’s purpose was less to produce a single score than to map capabilities and how they change with scale. The authors evaluated models of different sizes and compared them against human raters, looking for where performance improved smoothly with scale, where it jumped suddenly, and where extra size barely helped. That made BIG-bench an important early instrument for studying so-called emergent behavior and the limits of scaling.

A distilled subset, BIG-bench Hard, collected 23 tasks on which earlier models had not outperformed the average human rater, and it became a popular target in its own right, notably for showing how chain-of-thought prompting could unlock large gains on exactly those hard tasks. BIG-bench helped move evaluation away from narrow single-task leaderboards toward broad, diverse probing of general capability, a direction that the earlier MMLU had also taken and that later benchmarks such as HELM continued.

Last verified June 7, 2026