ARC-AGI-2

ARC-AGI-2 is the 2025 successor to ARC-AGI, the abstraction-and-reasoning benchmark created by François Chollet, in which a system must infer a transformation rule from a few example grid pairs and apply it to a new input. It was launched after frontier reasoning systems had largely conquered the first version (OpenAI’s o3 had lifted frontier-model scores on ARC-AGI-1 from the single digits to the high 80s, in percent) and was deliberately redesigned to restore a wide human-machine gap.
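To make the task format concrete, here is a minimal Python sketch of a single puzzle and a candidate rule checked against its demonstration pairs. The "train"/"test" layout follows the public ARC task files; the grids, the mirror rule, and the helper names (`toy_task`, `mirror_left_right`, `fits_examples`) are invented for illustration and are not drawn from the benchmark.

```python
from typing import Callable, List

Grid = List[List[int]]  # each cell is a small integer standing for a color

# Hypothetical toy task: every output is the input mirrored left to right.
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]}],
}


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def fits_examples(rule: Callable[[Grid], Grid], task: dict) -> bool:
    """Return True if the rule reproduces every demonstration output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])


if __name__ == "__main__":
    if fits_examples(mirror_left_right, toy_task):
        # Only after the rule explains all demonstrations is it applied
        # to the unseen test input.
        print(mirror_left_right(toy_task["test"][0]["input"]))
```

A real solver has to discover the rule rather than be handed it; the sketch only shows the shape of the problem, in which a rule is accepted when it explains every example pair and is then applied once to the new input.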

The new version removes puzzle types that proved susceptible to brute-force search, adds task types requiring symbolic interpretation, compositional reasoning, and contextual rule application, and introduces explicit efficiency metrics so that cost counts, not just accuracy. Each evaluation set holds 120 tasks. The benchmark is human-calibrated: every task was solved by at least two people within two attempts during testing with hundreds of participants, while leading AI systems scored only in the single digits at launch, a striking reversal of the saturated first benchmark.
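The two-attempt criterion and the cost-aware reporting described above can be sketched as follows. This is an illustration under stated assumptions, not the official scoring code: it assumes one test input per task, exact grid match as the success test, and cost recorded in US dollars. The names `task_solved`, `summarize`, and `cost_usd`, and all sample values, are hypothetical.

```python
from typing import Dict, List

Grid = List[List[int]]


def task_solved(attempts: List[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """A task counts as solved if any of the first `max_attempts` guesses matches exactly."""
    return any(attempt == expected for attempt in attempts[:max_attempts])


def summarize(results: List[Dict]) -> Dict[str, float]:
    """Report accuracy together with average cost, so efficiency is visible next to score."""
    solved = sum(task_solved(r["attempts"], r["expected"]) for r in results)
    total_cost = sum(r["cost_usd"] for r in results)
    return {
        "accuracy": solved / len(results),
        "cost_per_task": total_cost / len(results),
    }


# Made-up results for four tasks (assumed values, for illustration only).
target = [[1, 2], [3, 4]]
wrong = [[0, 0], [0, 0]]
sample_results = [
    {"attempts": [target], "expected": target, "cost_usd": 0.5},                # right on attempt 1
    {"attempts": [wrong, target], "expected": target, "cost_usd": 1.5},         # right on attempt 2
    {"attempts": [wrong, wrong, target], "expected": target, "cost_usd": 1.0},  # attempt 3 is ignored
    {"attempts": [wrong], "expected": target, "cost_usd": 1.0},                 # never right
]

if __name__ == "__main__":
    print(summarize(sample_results))
```

Reporting the two numbers side by side is the point of the design: a system that reaches the same accuracy at a much higher cost per task is distinguishable from one that gets there cheaply.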

ARC-AGI-2 keeps the central premise of the ARC family: tasks that are easy for humans but hard for machines are a sharper signal of genuine fluid reasoning than tasks where models can lean on memorized knowledge. Paired with an ongoing prize competition, it is meant to resist the saturation and contamination that eventually blunt most benchmarks, and to keep pointing at the capabilities that still separate AI from human-style general reasoning.
