FrontierMath

FrontierMath is a mathematics benchmark introduced by Epoch AI in the November 2024 paper “FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI.” Unlike competition-style benchmarks such as AIME or the MATH dataset, its problems are original, unpublished, and pitched at research level: the paper describes problems that typically take a professional mathematician multiple hours, and the hardest ones multiple days, to solve. The problems were crafted and vetted by expert mathematicians across major branches of the field.

The benchmark was designed to be hard precisely where existing math benchmarks had become easy. When it was released, state-of-the-art models solved under two percent of the problems, a stark contrast with the near-saturated scores that models were posting on grade-school and competition math benchmarks. Because the problems are unpublished, FrontierMath also resists the contamination that quietly inflates scores when test questions leak into training data.

FrontierMath was structured into difficulty tiers plus a set of genuinely open research problems, and it became a closely watched yardstick for whether AI was approaching expert-level mathematical reasoning. Its later association with a funding-disclosure controversy, covered in a separate entry, became part of a broader debate about transparency and conflicts of interest in AI benchmarking. The benchmark itself remains a demanding test of advanced reasoning that frontier systems are still far from mastering.
