RE-Bench, the Research Engineering Benchmark, evaluates whether language model agents can carry out machine learning research and development tasks autonomously and, crucially, compares them against human experts on the same problems. It was introduced in a paper led by Hjalmar Wijk with 22 co-authors from METR, submitted on November 22, 2024, with an updated version in May 2025.
The benchmark comprises seven challenging, open-ended ML research environments. To build a human baseline, the authors collected 71 eight-hour attempts from 61 distinct human experts. The headline result is a crossover in performance as the time budget grows: with a two-hour budget per environment the best agents scored about four times higher than humans, but humans narrowly pulled ahead at eight hours and reached roughly twice the top agent’s score at 32 total hours. AI agents also generated and tested solutions more than ten times faster and far more cheaply than humans.
This pattern, in which AI performs strongly under short time budgets but gains less than humans do as budgets lengthen, is one of the most important empirical signals about how close agents are to automating AI research itself. METR open-sourced the environments, the human data, and the agent trajectories.
For a business reader, RE-Bench matters because automating AI research and development is both an enormous economic prize and a key dangerous-capability threshold, and this benchmark is one of the first rigorous ways to measure progress toward it.