Cybench is a framework for measuring how well language model agents can perform autonomous cybersecurity tasks, specifically identifying vulnerabilities and executing exploits. It was introduced in a paper led by Andy K. Zhang with 26 co-authors, including Dan Boneh and Percy Liang, first submitted on August 15, 2024, and later accepted as an oral presentation at ICLR 2025.
The benchmark consists of 40 professional-level capture-the-flag (CTF) tasks drawn from four distinct CTF competitions. Beyond the full tasks, Cybench decomposes challenges into subtasks that represent intermediary steps, allowing more granular assessment of where an agent succeeds or fails. Task difficulty is calibrated by first solve time, the time the fastest human team needed to solve each challenge in the original competition, which ranges from 2 minutes to 24 hours and 54 minutes.
When the authors tested eight models, including GPT-4o and Claude 3.5 Sonnet, the strongest agents, working without subtask guidance, solved only tasks that human teams had cleared in 11 minutes or less, and none solved the hardest tasks. All code and data were released publicly.
Cybench matters because offensive cyber capability is one of the clearest dangerous-capability concerns for AI, and a standardized, human-calibrated benchmark gives labs and regulators a concrete way to track how close agents are getting to autonomous hacking.