SWE-bench Verified

SWE-bench Verified is a curated, human-validated subset of SWE-bench, a benchmark that asks an AI system to resolve real GitHub issues from open-source Python projects by producing a patch that passes the project’s tests. OpenAI released it in August 2024, in collaboration with the original SWE-bench authors, to fix a measurement problem in the original benchmark: some tasks were effectively impossible, with under-specified issue descriptions or tests that were broken or rejected valid fixes, so the raw SWE-bench score understated how capable models really were.

To build it, expert software engineers reviewed SWE-bench problems against three criteria: whether the issue description was clear, whether the test patch was correct, and whether the task was actually solvable from the available information. From the problems that passed that scrutiny, 500 instances were selected. The result is a cleaner, more trustworthy signal of an agent’s ability to do realistic software-engineering work end to end: reading a bug report, navigating a codebase, and producing a working fix.
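The review-then-select step described above can be illustrated with a minimal sketch. It is not the actual curation pipeline: the `Review` record, its three boolean fields, the example instance IDs, and the seeded random draw are all assumptions made for illustration, and the real review may have used a different rubric and a different selection rule.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One reviewed benchmark instance. All field names are hypothetical."""
    instance_id: str
    issue_is_clear: bool    # the issue description is unambiguous
    tests_are_valid: bool   # the test patch accepts correct fixes
    is_solvable: bool       # solvable from the available information


def passes_review(review: Review) -> bool:
    """An instance is kept only if it meets all three criteria."""
    return (
        review.issue_is_clear
        and review.tests_are_valid
        and review.is_solvable
    )


def select_verified(reviews, k=500, seed=0):
    """Filter to passing instances, then draw at most k of them.

    The seeded random draw is an assumption; it only makes the
    sketch reproducible.
    """
    kept = [r.instance_id for r in reviews if passes_review(r)]
    rng = random.Random(seed)
    return sorted(rng.sample(kept, min(k, len(kept))))


# Hypothetical reviews: only the first and last meet every criterion.
reviews = [
    Review("proj__a-1", True, True, True),
    Review("proj__b-2", True, False, True),
    Review("proj__c-3", False, True, True),
    Review("proj__d-4", True, True, True),
]
print(select_verified(reviews))  # ['proj__a-1', 'proj__d-4']
```

Requiring all three criteria at once reflects the idea that a task is only a fair measurement if the problem statement, the tests, and the available context are all sound; a failure in any one of them makes a model's pass or fail uninformative.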

SWE-bench Verified quickly became the headline number that frontier labs cited when announcing coding agents and models, and scores climbed fast as agentic coding tools matured through 2024 and 2025. As with every benchmark, rising scores eventually compress the gap between top systems, so current leaderboard numbers change frequently and are not reproduced here. Its history is a clean example of how careful benchmark curation, and not just bigger models, is part of how a field measures real progress.
