BrowseComp

BrowseComp is a benchmark for web-browsing agents introduced by OpenAI researchers in April 2025, with the tagline “a simple yet challenging benchmark for browsing agents.” It contains 1,266 questions, each requiring an agent to search and navigate the open web to locate a piece of information that is hard to find but, once found, has a short and easily verifiable answer. The questions are constructed so that the answer cannot be guessed or recalled from memory; the agent has to actually browse, follow leads, and cross-check sources.

The design philosophy is to do for browsing agents what programming competitions did for coding agents: provide a clean, gradable target that stresses the core skill without the messiness of open-ended responses. Because answers are short and checkable, grading is reliable; the difficulty lies in the persistence and creativity that the search itself demands. The authors are explicit that BrowseComp deliberately sidesteps harder evaluation problems, such as judging long reports or resolving ambiguous requests, and focuses instead on the ability to keep digging until the fact is found.
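To illustrate why short answers make grading tractable, here is a minimal sketch of a scorer that compares a predicted answer against a reference after light normalization. This is an illustrative assumption, not BrowseComp's actual grading procedure, which is not described here; the function names `normalize`, `is_correct`, and `accuracy` are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", answer).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def is_correct(predicted: str, reference: str) -> bool:
    """Hypothetical check: normalized exact match against the reference."""
    return normalize(predicted) == normalize(reference)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, reference) pairs judged correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(p, r) for p, r in pairs) / len(pairs)
```

A strict match like this would reject answers that are phrased differently but mean the same thing, so a real grader would likely need something more forgiving. The point is only that a short reference answer reduces scoring to a simple comparison, which long-form reports do not allow.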

For businesses watching the rise of AI research assistants and “deep research” tools, BrowseComp is a useful yardstick. It measures the unglamorous but essential skill of finding a needle of fact across the messy live web, which is exactly what users expect from an agent that promises to look things up for them.
