WebArena, introduced in a paper posted to arXiv on July 25, 2023 by a Carnegie Mellon-led team, is a benchmark environment for testing language-model agents on realistic web tasks. Rather than scripted toy pages, it provides fully functional, self-hosted websites in four common domains (an online store, a social forum, a collaborative software-development site, and a content-management system), supplemented by tools such as a map and by reference material. Agents are given natural-language goals and must accomplish them by actually browsing and interacting with these sites.
The headline result was sobering. A GPT-4-based agent completed only 14.41 percent of the tasks end to end, against 78.24 percent for humans doing the same work. The gap made clear that, however fluent large models were at conversation, driving a real website to finish a multi-step goal was a different and much harder problem.
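To make the end-to-end metric concrete, here is a minimal sketch of how such a success rate could be tallied and how the reported gap works out. It assumes each task is scored as a simple pass or fail on the whole goal. The `TaskResult` type, the `success_rate` helper, and the four demo tasks are hypothetical illustrations, not part of the benchmark's own code. Only the two percentages come from the text.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str         # hypothetical identifier, for illustration only
    goal_achieved: bool  # True only if the whole goal was met


def success_rate(results):
    """Percentage of tasks completed end to end; partial progress earns nothing."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.goal_achieved for r in results) / len(results)


# Figures reported in the text, in percent.
GPT4_AGENT = 14.41
HUMAN = 78.24

gap_points = HUMAN - GPT4_AGENT  # absolute gap in percentage points
ratio = HUMAN / GPT4_AGENT       # how many times more often humans succeeded

# Made-up outcomes for four tasks, just to exercise the helper.
demo = [
    TaskResult("t1", True),
    TaskResult("t2", False),
    TaskResult("t3", False),
    TaskResult("t4", False),
]

print(f"demo success rate: {success_rate(demo):.2f}%")
print(f"gap: {gap_points:.2f} points, ratio: {ratio:.1f}x")
```

On the reported figures, the gap is 63.83 percentage points, which means humans succeeded roughly 5.4 times as often as the GPT-4-based agent.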
WebArena became a standard yardstick for the web-agent and computer-use research that followed, and its tasks reappear as evaluation targets for later agent systems. Because it is reproducible and self-contained, it gave the field a stable way to measure progress on the unglamorous but essential skill of getting things done in a browser.