On WebArena, a GPT-4 agent finished 14% of web tasks versus 78% for humans

When the WebArena benchmark was introduced in July 2023, its best-performing agent, built on GPT-4, completed just 14.41 percent of the realistic, multi-step web tasks end to end. Humans performing the same tasks succeeded 78.24 percent of the time. The more than fivefold gap captured how far the most capable language models of the day were from reliably operating real websites, even as they appeared fluent in conversation. It also set a baseline against which later computer-use and web agents were measured.

Last verified June 7, 2026