On WebArena, a GPT-4 agent finished 14% of web tasks versus 78% for humans

fact July 25, 2023

When the WebArena benchmark was introduced in July 2023, the best agent its authors tested, built on GPT-4, completed just 14.41 percent of the benchmark's realistic, multi-step web tasks end to end. Humans performing the same tasks succeeded 78.24 percent of the time. The more than fivefold gap showed how far the most capable language models of the day were from reliably operating real websites, even as they appeared fluent in conversation. It also set a baseline against which later computer-use and web agents were measured.

Sources

PRIMARY https://arxiv.org/abs/2307.13854

Last verified June 7, 2026
