AgentBench

AgentBench, posted to arXiv on August 7, 2023 and accepted at ICLR 2024, was one of the first attempts to measure LLMs as agents across many settings at once rather than on a single task. It bundles eight distinct environments - spanning operating-system interaction, databases, knowledge graphs, card games, puzzles, web shopping, web browsing, and a household simulation - and scores a model on its reasoning and decision-making as it takes actions and observes results in each.
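The action-and-observation loop described above can be illustrated with a minimal sketch. This is not AgentBench's actual code or API: the toy environment, the two agents, and every name below (ToyCountingEnv, run_episode, success_rate, greedy_agent, stubborn_agent) are hypothetical, and the success-rate score is an assumed stand-in for the benchmark's own per-environment metrics.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyCountingEnv:
    """Hypothetical stand-in for one environment: reach `target` by adding."""
    target: int
    max_turns: int = 5
    total: int = 0

    def observe(self) -> str:
        # Text observation handed to the agent each turn.
        return f"total={self.total} target={self.target}"

    def step(self, action: str) -> None:
        # Only "add N" is understood; anything else is a wasted turn.
        parts = action.split()
        if len(parts) == 2 and parts[0] == "add" and parts[1].isdigit():
            self.total += int(parts[1])

    def solved(self) -> bool:
        return self.total == self.target


def run_episode(agent: Callable[[str], str], env: ToyCountingEnv) -> bool:
    """Alternate observation -> action until solved or the turn budget ends."""
    for _ in range(env.max_turns):
        if env.solved():
            return True
        env.step(agent(env.observe()))
    return env.solved()


def success_rate(agent: Callable[[str], str], targets: List[int]) -> float:
    """Fraction of episodes solved, one fresh environment per target."""
    results = [run_episode(agent, ToyCountingEnv(target=t)) for t in targets]
    return sum(results) / len(results)


def greedy_agent(observation: str) -> str:
    # Parse "total=X target=Y" and add the remaining difference in one move.
    fields = dict(item.split("=") for item in observation.split())
    return f"add {int(fields['target']) - int(fields['total'])}"


def stubborn_agent(observation: str) -> str:
    # Ignores the observation entirely; a crude model of weak decision-making.
    return "add 1"
```

In this sketch, greedy_agent solves every episode in one move, while stubborn_agent runs out of turns whenever the target exceeds the five-turn budget. A real multi-environment harness would run the same loop over each environment and report the scores side by side.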

The findings drew a sharp line between model tiers. The strongest proprietary models showed real agent ability, but there was a large gap between them and open-source competitors no bigger than 70B parameters. The authors pinned the main failures on three weaknesses: poor long-horizon reasoning, weak decision-making, and unreliable instruction following. They also reported a nuanced result on training data - that adding code to the training mix helped some agent tasks and hurt others, rather than helping uniformly.

AgentBench mattered because it gave the field a broad, multi-environment scorecard at the moment agent hype was peaking, and it grounded the conversation in measured capability. It arrived within weeks of WebArena, which was posted to arXiv in July 2023, and together with later tool-use evaluations it formed part of the wave of agent benchmarks that became a common way to track whether agents were actually improving.

Sources

Last verified June 7, 2026