TheAgentCompany

TheAgentCompany, in full “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks,” is a benchmark paper posted to arXiv on December 18, 2024 by a large team led by Frank F. Xu and Graham Neubig at Carnegie Mellon University, together with outside collaborators. It asks a question employers actually care about: how much of a knowledge worker’s job can an AI agent do on its own?

To answer it, the benchmark builds a self-contained, reproducible environment that mimics a small software company, complete with internal websites, data, and simulated colleagues that the agent must communicate with. Within that environment it defines 175 diverse, realistic professional tasks spanning software engineering, project management, financial analysis, and administrative work. Each task is broken into checkpoints that are verified with execution-based checks, so an agent earns partial credit for partial progress instead of a bare pass or fail. The paper reported that the most competitive agent completed around 30 percent of tasks autonomously: simpler tasks were within reach, but the difficult, long-horizon work that fills a real job still defeated current systems.
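The idea of checkpoint-based partial credit can be illustrated with a minimal sketch. This is not the benchmark's actual grading code: the `Checkpoint` class, the `score_task` function, the point values, and the simple "earned over total" rule are all assumptions made for illustration, and the paper's exact weighting may differ.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One gradable milestone within a task (hypothetical structure)."""
    description: str
    points: int
    passed: bool


def score_task(checkpoints):
    """Return (partial_score, fully_completed) for one task.

    Assumed rule: partial_score is the fraction of available points earned;
    a task counts as fully completed only if every checkpoint passed.
    """
    total = sum(c.points for c in checkpoints)
    if total <= 0:
        raise ValueError("a task needs at least one checkpoint worth points")
    earned = sum(c.points for c in checkpoints if c.passed)
    return earned / total, earned == total


# Illustrative task: the agent finds the file and messages a colleague,
# but never produces the final report.
example = [
    Checkpoint("located the spreadsheet on the internal file server", 1, True),
    Checkpoint("asked the simulated colleague for the missing figures", 2, True),
    Checkpoint("uploaded a correct summary report", 2, False),
]
partial, completed = score_task(example)
print(partial, completed)
```

Under this assumed rule the agent in the example is credited for the two milestones it reached while the task is still recorded as not completed, which is the distinction between partial progress and autonomous completion that the benchmark's reporting relies on.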

TheAgentCompany became a widely cited reality check against the hype that agents were about to replace large fractions of office work. By grounding the claim in consequential, multi-step workplace tasks, and by reporting how far short agents fell, it gave businesses a concrete sense of what “AI does your job” actually meant at the end of 2024.
