tau-bench (Tool-Agent-User benchmark)

tau-bench (styled τ-bench, with the Greek letter tau, in the original paper) was introduced in a paper posted to arXiv on June 17, 2024 by Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan of the agent startup Sierra. Its full title is “τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.” Unlike benchmarks that hand an agent a fixed task, tau-bench simulates a dynamic conversation: a language model plays a customer, while the agent under test must call domain APIs and follow a written policy to resolve the customer's request. The released domains were retail and airline customer service.

The benchmark’s most influential idea is its reliability metric. Instead of reporting only an average pass rate, it reports pass^k, the probability that an agent succeeds on the same task in all k of k independent attempts, which captures consistency rather than just best-case ability. The results were humbling: even strong function-calling agents like GPT-4o succeeded on fewer than half the tasks, and their pass^8 in the retail domain fell below 25 percent, meaning they solved the same task in all eight of eight trials less than a quarter of the time.
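
As a rough illustration of how pass^k behaves, the sketch below estimates it from repeated trials. It assumes each task was run n times with c recorded successes, and it uses the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, which is the "all k succeed" counterpart of the familiar pass@k estimator. The function name pass_hat_k and the success counts are hypothetical and are not taken from the benchmark's code or reported results.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k trials drawn from the
    n_trials recorded for a task are all successes, averaged over tasks.

    successes_per_task: number of successful trials (c) for each task.
    n_trials: trials run per task (n), assumed equal across tasks.
    """
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if any(c < 0 or c > n_trials for c in successes_per_task):
        raise ValueError("each success count must be in [0, n_trials]")

    # comb(c, k) is 0 when c < k, so a task with fewer than k
    # successes contributes nothing to pass^k.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: three tasks, eight trials each.
# One task always passes, one passes 6 of 8 times, one never passes.
counts = [8, 6, 0]

print(f"{pass_hat_k(counts, 8, 1):.3f}")  # 0.583 (the plain average pass rate)
print(f"{pass_hat_k(counts, 8, 8):.3f}")  # 0.333 (only the always-passing task counts)
```

With k = 1 the estimate reduces to the ordinary average pass rate. As k grows, the task that passes 6 of 8 times stops contributing, which is why pass^k can fall well below the average pass rate for an inconsistent agent.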

tau-bench landed at the moment companies were trying to deploy agents into real customer-facing workflows, where inconsistency is a dealbreaker. By measuring rule-following and repeatability rather than one-off cleverness, it reframed the agent-evaluation conversation around dependability, and pass^k became a widely cited way to talk about how trustworthy an agent really is.

Sources

Last verified June 7, 2026