API-Bank was one of the first benchmarks built to measure tool use rigorously. It was introduced in the paper “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs,” posted to arXiv on April 14, 2023 by Minghao Li, Yingxiu Zhao, Bowen Yu, and colleagues, and later published at EMNLP 2023. Rather than judging tool calls on paper, it provides a runnable evaluation system of 73 real API tools, so an agent’s calls actually execute and can be checked against real outcomes.
The benchmark is organized around three questions an effective tool user must answer: when to call an API, which API to call, and how to call it correctly with the right arguments. To test these, it includes 314 annotated dialogues containing 753 API calls. The authors also released a larger training set of 1,888 tool-use dialogues drawn from 2,138 APIs across roughly a thousand domains, and used it to train Lynx, a tool-augmented model initialized from Alpaca. The paper reports that Lynx beat Alpaca's tool-use performance by more than 26 points and approached that of GPT-3.5.
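The three questions can be made concrete with a small scoring sketch. What follows is an illustration only, not API-Bank's actual evaluation code or metric: the two toy tools, the `ApiCall` type, and the `score_turn` function are hypothetical names invented here, and the real benchmark's APIs and scoring rules differ. The idea it shows is that a predicted call is executed and its result compared with the result of the annotated call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class ApiCall:
    """A single tool call: the API name plus keyword arguments."""
    name: str
    args: Dict[str, Any]


# Hypothetical toy tools; these are NOT taken from API-Bank's 73 APIs.
def add_agenda(date: str, title: str) -> str:
    return f"added '{title}' on {date}"


def query_weather(city: str) -> str:
    return f"sunny in {city}"


TOOLS: Dict[str, Callable[..., Any]] = {
    "add_agenda": add_agenda,
    "query_weather": query_weather,
}


def execute(call: ApiCall) -> Any:
    """Run a call against the toy registry; return None if it cannot run."""
    tool = TOOLS.get(call.name)
    if tool is None:
        return None  # unknown API name
    try:
        return tool(**call.args)
    except TypeError:
        return None  # wrong or missing argument names


def score_turn(predicted: Optional[ApiCall],
               gold: Optional[ApiCall]) -> Dict[str, bool]:
    """Score one dialogue turn on when / which / how (assumed scheme)."""
    # When: did the model call an API exactly when the annotation does?
    when_ok = (predicted is None) == (gold is None)
    if predicted is None or gold is None:
        return {"when": when_ok, "which": when_ok, "how": when_ok}
    # Which: did it pick the annotated API?
    which_ok = predicted.name == gold.name
    # How: does the call actually run and produce the annotated outcome?
    result = execute(predicted)
    how_ok = which_ok and result is not None and result == execute(gold)
    return {"when": when_ok, "which": which_ok, "how": how_ok}
```

In this sketch, a call to the right API with a misspelled argument name passes "when" and "which" but fails "how", because the call does not execute. That is the distinction a runnable benchmark can draw and a purely textual comparison can miss.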
API-Bank matters because it pushed evaluation past whether a model produces plausible-looking calls toward whether those calls actually work when run. That runnable, end-to-end approach has since become a common expectation for serious agent benchmarks. For a business reader, it reflects the practical bar that matters in production: not that an assistant sounds like it is using your systems correctly, but that the requests it sends them actually succeed.