ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

“ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs,” posted to arXiv on July 31, 2023 by Yujia Qin, Shihao Liang, and a large team mostly from Tsinghua University, set out to close the tool-use gap between proprietary models like ChatGPT and open-source models. The headline number is scale: the work collected 16,464 real-world REST APIs spanning 49 categories from the RapidAPI hub, far more than earlier tool-use datasets covered.

The paper has three connected pieces. ToolBench is the instruction-tuning dataset, built in three stages: collecting the APIs, using ChatGPT to generate realistic instructions that call for them, and using ChatGPT again to annotate a solution path for each instruction. ToolLLaMA is a LLaMA model fine-tuned on ToolBench and paired with a neural API retriever that picks relevant tools from the huge library. ToolEval is an automatic evaluation framework for judging tool-use quality. To handle hard multi-step tasks, the authors also introduced a depth-first search-based decision tree (DFSDT), which lets the model explore several action sequences and back out of dead ends rather than committing to the first plan.
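The backtracking idea behind that search can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper a language model proposes each next API call, whereas here a hard-coded lookup table (TOY_TREE) stands in for the model, and the API names, the "finish" action, and the max_depth budget are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Action = str
State = Tuple[Action, ...]  # the actions taken so far


def dfs_solve(
    state: State,
    propose: Callable[[State], Sequence[Action]],
    is_solved: Callable[[State], bool],
    max_depth: int,
) -> Optional[List[Action]]:
    """Return the first action sequence that reaches a solved state, else None."""
    if is_solved(state):
        return []
    if len(state) >= max_depth:
        return None  # depth budget exhausted: treat as a dead end
    for action in propose(state):
        rest = dfs_solve(state + (action,), propose, is_solved, max_depth)
        if rest is not None:
            return [action] + rest
    return None  # every candidate failed, so the caller tries its next option


# Hypothetical stand-in for the model: candidate next actions per state.
TOY_TREE = {
    (): ["search_flights", "search_hotels"],
    ("search_flights",): ["book_flight"],
    ("search_flights", "book_flight"): [],  # no way forward: dead end
    ("search_hotels",): ["book_hotel"],
    ("search_hotels", "book_hotel"): ["finish"],
}


def propose(state: State) -> Sequence[Action]:
    return TOY_TREE.get(state, [])


def is_solved(state: State) -> bool:
    return bool(state) and state[-1] == "finish"


plan = dfs_solve((), propose, is_solved, max_depth=4)
print(plan)  # ['search_hotels', 'book_hotel', 'finish']
```

The search first follows search_flights, finds that branch has no continuation, and only then backs out to try search_hotels. A planner that committed to its first proposal would have stopped at the dead end.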

ToolLLM matters because it pushed tool use from a few hand-picked utilities toward the messy reality of thousands of third-party APIs, complete with authentication, varied schemas, and frequent failures. For organizations, that is the regime that counts in practice: real agents must navigate large, imperfect catalogs of services, and ToolBench gave the open-source community a concrete way to train for that setting and measure progress in it.

Sources

Last verified June 7, 2026