“Gorilla: Large Language Model Connected with Massive APIs,” posted to arXiv on May 24, 2023 by Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez of UC Berkeley and Microsoft Research, tackled a narrow but practical problem. Language models are good at describing what code should do but unreliable at producing the exact API call needed, and they often invent functions or arguments that do not exist. Gorilla is a LLaMA-based model fine-tuned specifically to generate correct API calls.
To measure the problem, the authors built APIBench, an evaluation set of real machine-learning APIs collected from HuggingFace, TorchHub, and TensorFlow Hub. On this benchmark the fine-tuned Gorilla model wrote API calls more accurately than GPT-4 and produced fewer hallucinated calls. A key part of the design is a document retriever that fetches the current API documentation at inference time and supplies it alongside the user's request, so when an API changes its signature the model can adapt instead of relying on stale memorized knowledge.
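The retrieval step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the documentation entries, the `load_model` call strings, the helper names, the wording of the appended instruction, and the word-overlap scoring, which stands in for whatever retriever a real system would use. It shows only the general pattern of looking up the most relevant documentation at request time and attaching it to the prompt.

```python
import json
import re

# Hypothetical, hand-written documentation entries; not taken from APIBench.
API_DOCS = [
    {
        "api_name": "image_classifier",
        "api_call": "load_model('image_classifier', pretrained=True)",
        "description": "Classify an image into one of many object categories.",
    },
    {
        "api_name": "speech_to_text",
        "api_call": "load_model('speech_to_text', language='en')",
        "description": "Transcribe spoken audio into written text.",
    },
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs):
    """Return the doc whose description shares the most words with the query."""
    query_words = tokenize(query)
    return max(docs, key=lambda d: len(query_words & tokenize(d["description"])))


def build_prompt(query, docs):
    """Append the retrieved documentation to the user's request."""
    doc = retrieve(query, docs)
    return f"{query}\nUse this API documentation for reference: {json.dumps(doc)}"


prompt = build_prompt("Transcribe this audio recording into text", API_DOCS)
print(prompt)
```

Because the documentation is looked up on every request, editing an entry in `API_DOCS` changes the next prompt without retraining anything, which is the property the paragraph above attributes to the retriever.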
Gorilla matters because connecting a model to thousands of tools is only useful if the calls it emits actually run. The paper showed that targeted fine-tuning plus retrieval can make a smaller open model more reliable than a much larger general one at the specific job of calling tools correctly, a result that shaped how later agent systems handle large tool libraries. For a business, that reliability is the difference between an agent that completes a workflow and one that fails on the first malformed request.
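To make the idea of a call that "actually runs" concrete, the sketch below checks a generated call against a catalogue of known functions before anything is executed. It is a simplified illustration, not the evaluation procedure used for APIBench; the catalogue, the `load_model` function name, and the `check_call` helper are all hypothetical.

```python
import ast

# Hypothetical catalogue: function name -> keyword arguments it accepts.
KNOWN_APIS = {
    "load_model": {"pretrained", "language"},
}


def check_call(source, known_apis):
    """Return a list of problems found in a single generated call expression."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return ["not valid Python"]
    call = tree.body
    # Only plain calls such as name(...) are handled in this sketch.
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return ["not a simple function call"]
    name = call.func.id
    if name not in known_apis:
        return [f"unknown function: {name}"]
    problems = []
    for kw in call.keywords:
        # kw.arg is None for **kwargs unpacking, which this sketch skips.
        if kw.arg is not None and kw.arg not in known_apis[name]:
            problems.append(f"unknown argument: {kw.arg}")
    return problems


print(check_call("load_model('speech_to_text', language='en')", KNOWN_APIS))
print(check_call("load_model('speech_to_text', quantize=True)", KNOWN_APIS))
print(check_call("fetch_model('speech_to_text')", KNOWN_APIS))
```

A check like this catches invented function names and keyword arguments, but it says nothing about whether the chosen API is the right one for the task, so it complements rather than replaces the accuracy measurements described earlier.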