BigCodeBench moves code evaluation closer to real software work. Instead of self-contained puzzles, its tasks require a model to call functions from many different libraries and to follow detailed, realistic instructions, which is what programmers actually do when building applications.
The benchmark was introduced by Terry Yue Zhuo and a large team of collaborators in “BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions,” posted in June 2024 and accepted to ICLR 2025 as an oral presentation. It contains 1,140 fine-grained tasks drawing on 139 libraries across 7 domains. Each task is backed by about 5.6 test cases on average, and those tests reach an average branch coverage of 99 percent.
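To make the shape of such a task concrete, here is a minimal sketch of one written in the spirit of the benchmark. It is a hypothetical example, not taken from BigCodeBench: the task, the function body, and the tests are invented for illustration, and it uses only Python standard-library modules so it runs anywhere. Real tasks in the benchmark typically draw on third-party libraries as well.

```python
import csv
import io
import statistics
import unittest
from collections import defaultdict


def task_func(csv_text):
    """Return the mean 'amount' per 'category' from CSV text.

    Hypothetical task for illustration only. Rows with a missing or
    non-numeric amount are skipped.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            # Parse first, so a bad row never creates an empty group.
            amount = float(row["amount"])
            category = row["category"]
        except (KeyError, TypeError, ValueError):
            continue
        groups[category].append(amount)
    return {name: statistics.mean(values) for name, values in groups.items()}


class TestCases(unittest.TestCase):
    """Several small tests that exercise each branch of task_func."""

    def test_basic(self):
        text = "category,amount\nfood,10\nfood,20\nrent,500\n"
        self.assertEqual(task_func(text), {"food": 15.0, "rent": 500.0})

    def test_skips_bad_rows(self):
        text = "category,amount\nfood,abc\nfood,4\n"
        self.assertEqual(task_func(text), {"food": 4.0})

    def test_only_bad_rows(self):
        text = "category,amount\nfood,abc\n"
        self.assertEqual(task_func(text), {})

    def test_header_only(self):
        self.assertEqual(task_func("category,amount\n"), {})
```

Saved to a file, the tests can be run with `python -m unittest`. The point of the sketch is the pairing: a detailed instruction in the docstring, a solution that has to combine several library calls correctly, and a set of tests that covers the error-handling branch as well as the normal path.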
Evaluating 60 large language models, the authors found a large gap between machine and human performance: the best models scored up to 60 percent, while humans reached 97 percent. Their conclusion was that current models are not yet able to follow complex instructions while using function calls precisely.
For a business reader, BigCodeBench is a more honest measure than puzzle-style benchmarks of whether an AI assistant can handle production-style coding, where success depends on correctly wiring together existing tools and libraries.