Aider Polyglot Benchmark

The Aider polyglot benchmark is an evaluation maintained by the Aider open-source coding assistant project. It uses 225 of the most challenging Exercism coding exercises across six programming languages: C++, Go, Java, JavaScript, Python, and Rust. Rather than testing general coding knowledge, it focuses on a specific practical skill: whether a model can correctly edit existing code to satisfy a task without human intervention.

Because Aider works by having a model produce edits to files, the benchmark measures more than correctness. It tracks the overall pass rate, the edit-format accuracy (whether the model formats its changes in a way the tool can apply), the rate of malformed responses, and cost and token usage per task. This captures a failure mode that pure code-generation benchmarks miss: a model can write correct code yet still fail because it cannot reliably express the change as an applicable edit.
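The distinction between writing correct code and expressing it as an applicable edit can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the `TaskResult` fields, the `apply_search_replace` helper, and the sample costs and token counts are hypothetical. They are not Aider's actual data model, edit parser, or benchmark results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical per-exercise record; field names are illustrative only."""
    tests_passed: bool      # did the edited code pass the exercise's tests?
    edit_well_formed: bool  # could the tool apply the model's edit?
    cost_usd: float
    tokens: int


def apply_search_replace(source: str, search: str, replace: str) -> Optional[str]:
    """Apply one search/replace edit; return None if the search text is absent.

    A simplified stand-in for an edit format, not Aider's real parser.
    """
    if search not in source:
        return None  # the edit cannot be applied, however good the new code is
    return source.replace(search, replace, 1)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task records into the kinds of metrics described above."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "pass_rate": sum(r.tests_passed for r in results) / n,
        "well_formed_rate": sum(r.edit_well_formed for r in results) / n,
        "avg_cost_usd": sum(r.cost_usd for r in results) / n,
        "avg_tokens": sum(r.tokens for r in results) / n,
    }


original = "def add(a, b):\n    return a - b\n"

# The fix is right, but the search text misquotes the file, so nothing applies.
bad_edit = apply_search_replace(original, "return a-b", "return a + b")

# The same fix, quoted exactly, applies cleanly.
good_edit = apply_search_replace(original, "return a - b", "return a + b")

# Made-up costs and token counts, one record per attempt above.
results = [
    TaskResult(tests_passed=False, edit_well_formed=bad_edit is not None,
               cost_usd=0.02, tokens=1200),
    TaskResult(tests_passed=True, edit_well_formed=good_edit is not None,
               cost_usd=0.04, tokens=1800),
]
summary = summarize(results)
```

In this sketch both attempts contain the same correct fix, yet only the second counts as a pass, because the first never reaches the file. Reporting the well-formed rate next to the pass rate is what makes that kind of failure visible.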

The polyglot benchmark's public leaderboard became widely cited for comparing frontier models as coding agents. For businesses choosing an AI coding tool, it is a useful reality check, because day-to-day software work consists mostly of editing existing code across several languages rather than writing isolated functions from scratch.
