BLEU: a Method for Automatic Evaluation of Machine Translation

“BLEU: a Method for Automatic Evaluation of Machine Translation” was presented by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu of IBM’s T. J. Watson Research Center at the 40th Annual Meeting of the Association for Computational Linguistics in Philadelphia in July 2002. The name stands for “bilingual evaluation understudy,” reflecting the authors’ framing of it as an automated stand-in for human judges.

The problem it addressed was the speed and cost of evaluation. Human evaluation of translation quality, the authors note, “can take months to finish and involve human labor that can not be reused,” which ruled out the rapid daily testing developers needed. BLEU offered a metric that was quick, cheap, language-independent, and, they argued, highly correlated with human evaluation.

The method compares a candidate translation against one or more high-quality human reference translations using modified n-gram precision. It counts how many of the candidate’s n-grams (sequences of n consecutive words) also appear in the references, but clips the count of each n-gram at the maximum number of times it occurs in any single reference, so a system cannot game the score by repeating a common word. Because precision alone would favor very short outputs, a brevity penalty lowers the score of translations that are too short. The paper’s central idea is simply stated: “The closer a machine translation is to a professional human translation, the better it is.”
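The clipping and brevity penalty described above can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the function names (ngrams, modified_precision, brevity_penalty) are hypothetical, tokenization is assumed to be plain whitespace splitting, and everything is computed for a single sentence, whereas the paper aggregates counts and lengths over a whole test corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count at that maximum before summing.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Sentence-level simplification: 1.0 if the candidate is longer than
    the closest reference length r, otherwise exp(1 - r / c)."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to c; ties go to the shorter reference
    # (an assumption of this sketch).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
cand = "the the the the the the the".split()

# "the" occurs at most twice in any single reference, so only 2 of the
# 7 candidate unigrams survive clipping.
print(modified_precision(cand, refs, 1))  # 2/7
```

Without clipping, every word in the degenerate candidate would match a reference and its unigram precision would be 7/7. With clipping it falls to 2/7, which is the behavior the method relies on to resist repeated common words.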

BLEU became the dominant evaluation metric for machine translation and a model for automatic evaluation across NLP, and it received the NAACL 2018 Test-of-Time Award. Its main limitation, that it rewards surface word overlap rather than meaning, later drove a search for better metrics, but for two decades a single BLEU number was how translation systems reported progress.

Last verified June 7, 2026