“METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments,” by Satanjeev Banerjee and Alon Lavie of Carnegie Mellon University, was presented at the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization in Ann Arbor, Michigan, in June 2005 (pages 65-72). It was a direct response to the limitations of BLEU, the dominant automatic translation metric, and aimed to score machine translations in a way that agreed more closely with the assessments of human judges.
BLEU rewards exact n-gram overlap between a candidate translation and human references. METEOR loosened that strictness. It matched words not only by their exact surface form but also by their stemmed form (so “running” matches “run”) and by meaning through WordNet synonym lookup, so a translation that chose a different but valid word was not unfairly penalized. It then combined unigram precision and recall, with recall weighted more heavily, and added a penalty for output whose matched words were fragmented or out of order. The authors showed that the resulting score correlated more closely with human judgment than BLEU, especially at the level of individual sentences rather than whole test sets.
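The scoring steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: `toy_stem` and `TOY_SYNONYMS` are hypothetical stand-ins for the Porter stemmer and WordNet, the greedy left-to-right alignment replaces the paper's search for the alignment with the fewest crossings, and only a single reference is handled. The constants (recall weighted nine times as heavily as precision, and a penalty of at most 0.5 that grows with the cube of the chunk-to-match ratio) follow the formulas in the 2005 paper.

```python
# Hypothetical synonym table standing in for WordNet lookup.
TOY_SYNONYMS = {frozenset(("sofa", "couch"))}


def toy_stem(word):
    """Crude suffix stripper; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def align(candidate, reference):
    """Map candidate positions to reference positions in three stages:
    exact match, then stem match, then synonym match.
    Assumption: greedy first-available matching, not the paper's
    crossing-minimising alignment."""
    stages = [
        lambda c, r: c == r,
        lambda c, r: toy_stem(c) == toy_stem(r),
        lambda c, r: frozenset((c, r)) in TOY_SYNONYMS,
    ]
    mapping = {}   # candidate index -> reference index
    used = set()   # reference indices already matched
    for same in stages:
        for i, c in enumerate(candidate):
            if i in mapping:
                continue
            for j, r in enumerate(reference):
                if j not in used and same(c, r):
                    mapping[i] = j
                    used.add(j)
                    break
    return mapping


def count_chunks(mapping):
    """Count runs of matches that are adjacent in both sentences."""
    chunks = 0
    prev_i = prev_j = None
    for i in sorted(mapping):
        j = mapping[i]
        if prev_i is None or i != prev_i + 1 or j != prev_j + 1:
            chunks += 1
        prev_i, prev_j = i, j
    return chunks


def meteor_sketch(candidate, reference):
    """Recall-weighted harmonic mean, discounted by a fragmentation penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    mapping = align(cand, ref)
    matches = len(mapping)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    fmean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (count_chunks(mapping) / matches) ** 3
    return fmean * (1 - penalty)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(meteor_sketch("the cat sat on the mat", ref))
    print(meteor_sketch("mat the on sat cat the", ref))
```

In this sketch, a candidate identical to the reference scores just under 1, because even a single chunk incurs a small penalty, while the same six words in scrambled order form six separate chunks and score 0.5. A candidate that says "sofa" where the reference says "couch" is still credited with a full match through the synonym stage.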
The metric was designed to address specific weaknesses critics had identified in BLEU: that it ignored recall, had no notion of synonymy, and worked poorly on single sentences. By incorporating stemming and synonym matching, METEOR moved automatic evaluation a step closer to judging meaning rather than surface word overlap.
METEOR became one of the standard metrics reported alongside BLEU in machine translation research and was extended to many languages over the following years. Together, BLEU and METEOR illustrate a central challenge in AI: progress depends on cheap automatic measures of quality, yet every such measure is an imperfect stand-in for human judgment, and the search for better metrics is never quite finished.