COMET is a neural metric for evaluating machine translation (MT), introduced in “COMET: A Neural Framework for MT Evaluation,” posted to arXiv on September 18, 2020 by Ricardo Rei, Craig Stewart, Ana C. Farinha, and Alon Lavie. It was published at EMNLP 2020 and was designed to correlate more closely with human quality judgments than traditional metrics do.
The long-standing problem with metrics like BLEU is that they count surface word overlap with a reference translation, so they penalize valid paraphrases and reward literal matches that may read poorly. COMET instead uses a cross-lingual pretrained language model to encode the source sentence, the candidate translation, and a human reference together, and predicts a quality score with a model trained on human assessment data. On the WMT 2019 Metrics shared task it achieved state-of-the-art correlation with human ratings.
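The overlap problem can be seen in a minimal Python sketch. It computes only clipped unigram precision, the counting statistic at the core of BLEU, and is a deliberate simplification: real BLEU also combines higher-order n-grams and a brevity penalty. The function name `ngram_precision` and the example sentences are illustrative assumptions, not taken from the paper or from any library.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of a candidate against a single reference.

    Simplified illustration only: no brevity penalty and no averaging
    over several n-gram orders, both of which full BLEU applies.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # Count every n-gram in the candidate and in the reference.
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))

    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0

    # A candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return matched / total


# Hypothetical example sentences.
reference = "the cat sat on the mat"
paraphrase = "a feline was sitting on the rug"  # same meaning, different words
scrambled = "mat the on sat cat the"            # same words, unreadable order

print(ngram_precision(paraphrase, reference))
print(ngram_precision(scrambled, reference))
```

Under these assumptions the scrambled sentence outscores the paraphrase, because unigram counting sees only which words match. Higher-order n-grams would penalize the scrambled word order, but they would not recover any credit for the paraphrase, and that remaining gap is what a learned metric is meant to close.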
COMET, alongside other learned metrics such as BLEURT, marked a shift in how the field measures translation quality: from counting matching words toward learned models that judge meaning. It is now a standard reporting metric in WMT shared tasks and in industry MT evaluation.
For anyone deploying translation, the takeaway is that automatic quality numbers are only as trustworthy as the metric behind them, and a learned metric like COMET tracks what humans actually prefer more closely than older overlap scores.