chrF: Character n-gram F-score for MT Evaluation

chrF is a machine translation evaluation metric introduced by Maja Popović in “chrF: character n-gram F-score for automatic MT evaluation,” presented at the Tenth Workshop on Statistical Machine Translation (WMT 2015) in Lisbon. Instead of matching whole words against a reference translation, chrF counts the character n-grams (short overlapping character sequences) that the system output and the reference share, computes precision and recall over those n-grams, and combines the two into an F-score, usually with recall weighted more heavily than precision.
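
The computation can be illustrated with a minimal Python sketch. It is a simplified, sentence-level approximation and not a reference implementation: the function names are hypothetical, and the defaults (n-grams up to order 6, beta = 2, whitespace ignored) are assumptions chosen to resemble common settings. Scores from established tools may differ because of smoothing, whitespace handling, and corpus-level aggregation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (assumption: whitespace is dropped)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in the range [0, 1]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        # Clipped overlap: each n-gram matches at most as often as it
        # occurs in both the hypothesis and the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    # Average precision and recall over the n-gram orders that were scored.
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Two word forms that share a stem but differ in a suffix.
    print(chrf("talossa", "talossani"))
```

In this sketch, two word forms that share a stem but differ in a suffix still receive partial credit, whereas an exact word match between them would count as a complete miss.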

The motivation is that word-level metrics like BLEU struggle with morphologically rich languages, which encode much of their grammar in word endings; Finnish and Turkish are typical examples. In such languages two correct translations can share meaning but differ in surface word forms, and a word-overlap metric gives no credit for a word that differs from the reference only in its inflection. By scoring at the character level, chrF gives partial credit for matching stems and other shared character sequences, so it tends to correlate better with human judgments across a wide range of languages.

chrF became a standard companion metric in WMT shared tasks and in low-resource translation research, often reported alongside or instead of BLEU for languages where word boundaries are a poor unit of comparison. A later variant, chrF++, adds word unigrams and bigrams to the character n-grams to improve correlation with human judgments further.

For teams evaluating translation into many languages, chrF matters because relying on BLEU alone can make a system look worse than it is in exactly the languages that are hardest to serve.
