“Neural Machine Translation of Rare Words with Subword Units” by Rico Sennrich, Barry Haddow, and Alexandra Birch (submitted August 31, 2015; published at ACL 2016) tackled a stubborn problem: neural translation systems work with a fixed vocabulary, but translation is an open-vocabulary task, so rare or unseen words cannot be represented directly. Earlier systems handled such words by backing off to a dictionary lookup. The authors instead proposed encoding rare and unknown words as sequences of smaller subword units.
Their argument was that “various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).” To find good subword units automatically, they adapted byte pair encoding (BPE), originally a data compression algorithm: starting from individual characters, it repeatedly merges the most frequent pair of adjacent symbols into a new, reusable unit. On the WMT 15 translation tasks, the subword models improved over a back-off dictionary baseline by 1.1 BLEU for English-German and 1.3 BLEU for English-Russian.
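The merge procedure is easier to follow in code. What follows is a minimal sketch, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the toy word counts, and the use of `</w>` as an end-of-word marker are all assumptions made for illustration.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Start from characters, with an end-of-word marker (an illustrative choice).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

def segment(word, merges):
    """Split a word, possibly unseen in training, by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy word counts, chosen only for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(toy_counts, num_merges=5)
print(segment("lowest", merges))  # ['low', 'est</w>']
```

In this toy run, "lowest" never appears in the training counts, yet it is still representable: it splits into two pieces that were each learned from other words. The number of merges is the main tuning knob in this sketch, since more merges produce longer units and a larger vocabulary.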
This paper matters because BPE became one of the standard ways modern language models break text into tokens. Many large models in use today, including the GPT family, rely on a BPE-style tokenizer descended from this idea, which is why a single uncommon word can cost several tokens: it is split into multiple subword pieces rather than stored as one vocabulary entry.