Unsupervised Machine Translation Using Monolingual Corpora Only

“Unsupervised Machine Translation Using Monolingual Corpora Only,” posted to arXiv on October 31, 2017 by Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato, asked a striking question: can a system learn to translate between two languages without ever seeing a single matched pair of sentences? The answer was yes.

The method maps sentences from both languages into a shared latent space and trains the model to reconstruct text in each language from that common representation. Denoising and iterative back-translation let it bootstrap a translator from monolingual data alone. The paper reported BLEU scores of 32.8 on Multi30k and 15.1 on WMT English-French “without using even a single parallel sentence at training time,” far above earlier word-by-word baselines.
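As a rough illustration of the denoising half of that recipe, the sketch below corrupts a sentence by dropping words at random and shuffling the rest locally. A model would then be trained to reconstruct the clean sentence from the corrupted one. The function name `add_noise`, its parameters, and the default values are assumptions chosen for illustration, not the authors' code.

```python
import random


def add_noise(tokens, p_drop=0.1, k=3, rng=None):
    """Corrupt a token list: random word dropout, then a bounded local shuffle.

    p_drop and k are illustrative defaults (assumptions), not settings taken
    from the article.
    """
    rng = rng or random.Random()
    if not tokens:
        return []

    # Word dropout: remove each token independently with probability p_drop.
    kept = [t for t in tokens if rng.random() >= p_drop]
    if not kept:
        # Keep at least one token so the sentence never becomes empty.
        kept = [rng.choice(tokens)]

    # Local shuffle: add a uniform offset in [0, k + 1] to each position and
    # sort by the jittered key. No token ends up more than k positions away
    # from where it started.
    keys = [i + rng.uniform(0, k + 1) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=keys.__getitem__)
    return [kept[i] for i in order]


if __name__ == "__main__":
    sentence = "the cat sat on the mat near the door".split()
    print(add_noise(sentence, rng=random.Random(0)))
```

Training on pairs of (corrupted, clean) sentences discourages the model from simply copying its input word by word, so it has to learn something about sentence structure in each language. Back-translation then supplies the cross-language signal: the current model translates a sentence, and the noisy translation is used as input for reconstructing the original.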

This mattered because high-quality parallel corpora exist for only a small number of language pairs, while raw monolingual text is abundant in many more languages. Unsupervised translation pointed to a path for the thousands of languages that lack the bilingual data that traditional systems depend on, and the techniques fed directly into later cross-lingual models like XLM and large multilingual translation systems.

For businesses, the durable insight is that valuable capabilities can sometimes be built from cheap, plentiful, unlabeled data when the expensive labeled data does not exist, which is the normal situation for most of the world’s languages.
