“Improving Neural Machine Translation Models with Monolingual Data,” by Rico Sennrich, Barry Haddow, and Alexandra Birch of the University of Edinburgh, was published at the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 86-96. It introduced back-translation, a now-standard trick for squeezing more out of limited translation data.
Neural translation models need parallel data - sentences paired with their human translations - and such data is scarce, especially for less common languages. Monolingual text in the target language, by contrast, is plentiful. The paper’s idea was to take that abundant monolingual text and translate it back into the source language with a model trained in the reverse (target-to-source) direction, creating synthetic sentence pairs. Each pair has a machine-generated, possibly flawed source sentence and a genuine, human-written target sentence, so the model gets far more examples of correct target-language output to learn from.
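The data flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `back_translate`, `build_training_set`, and `toy_reverse_translate` are hypothetical, and the word-for-word lookup table merely stands in for a trained German-to-English model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each real target sentence with a machine-generated source sentence."""
    return [(reverse_translate(t), t) for t in target_monolingual]


def build_training_set(parallel: List[Pair], synthetic: List[Pair]) -> List[Pair]:
    """Combine real and synthetic pairs into one training set.

    Assumption: plain concatenation; a real pipeline would also shuffle
    and might control the real-to-synthetic ratio.
    """
    return parallel + synthetic


# Hypothetical stand-in for a trained target-to-source (German-to-English)
# model: a word-for-word lookup, used only so the sketch runs on its own.
TOY_DE_EN = {"das": "the", "haus": "house", "ist": "is", "klein": "small"}


def toy_reverse_translate(sentence: str) -> str:
    return " ".join(TOY_DE_EN.get(w.lower(), w) for w in sentence.split())


# Real parallel data for an English-to-German system (source, target).
parallel = [("the cat sleeps", "die Katze schläft")]
# Monolingual text in the target language (German).
monolingual_de = ["das Haus ist klein"]

synthetic = back_translate(monolingual_de, toy_reverse_translate)
train = build_training_set(parallel, synthetic)
```

Here `synthetic` holds the pair `("the house is small", "das Haus ist klein")`: the English side is generated, while the German side is the original human-written sentence. The forward English-to-German model is then trained on `train` as if every pair were real parallel data.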
The gains were substantial. The authors reported improvements of +2.8 to +3.7 BLEU on the WMT 15 English-German task (in both directions) and +2.1 to +3.4 BLEU on the lower-resource IWSLT 14 Turkish-to-English task, simply by augmenting real parallel data with back-translated synthetic data. The technique worked because the decoder learns better target-language fluency from the extra well-formed sentences, without needing a separate language model bolted on.
Back-translation became one of the most widely used data-augmentation methods in machine translation and a key ingredient in large multilingual systems, including efforts to support low-resource languages where parallel data is most scarce. It is a clean example of a broader pattern in AI - using model-generated output as training data. It is also a useful counterpoint to model collapse, the failure mode where training on synthetic data degrades quality: here the target side of every synthetic pair is still genuine human text, which shows that synthetic data can help when used carefully.