On November 22, 2016, Google researchers Mike Schuster, Melvin Johnson, and Nikhil Thorat described a multilingual translation system in which a single neural network handled many language pairs at once. The announcement accompanied the paper “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation” (arXiv 1611.04558). The mechanism was strikingly simple: they prepended an artificial token to each input sentence to indicate the desired target language, and otherwise left the architecture unchanged, with the encoder, decoder, and attention shared across all languages.
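The token mechanism can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the <2xx> token format follows the examples given in the paper, while the function names and the set of trained directions are hypothetical and chosen only to mirror the Japanese, Korean, and English setup described here.

```python
# Hypothetical set of translation directions seen during training:
# every pair has English on one side.
TRAINED_DIRECTIONS = {("ja", "en"), ("en", "ja"), ("ko", "en"), ("en", "ko")}


def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token such as <2es> to the source sentence.

    The source language is not marked; only the desired output language is.
    """
    return f"<2{target_lang}> {source_sentence}"


def is_zero_shot(source_lang: str, target_lang: str) -> bool:
    """Return True if this direction never appeared in the training data."""
    return (source_lang, target_lang) not in TRAINED_DIRECTIONS


# A request the shared model has training data for: English to Japanese.
seen_input = add_target_token("Hello, how are you?", "ja")

# A zero-shot request: the input is a Japanese sentence, but the token
# asks for Korean, a direction absent from TRAINED_DIRECTIONS.
unseen_input = add_target_token("こんにちは", "ko")

print(seen_input)
print(unseen_input, "(zero-shot)" if is_zero_shot("ja", "ko") else "")
```

Because the only signal is a token in the input text, adding a language changes the training data rather than the model architecture, which is what allows one shared network to serve every direction.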
The headline result was zero-shot translation. A model trained on Japanese-English and Korean-English pairs could translate directly between Japanese and Korean, a pair it had never seen translated during training. To the authors’ knowledge, this was the first time this kind of transfer learning had worked in machine translation: knowledge gained on some language pairs carried over to others without any additional training data for them.
Most intriguingly, the team found evidence that the model had built a shared internal representation of meaning. When they visualized how the network encoded sentences, sentences with the same meaning from different languages clustered together. As they put it, “the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations.” This looked like an emergent version of the interlingua, the language-independent representation at the apex of the Vauquois triangle described decades earlier, except learned automatically rather than hand-designed.
The work shifted machine translation toward massively multilingual single models, the design that would later scale to systems like Meta’s No Language Left Behind covering 200 languages. It also became an early, concrete example of a theme that now runs through large language models: train one model on many tasks or languages together, and it learns shared structure that lets it do things no single training example taught it.