“Neural Machine Translation by Jointly Learning to Align and Translate” was posted to arXiv in September 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. It introduced the attention mechanism to neural machine translation, the idea that underpins almost every modern language model. At the time, the leading neural approach to machine translation used one network, the encoder, to read a whole source sentence and compress it into a single fixed-length vector of numbers, which a second network, the decoder, then expanded into the translated sentence.
The authors argued that this fixed-length vector was the bottleneck. Forcing every sentence, short or long, through the same small summary lost information, and performance fell off sharply as sentences grew longer. Their solution was to let the model, when producing each output word, “automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word.” Instead of relying on one compressed snapshot, the model could look back over the entire input and weight each word by how relevant it was to the decision at hand. This learnable, selective focus is what came to be called attention.
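The weighting step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the additive scoring form (a tanh over the summed projections, followed by a dot product) follows the general shape of the paper's alignment model, but the function names, the toy two-dimensional vectors, and the identity matrices standing in for learned weights are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_attention(decoder_state, encoder_states, W, U, v):
    """Score every source position against the decoder state, then average.

    score_j = v . tanh(W @ decoder_state + U @ encoder_states[j])
    """
    projected_state = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one relevance weight per source word
    dim = len(encoder_states[0])
    # Context vector: weighted average of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy inputs (hypothetical values; in a real model W, U, and v are learned).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source word
decoder_state = [0.5, -0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]

weights, context = additive_attention(decoder_state, encoder_states, W, U, v)
```

The decoder would recompute `weights` and `context` for every output word, so each target word gets its own view of the source sentence instead of sharing one fixed summary.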
The results validated the idea. The attention-based model handled long sentences far better than the fixed-vector baseline. The alignments it learned, meaning the source words it focused on while producing each translated word, also matched the intuitions of a human translator mapping words between two languages. Attention turned out to be both more accurate and more interpretable.
Why this paper matters: three years later, in 2017, the Transformer took this add-on and made it the entire architecture, dispensing with the older recurrent machinery and building models out of attention alone. Nearly every large language model in use today traces its core mechanism back to this 2014 paper, which is why it is one of the most consequential works in the modern history of AI.