“Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” was submitted to arXiv on June 3, 2014 by Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Its headline contribution was a translation model built from two recurrent networks - an encoder that compresses a source phrase into a fixed-length vector and a decoder that expands that vector into a target phrase - but the piece of the paper that endured was a new recurrent hidden unit introduced alongside that model: the unit that follow-up work would name the gated recurrent unit, or GRU.
The GRU is a streamlined cousin of the Long Short-Term Memory cell. Where the LSTM uses three gates and a separate memory cell, the GRU uses just two gates - an update gate and a reset gate - and folds the memory back into the hidden state. The update gate decides how much of the previous hidden state to keep versus how much new information to write; the reset gate decides how much of the previous hidden state to ignore when forming a candidate update. With fewer gates the GRU has fewer parameters and is slightly cheaper to run, while still mitigating the vanishing-gradient problem that the gating was designed to address.
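The two gates described above can be sketched as a single time step in NumPy. This is a minimal illustration, not the authors' code: the parameter names (`W_z`, `U_z`, and so on), the bias terms, and the random initialization are assumptions made for the example, and the convention that the update gate `z` multiplies the old state is one of two in common use (some libraries swap the roles of `z` and `1 - z`).

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used for both gates."""
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(input_size, hidden_size, seed=0):
    """Hypothetical initializer: small random weights, zero biases (an assumption)."""
    rng = np.random.default_rng(seed)

    def mat(rows, cols):
        return rng.normal(0.0, 0.1, size=(rows, cols))

    params = {}
    for gate in ("z", "r", "h"):  # update gate, reset gate, candidate state
        params[f"W_{gate}"] = mat(hidden_size, input_size)   # input weights
        params[f"U_{gate}"] = mat(hidden_size, hidden_size)  # recurrent weights
        params[f"b_{gate}"] = np.zeros(hidden_size)          # bias
    return params


def gru_step(params, x, h_prev):
    """One GRU step: returns the new hidden state for input x and state h_prev."""
    # Update gate: how much of the previous state to keep.
    z = sigmoid(params["W_z"] @ x + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(params["W_r"] @ x + params["U_r"] @ h_prev + params["b_r"])
    # Candidate state, computed from the input and the reset-scaled past.
    h_cand = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * h_prev) + params["b_h"])
    # Interpolate between the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_cand


def run_gru(params, xs, h0):
    """Apply gru_step over a sequence of input vectors; returns the final state."""
    h = h0
    for x in xs:
        h = gru_step(params, x, h)
    return h
```

Note that there is no separate memory cell: the hidden state `h` is the only thing carried between steps. When `z` is close to 1 the previous state is copied through almost unchanged, which is the path that lets information and gradients survive across many time steps.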
A companion study by Junyoung Chung and colleagues from the same group, posted in December 2014 (“Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling,” arXiv 1412.3555), compared the GRU against the LSTM on polyphonic music and speech signal modeling tasks and found them roughly comparable, with neither clearly dominant. That rough parity, plus the GRU’s simplicity, made it a common default in the years when recurrent networks dominated sequence modeling, before attention and the Transformer displaced recurrence for large-scale language modeling.
The paper was also one of the first to set out the encoder-decoder framing that the attention mechanism would soon extend - the same Bahdanau who co-authored it was lead author of the attention paper, posted to arXiv about three months later - making this one of the quiet hinge points between the recurrent era and what came after.