Attention is a mechanism that lets a neural network decide, for each part of its output, which parts of the input matter most. Instead of compressing an entire input into one fixed summary and hoping nothing important is lost, an attention-equipped model can look back over the whole input and weight each piece by how relevant it is to the decision at hand. It is, in plain terms, a learnable spotlight.
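For readers who want to see the spotlight in concrete terms, here is a minimal sketch of the weighting step. It assumes one common way of measuring relevance (a dot product between a "query" and a "key" for each input piece), which is only one of several scoring choices. The function names `softmax` and `attend` and the toy numbers are hypothetical and chosen purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight every input piece by its relevance to the query, then blend."""
    # Assumption: relevance is the dot product of the query with each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The output is a weighted average of the values, one number per dimension.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical toy input: three pieces, each with a key and a value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # what the model is "looking for" right now

weights, output = attend(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output:", [round(x, 3) for x in output])
```

The second input piece has nothing in common with the query, so it receives the smallest weight and contributes least to the blended output. In a real model the queries, keys, and values are learned during training, which is what makes the spotlight "learnable."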
The idea predates the Transformer. It was introduced in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in “Neural Machine Translation by Jointly Learning to Align and Translate.” The authors conjectured that forcing a translation system to squeeze a whole sentence into a single fixed-length vector “is a bottleneck,” and proposed instead that the model “automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word.” When the model translated a word, it could attend back to the relevant source words rather than relying on one compressed snapshot. They reported that the resulting alignments “agree well with our intuition,” much as a human translator maps words between languages.
That insight became the seed of modern language models. Three years later, in 2017, the Transformer architecture turned attention from a helpful add-on into the foundation of the entire model, dispensing with the older recurrent machinery altogether. Nearly every large language model today is built on attention.
Why business readers should care: attention is one of the core ideas that made today’s AI possible, and the word turns up constantly in technical descriptions of models. Understanding it as “a learnable way to focus on what is relevant” demystifies a great deal of the jargon and helps explain why these models handle context and long-range relationships so much better than their predecessors.