“Deep contextualized word representations” was submitted to arXiv in February 2018 and presented at NAACL 2018 by Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer of the Allen Institute for AI and the University of Washington. It introduced ELMo, short for Embeddings from Language Models.
The key shift was contextual representation. Static embeddings such as word2vec and GloVe assign each word a single fixed vector, so the “bank” of a river and a financial “bank” share one representation. ELMo instead derives a word’s vector from the internal states of a deep bidirectional language model (a multi-layer bidirectional LSTM) run over the whole sentence, so the same word gets different vectors depending on how it is used. Because the vector draws on every layer of the model, it can capture both characteristics of word use, such as syntax and semantics, and the way meaning shifts with context.
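How "internal states" become one vector per token can be illustrated with a small sketch. In the ELMo paper, the per-layer states for a token are combined with softmax-normalized, task-specific layer weights and a single scaling factor. The pure-Python code below is only a toy rendering of that idea: the function names `softmax` and `scalar_mix`, the two-dimensional vectors, and the equal layer weights are all assumptions made for illustration, not the released implementation or real model outputs.

```python
import math


def softmax(xs):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Collapse the per-layer vectors for one token into a single vector.

    layer_states: one vector per language-model layer, all the same length.
    layer_logits: one unnormalized weight per layer (learned per task in ELMo).
    gamma: scalar that rescales the whole mixed vector.
    """
    if len(layer_states) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    mixed = [0.0] * len(layer_states[0])
    for w, state in zip(weights, layer_states):
        for i, value in enumerate(state):
            mixed[i] += w * value
    return [gamma * v for v in mixed]


# Made-up layer states for the token "bank" in two different sentences.
# Layer 0 is context-independent, so it is the same in both sentences;
# the upper layers differ because they depend on the surrounding words.
bank_river = [[1.0, 0.0], [0.2, 0.8], [0.0, 1.0]]
bank_money = [[1.0, 0.0], [0.8, 0.2], [1.0, 0.0]]

logits = [0.0, 0.0, 0.0]  # equal logits -> plain average of the layers
vec_river = scalar_mix(bank_river, logits)
vec_money = scalar_mix(bank_money, logits)

print("river bank:", vec_river)
print("money bank:", vec_money)
```

Although both sentences share the same bottom-layer vector, the mixed vectors differ, which is the behavior a static embedding table cannot reproduce. In a real system the logits and `gamma` would be trained together with the downstream task model rather than fixed by hand.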
Dropped into existing task models, ELMo embeddings improved results across six benchmark NLP tasks, including question answering, textual entailment, and sentiment analysis. ELMo was a major step in the move from static word vectors to contextual, pre-trained language representations, a transition completed later in 2018 by the Transformer-based BERT, which made deep bidirectional pre-training the standard.