“Semi-supervised sequence tagging with bidirectional language models” by Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power (submitted April 29, 2017; ACL 2017), often called TagLM, laid the groundwork for ELMo a year later. At the time, NLP systems routinely used pre-trained static word embeddings, but the recurrent networks that consumed them to build context-sensitive representations were trained on relatively little labeled data. TagLM asked whether unlabeled text could help more directly, by pre-training the context representations as well as the word vectors.
The answer was to add “pre-trained context embeddings from bidirectional language models” to a sequence-tagging network. A language model trained on a large unlabeled corpus produces token representations that depend on the surrounding sentence, and feeding those into a tagger alongside the usual word embeddings gives it richer context than fixed word vectors alone. The method reached state-of-the-art results on standard named entity recognition and chunking benchmarks, surpassing earlier systems that relied on other forms of transfer or joint learning with additional labeled data and task-specific gazetteers. That was strong evidence that transfer from a language model is an effective way to put unlabeled text to work.
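To make the idea concrete, here is a minimal toy sketch in plain Python, not the paper's implementation. The vocabulary, the 2-dimensional vectors in `STATIC`, and the helpers `running_mean`, `lm_context_embeddings`, and `tagger_inputs` are all hypothetical. A running average stands in for each direction of a frozen language model, and the context embedding is simply concatenated with the static vector at the tagger's input, which is a simplification and not necessarily where the paper injects it.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical static embeddings: one fixed vector per word type.
STATIC: Dict[str, Vector] = {
    "the": [0.1, 0.0],
    "bank": [0.5, 0.5],
    "river": [0.0, 1.0],
    "loan": [1.0, 0.0],
    "approved": [0.9, 0.2],
    "flooded": [0.1, 0.8],
}


def running_mean(vectors: List[Vector]) -> List[Vector]:
    """Toy stand-in for one direction of a frozen language model.

    The state at position i is the mean of vectors[0..i], so it depends
    on everything read so far in that direction.
    """
    states: List[Vector] = []
    total = [0.0] * len(vectors[0])
    for count, vec in enumerate(vectors, start=1):
        total = [t + x for t, x in zip(total, vec)]
        states.append([t / count for t in total])
    return states


def lm_context_embeddings(tokens: List[str]) -> List[Vector]:
    """Concatenate left-to-right and right-to-left states for each token."""
    static = [STATIC[tok] for tok in tokens]
    forward = running_mean(static)
    backward = running_mean(static[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


def tagger_inputs(tokens: List[str]) -> List[Vector]:
    """Per-token tagger input: static vector followed by context embedding."""
    static = [STATIC[tok] for tok in tokens]
    context = lm_context_embeddings(tokens)
    return [s + c for s, c in zip(static, context)]


if __name__ == "__main__":
    river_sentence = tagger_inputs(["the", "river", "bank", "flooded"])
    loan_sentence = tagger_inputs(["the", "bank", "approved", "loan"])
    # Same static part for "bank", different context part in each sentence.
    print(river_sentence[2])
    print(loan_sentence[1])
```

In both sentences the first two numbers for "bank" are identical, because they come from the static lookup, while the remaining four differ, because they come from the surrounding words. A real system would replace `running_mean` with the hidden states of trained forward and backward language models and pass the result to a trained tagger.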
TagLM matters because it bridges static embeddings like word2vec and the contextual embeddings that ELMo, and later BERT, made famous. It demonstrated the core insight that would soon reshape the field: a word’s representation should change with its context.