FastText: Enriching Word Vectors with Subword Information

FastText, introduced in “Enriching Word Vectors with Subword Information” by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov (submitted to arXiv on July 15, 2016), changed how words are turned into vectors. Earlier embedding methods such as word2vec assigned one distinct vector to each whole word and, as the authors note, “ignore the morphology of words.” That is a problem for morphologically rich languages, where one word can take many inflected forms that are each rare in the training data, and for any word the model never saw during training.

The key idea is to represent each word as “a bag of character n-grams.” Each short character sequence gets its own vector, and a word’s vector is the sum of the vectors of its n-grams. Because the building blocks are subword pieces, the method can compute a representation for an out-of-vocabulary word by combining the character fragments it did see during training. The authors report that the approach is “fast, allowing to train models on large corpora quickly” and that it reaches state-of-the-art results on word similarity and analogy tasks across nine languages.
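
The bag-of-n-grams idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the n-gram lengths (3 to 6) and the < and > word-boundary markers follow the setup described in the paper, while the names char_ngrams, word_vector, and NGRAM_TABLE, the CRC32 hash, the table size, and the vector size are assumptions made for the example. The vectors are random stand-ins for values that would be learned in training, and the sketch leaves out the extra vector the paper assigns to the whole word.

```python
import random
import zlib

DIM = 8          # illustrative vector size (assumption)
BUCKETS = 1000   # illustrative hash-table size (assumption)


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


# Untrained stand-in for learned n-gram vectors: one random row per hash bucket.
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def word_vector(word):
    """Sum the vectors of the word's n-grams (hashed into NGRAM_TABLE)."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        # CRC32 is a stand-in hash chosen for this sketch.
        row = NGRAM_TABLE[zlib.crc32(gram.encode("utf-8")) % BUCKETS]
        vec = [a + b for a, b in zip(vec, row)]
    return vec


print(char_ngrams("where", 3, 3))    # ['<wh', 'whe', 'her', 'ere', 're>']
print(len(word_vector("wherever")))  # 8
```

Because word_vector only looks up n-grams, it returns a vector for any string, including one that was never in a training vocabulary. Hashing n-grams into a fixed number of buckets keeps the table size bounded, at the cost of occasional collisions between unrelated n-grams.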

For a business reader, FastText matters because it made high-quality word vectors practical for messy real-world text full of typos, product codes, and rare names, and for languages other than English. It was also released as an open-source library, which became widely used for text classification and embeddings before the transformer era took hold.
