“A Neural Probabilistic Language Model” by Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, published in the Journal of Machine Learning Research in 2003, is a direct ancestor of today’s language models. It confronts what the authors call “the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training.” Traditional n-gram models estimate probabilities from counts of short word sequences, so they can handle an unseen combination only by falling back on smoothing or shorter contexts; they have no notion that two different words might be similar.
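To make the counting problem concrete, here is a minimal sketch of an unsmoothed bigram model. The toy corpus and the helper name `bigram_prob` are illustrative assumptions, not anything from the paper, and real n-gram systems add smoothing or back-off precisely to soften the zero shown here.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat".split()

# Count adjacent word pairs, and how often each word appears as a left context.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev) from raw counts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

seen = bigram_prob("the", "cat")    # "the cat" occurs in the corpus
unseen = bigram_prob("the", "dog")  # "the dog" never occurs
print(seen, unseen)
```

The pair that never occurred gets probability zero, even though "dog" behaves much like "cat". Counts alone carry no information about that similarity.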
The paper’s breakthrough was to learn “a distributed representation for words,” a numeric vector for each word, jointly with a neural network that predicts the next word from the vectors of the preceding words. Because similar words end up with nearby vectors, the model can give reasonable probability to a brand-new sentence “if it is made of words that are similar to words forming an already seen sentence.” In other words, it generalizes through learned similarity rather than memorized counts, and it achieved lower perplexity than smoothed n-gram baselines on two text corpora, the Brown corpus and AP News.
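The forward pass of such a model can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors’ implementation: the vocabulary, the layer sizes, and the names `C`, `H`, `U`, `d`, `b`, and `next_word_probs` are chosen here for the example, the weights are random rather than trained, and the paper’s optional direct connections from input to output are left out.

```python
import math
import random

random.seed(0)

# Toy sizes, chosen only for illustration (not the paper's settings).
vocab = ["the", "cat", "dog", "sat", "ran"]
V = len(vocab)   # vocabulary size
m = 3            # dimension of each word vector
n_ctx = 2        # number of context words fed to the network
h = 4            # number of hidden units

def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these by training."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, m)           # one feature vector per word
H = rand_matrix(h, n_ctx * m)   # input-to-hidden weights
U = rand_matrix(V, h)           # hidden-to-output weights
d = [0.0] * h                   # hidden biases
b = [0.0] * V                   # output biases

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def next_word_probs(context):
    """Return P(next word | context) for every word in the vocabulary."""
    # Look up the vector of each context word and concatenate them.
    x = [value for word in context for value in C[vocab.index(word)]]
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    scores = [s + bias for s, bias in zip(matvec(U, hidden), b)]
    # Softmax, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = next_word_probs(["the", "cat"])

# Limiting case of "similar words have nearby vectors": give "dog" the same
# vector as "cat" and the model predicts identically for both contexts.
C[vocab.index("dog")] = list(C[vocab.index("cat")])
same = next_word_probs(["the", "dog"])
```

Because the network only ever sees the vectors, anything it learns about "the cat" transfers automatically to contexts whose words have nearby vectors, which is the generalization mechanism the paragraph above describes.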
This paper matters because nearly every modern language system, from word2vec to the transformer, rests on this twin idea of learned word vectors plus a neural predictor. Reading it shows that the core intuition behind ChatGPT was articulated more than two decades ago, well before the data and hardware existed to make it dominant.