Latent Dirichlet Allocation

“Latent Dirichlet Allocation” was published by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in the Journal of Machine Learning Research in 2003 (volume 3, pages 993-1022). It introduced LDA, described in the paper as “a generative probabilistic model for collections of discrete data such as text corpora,” and became the foundational method for topic modeling.

The core idea is that each document is a mixture of a small number of latent topics, and each topic is a distribution over words. A document about finance might draw most of its words from a “markets” topic and a “regulation” topic; a document about cooking would draw from quite different ones. LDA is unsupervised: it is never told what the topics are, and instead infers them by finding the topic-word distributions and per-document topic mixtures that best explain the observed documents. Formally, it is a three-level hierarchical Bayesian model, with parameters at the corpus, document, and word levels.
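The generative story behind this idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the vocabulary, the three topic distributions, and the names `VOCAB`, `TOPICS`, `ALPHA`, `sample_dirichlet`, and `generate_document` are all made up for the example, and the document length is fixed by the caller rather than drawn from a Poisson distribution as in the paper.

```python
import random

# Hypothetical toy vocabulary and topics, invented for illustration only.
VOCAB = ["stock", "bond", "law", "fine", "flour", "oven"]
TOPICS = [
    [0.5, 0.4, 0.05, 0.05, 0.0, 0.0],  # a "markets"-like topic
    [0.05, 0.05, 0.5, 0.4, 0.0, 0.0],  # a "regulation"-like topic
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],    # a "cooking"-like topic
]
ALPHA = [0.5, 0.5, 0.5]  # assumed Dirichlet prior; values below 1 favor sparse mixtures


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocab, alpha, n_words, rng):
    """Generate one document: topic proportions first, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    assignments, words = [], []
    for _ in range(n_words):
        # Word-level step 1: pick a topic according to the document's proportions.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Word-level step 2: pick a word from that topic's word distribution.
        w = rng.choices(vocab, weights=topics[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


if __name__ == "__main__":
    rng = random.Random(0)
    theta, assignments, words = generate_document(TOPICS, VOCAB, ALPHA, 12, rng)
    print("topic proportions:", [round(t, 3) for t in theta])
    print("words:", " ".join(words))
```

Fitting LDA amounts to running this story in reverse: only the words are observed, and the topic proportions, topic assignments, and topic-word distributions have to be inferred.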

Because exact posterior inference over these hidden variables is intractable, the authors developed efficient approximate inference based on variational methods, paired with an EM algorithm for empirical Bayes parameter estimation. They reported results on document modeling, text classification, and collaborative filtering, comparing LDA favorably with earlier approaches such as the mixture of unigrams and probabilistic latent semantic indexing.
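To make the intractability concrete, the quantities involved can be written out in the paper's notation, which this article does not otherwise introduce: α is the Dirichlet parameter, β the topic-word probability matrix, θ a document's topic proportions, z_n the topic assigned to word w_n, k the number of topics, and Ψ the digamma function. The first expression is the marginal likelihood of one document; the coupling of θ and β inside the integral is what rules out exact computation. The second line gives the per-document variational updates for the free parameters φ and γ, which are alternated until convergence.

```latex
% Marginal likelihood of one document w = (w_1, ..., w_N)
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Variational updates for one document
\phi_{ni} \propto \beta_{i w_n}
  \exp\!\left( \Psi(\gamma_i) - \Psi\!\Big( \sum_{j=1}^{k} \gamma_j \Big) \right),
\qquad
\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
```

In the EM procedure, these updates form the E-step for each document, and the M-step re-estimates α and β from the resulting φ and γ.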

LDA gave organizations a practical way to organize and summarize large unlabeled text collections (archives, customer feedback, scientific literature) without anyone tagging the themes in advance. It was the dominant approach to topic modeling through the 2000s and early 2010s, and it remains a reference point even after neural embeddings and large language models offered newer ways to capture meaning.

Sources

Related