“On Information and Sufficiency” by Solomon Kullback and Richard Leibler appeared in The Annals of Mathematical Statistics, volume 22, issue 1, in March 1951 (pages 79-86). It introduced the quantity now known as the Kullback-Leibler divergence, or KL divergence, and also called relative entropy.
Building on Claude Shannon’s information theory, the paper defined a measure of the “information for discrimination” between two probability distributions: how much evidence, on average, an observation gives for one distribution over another. The resulting number is never negative, is zero when the two distributions are identical, and grows as they diverge. It is not a true distance: it is not symmetric (the divergence from distribution A to B generally differs from the divergence from B to A), and it does not satisfy the triangle inequality. Even so, it has become the default way to quantify how far one distribution is from another.
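For readers who want to see the quantity concretely, here is a minimal sketch for discrete distributions, using the standard formula: the sum over outcomes of p(x) * log(p(x) / q(x)). The function name `kl_divergence` and the two example distributions are illustrative assumptions, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for two discrete distributions.

    p and q are lists of probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # by convention, 0 * log(0 / q) contributes 0
        if q_i == 0.0:
            return math.inf  # p puts mass where q puts none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical example: a fair coin versus a heavily biased coin.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(kl_divergence(fair, fair))    # 0.0: identical distributions
print(kl_divergence(fair, biased))  # about 0.511
print(kl_divergence(biased, fair))  # about 0.368, so the order matters
```

The last two lines give different values, which is the asymmetry described above.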
KL divergence sits at the heart of modern machine learning. Minimizing the KL divergence from the data distribution to a model is equivalent to maximum-likelihood training. It differs from the cross-entropy loss used to train classifiers and language models only by a term that does not depend on the model, so minimizing one minimizes the other. It also appears in variational inference, variational autoencoders, and the reinforcement-learning objectives used to fine-tune large language models.
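The link to cross-entropy can be checked numerically. Cross-entropy equals the entropy of the data distribution plus the KL divergence from the data distribution to the model, and the entropy term does not change when the model changes. The sketch below assumes discrete distributions with strictly positive model probabilities; the function names and the three distributions are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


# Hypothetical data distribution and two candidate models.
data = [0.7, 0.2, 0.1]
model_a = [0.6, 0.3, 0.1]
model_b = [0.2, 0.5, 0.3]

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    ce = cross_entropy(data, model)
    kl = kl_divergence(data, model)
    # ce - kl is the entropy of the data, the same for every model.
    print(name, "cross-entropy:", ce, "KL:", kl, "difference:", ce - kl)

print("entropy of data:", entropy(data))
```

Because the difference is the same constant for both models, ranking models by cross-entropy loss and ranking them by KL divergence give the same order.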
Why business readers should care: nearly every modern AI model is trained by minimizing a quantity derived from KL divergence, making this 1951 statistical idea one of the load-bearing concepts behind today’s systems.