“Distilling the Knowledge in a Neural Network” was submitted to arXiv on March 9, 2015 by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean of Google. It gave a clean formulation and name to knowledge distillation, the technique of compressing the behavior of a large, expensive model into a small, cheap one.
The core observation is that a trained classifier’s output is richer than its single top answer. When a model predicts a digit, it does not just say “7”; it also assigns small probabilities to “1” and “9”, and the pattern of those small probabilities encodes what the model has learned about how classes resemble each other. Hinton’s team called these “soft targets” and argued that they carry far more information than the hard one-hot label. So instead of training a small “student” network only on the original labels, you train it to match the probability distribution produced by a large “teacher” (or an ensemble of teachers). Both distributions are softened by raising the softmax temperature, which enlarges the small probabilities so they contribute meaningfully to the training signal. The student learns to generalize the way the teacher does, at a fraction of the size.
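To make the temperature idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's code: the function names, the example logits, and the values chosen for `temperature` and `alpha` are all hypothetical. The loss combines a soft-target cross-entropy at temperature T with an ordinary hard-label cross-entropy, and scales the soft term by T squared, which the paper recommends so that the soft-target gradients keep a comparable magnitude as T changes.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values give a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of a soft-target term (at the given temperature) and a
    hard-label term (at temperature 1). `alpha` and `temperature` are
    illustrative defaults, not values taken from the paper."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the teacher's and the student's softened distributions.
    soft_ce = -sum(p * math.log(q) for p, q in zip(teacher_soft, student_soft))
    # Ordinary cross-entropy against the true class.
    hard_ce = -math.log(softmax(student_logits)[hard_label])
    # The T^2 factor compensates for soft-target gradients shrinking as 1/T^2.
    return alpha * temperature ** 2 * soft_ce + (1 - alpha) * hard_ce

# Made-up teacher logits for the classes "1", "7", "9"; the true class is "7" (index 1).
teacher_logits = [2.0, 9.0, 4.0]
student_logits = [1.0, 5.0, 1.5]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(student_logits, teacher_logits, hard_label=1)
```

At temperature 1 the teacher puts almost all of its probability on “7”, while at temperature 4 the probabilities for “1” and “9” become large enough to influence the student. In practice this loss would be computed per batch inside a deep learning framework and minimized by gradient descent; the sketch only shows how the quantities fit together.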
The paper showed gains on MNIST and on a commercial speech acoustic model. It also introduced specialist sub-models that learn to disambiguate confusable classes and can be trained quickly and in parallel.
Distillation became one of the standard tools for shipping models cheaply. It is behind compact models like DistilBERT, behind the routine practice of shrinking large teacher models into small deployable ones, and behind much of how reasoning capability is transferred from frontier models into smaller open-weight ones. Like quantization, it is part of the toolkit that lets the capability of an expensive model run on modest hardware.