“Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, was published in the Journal of Machine Learning Research in 2014. It described one of the most widely used regularization tricks of the deep learning era, valued precisely for how simple it is.
The problem dropout addresses is overfitting: a large network can memorize quirks of its training data and fail to generalize. Dropout combats this by randomly “dropping” units on every training pass: each neuron, along with its connections, is temporarily removed with some fixed probability, independently of the others. The paper suggests that keeping hidden units with probability 0.5 works well across many tasks, while input units are usually best kept with a probability closer to 1. Because the network can never rely on any particular neuron being present, it cannot build up fragile co-adaptations where units work only in concert with specific others; instead it must learn redundant, more robust features.
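To make the dropping step concrete, here is a minimal sketch in plain Python. The function name `dropout_train`, the parameter `keep_prob`, and the sample activation values are assumptions chosen for illustration, not code from the paper. Each unit is kept with probability `keep_prob` and zeroed otherwise, and a fresh mask is drawn on every pass.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Return (thinned activations, mask) for one training pass.

    Each unit is kept independently with probability keep_prob;
    a dropped unit contributes 0 to the next layer.
    """
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    thinned = [a * m for a, m in zip(activations, mask)]
    return thinned, mask

# Illustrative values only: six hidden-unit activations for one example.
rng = random.Random(0)
hidden = [0.5, 1.2, 0.3, 2.0, 1.1, 0.4]

# A new random mask is sampled on each pass, so a different
# "thinned" network is trained every time.
for step in range(3):
    thinned, mask = dropout_train(hidden, keep_prob=0.5, rng=rng)
    print(f"pass {step}: mask={mask} thinned={thinned}")
```

Note that this sketch leaves the surviving activations unscaled during training, which matches the scheme described in this article. Many modern libraries instead rescale the survivors during training so that nothing needs to change at test time.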
At test time you do not drop anything. Instead you use the full network, with each unit's outgoing weights multiplied by the probability that the unit was kept during training, so that the expected input to the next layer matches what it saw while training. The authors showed this is a cheap approximation to averaging the predictions of the exponentially many “thinned” networks dropout implicitly trained, an ensemble effect achieved at the cost of a single model.
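The link between weight scaling and averaging can be checked directly on a toy case. The sketch below uses illustrative names and made-up numbers. It takes a single linear unit with four inputs, enumerates all 2^4 dropout masks, and compares the probability-weighted average output with the output of the full unit whose weights are multiplied by the keep probability. For a linear unit the two agree exactly, by linearity of expectation. In a deep nonlinear network the scaled model is only an approximation to the ensemble average.

```python
from itertools import product

def linear_unit(inputs, weights):
    """Weighted sum of the inputs (no bias, no nonlinearity)."""
    return sum(x * w for x, w in zip(inputs, weights))

def average_over_thinned(inputs, weights, keep_prob):
    """Exact expected output over all 2**n dropout masks on the inputs."""
    total = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        # Probability of drawing this particular mask.
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m else (1.0 - keep_prob)
        masked = [x * m for x, m in zip(inputs, mask)]
        total += prob * linear_unit(masked, weights)
    return total

def scaled_weight_output(inputs, weights, keep_prob):
    """Test-time rule: keep every unit, multiply weights by keep_prob."""
    return linear_unit(inputs, [w * keep_prob for w in weights])

# Illustrative values only.
inputs = [0.5, 1.2, 0.3, 2.0]
weights = [0.8, -0.4, 1.5, 0.1]
keep_prob = 0.5

ensemble = average_over_thinned(inputs, weights, keep_prob)
scaled = scaled_weight_output(inputs, weights, keep_prob)
print(f"average over thinned units: {ensemble}")
print(f"single scaled unit:         {scaled}")
```

Enumerating every mask is feasible only for a handful of units, which is the point: the scaled network gives a comparable answer in one forward pass.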
The paper reported gains across vision, speech recognition, document classification, and computational biology, setting state-of-the-art results on several benchmarks. For a general reader, dropout is a neat example of how deliberately injecting noise and unreliability into training can make the final system more dependable.