“When Does Label Smoothing Help?” was submitted to arXiv on June 6, 2019 by Rafael Müller, Simon Kornblith, and Geoffrey Hinton at Google Brain. It examined a small training trick that had quietly become common practice and explained both why it works and where it backfires.
Label smoothing changes the targets a classifier is trained on. Instead of telling the network the correct class is exactly one and every other class is exactly zero, it softens these “hard” labels, assigning the right class a value slightly below one and spreading a small amount of probability across the rest. The technique was already known to help models generalize. The authors showed that it also makes them better calibrated, meaning their confidence scores more honestly reflect how often they are right. Better calibration in turn improves tasks such as beam search in machine translation.
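The softened targets described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It assumes the common formulation in which a fraction alpha of the probability mass is spread uniformly over all K classes, so the correct class ends up with 1 - alpha + alpha/K. The function names and the value alpha = 0.1 are illustrative choices.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Turn integer class labels into smoothed target distributions.

    Assumed formulation: (1 - alpha) * one_hot + alpha / num_classes.
    """
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - alpha) + alpha / num_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and the target distributions."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())

# Two examples, four classes; the true classes are 2 and 0.
labels = np.array([2, 0])
targets = smooth_labels(labels, num_classes=4, alpha=0.1)

# Made-up logits, only to show how the smoothed targets enter the loss.
logits = np.array([[0.5, 0.1, 3.0, -1.0],
                   [2.0, 0.3, 0.2, 0.1]])
loss = cross_entropy(logits, targets)
```

With four classes and alpha = 0.1, each row of `targets` still sums to one, but the correct class receives 0.925 and every other class receives 0.025. Because the target never reaches exactly one, the loss no longer rewards the network for pushing the correct logit arbitrarily far above the others.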
The paper’s most interesting finding is a downside. By visualizing the activations of the network’s penultimate layer, the authors showed that label smoothing pushes examples of the same class into tight, well-separated clusters. That tightness does not hurt generalization or calibration, but it erases fine information about how classes relate to one another. As a result, a teacher network trained with label smoothing is a noticeably worse source for knowledge distillation: the student learns less, because the subtle inter-class signal it would normally inherit has been smoothed away.
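To make the “inter-class signal” concrete, the sketch below shows the kind of soft targets a student typically receives in knowledge distillation: the teacher’s softmax outputs computed at a raised temperature. It is an illustration under stated assumptions, not the paper’s experimental setup. The logits, the three class names, and the temperature of 4 are all made up.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values flatten the output."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical teacher logits for one image of a cat,
# over the classes [cat, dog, truck].
teacher_logits = [6.0, 3.0, -2.0]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
```

In this example the teacher gives “dog” far more probability than “truck”, and raising the temperature makes that difference visible to the student. If the teacher’s logits for the wrong classes all look alike, which is the effect the paper attributes to label smoothing, the soft targets carry little beyond the label itself.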
For a general reader, the paper is a good example of careful science around a practical tweak, showing that the same change can help one goal, accurate and well-calibrated prediction, while quietly undermining another, transferring knowledge to a smaller model.