Decoupled Weight Decay Regularization (AdamW)

“Decoupled Weight Decay Regularization” was submitted to arXiv on November 14, 2017 by Ilya Loshchilov and Frank Hutter. It corrected a subtle but consequential mistake in how the popular Adam optimizer was commonly implemented and used, and the fix it proposed, known as AdamW, is now a standard choice for training large neural networks, including most modern language models.

The paper’s core observation is that two things often treated as identical are not. L2 regularization adds a penalty on large weights to the loss function, so the penalty reaches the weights through the gradient. Weight decay instead shrinks every weight by a small fraction directly at each update step. For plain stochastic gradient descent the two are mathematically equivalent once the decay factor is rescaled by the learning rate, so the field had grown used to using the terms interchangeably. But for adaptive optimizers like Adam, which rescale each parameter’s update by its own running gradient statistics, the equivalence breaks: folding the penalty into the gradient means the regularization term is divided by the same per-parameter statistics, so weights with a history of large gradients are regularized less than weights with small ones.
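The SGD equivalence described above can be checked numerically. The following is a minimal sketch for a single scalar weight, not code from the paper; the function names and the numeric values are illustrative assumptions. It shows that the two updates agree when the decay fraction is set to the learning rate times the L2 coefficient.

```python
def sgd_l2_step(w, grad, lr, l2):
    # L2 regularization: the penalty 0.5 * l2 * w**2 adds l2 * w to the gradient.
    return w - lr * (grad + l2 * w)


def sgd_decay_step(w, grad, lr, decay):
    # Weight decay: shrink the weight by a fixed fraction, then take a plain step.
    return (1.0 - decay) * w - lr * grad


# Illustrative values (assumptions, not taken from the paper).
w, grad, lr, l2 = 2.0, 0.5, 0.1, 0.01

w_l2 = sgd_l2_step(w, grad, lr, l2)
w_decay = sgd_decay_step(w, grad, lr, decay=lr * l2)  # decay rescaled by lr

print(w_l2, w_decay)  # the two results agree up to floating-point rounding
```

The match depends on the rescaling `decay = lr * l2`, which is also why, under this equivalence, the best L2 coefficient and the learning rate end up coupled.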

Loshchilov and Hutter showed that decoupling weight decay from the gradient-based update, that is, applying it as a separate shrinkage step outside the adaptive rescaling, restores the intended behavior and lets the decay strength be tuned largely independently of the learning rate. Empirically this closed much of the gap by which Adam had been underperforming SGD with momentum on image classification, and the modification was quickly adopted into TensorFlow and PyTorch.
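The difference between the coupled and decoupled forms can be sketched with a scalar Adam update. This is an illustrative toy, not the authors' reference implementation: the helper names `adam_step` and `run`, the alternating zero-mean gradient, and all hyperparameter values are assumptions chosen for the demonstration. The decoupled shrinkage is scaled by the learning rate here, a convention also used by PyTorch's AdamW.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    """One scalar Adam update (hypothetical helper for illustration).

    l2 > 0 folds the penalty into the gradient (Adam with L2 regularization);
    weight_decay > 0 applies a separate shrinkage step (AdamW-style).
    """
    g = grad + l2 * w                       # coupled: penalty enters the gradient
    m = beta1 * m + (1 - beta1) * g         # first-moment running average
    v = beta2 * v + (1 - beta2) * g * g     # second-moment running average
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    shrink = lr * weight_decay * w          # decoupled: bypasses the rescaling
    return w - adaptive - shrink, m, v


def run(grad_scale, steps=200, **reg):
    # Toy loss gradient: alternates +grad_scale, -grad_scale (zero mean),
    # so any systematic shrinkage comes mostly from the regularizer.
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = grad_scale if t % 2 == 1 else -grad_scale
        w, m, v = adam_step(w, grad, m, v, t, **reg)
    return w


# Same starting weight, gradients differing in scale by a factor of 100.
l2_small = run(0.1, l2=0.1)
l2_large = run(10.0, l2=0.1)
wd_small = run(0.1, weight_decay=0.1)
wd_large = run(10.0, weight_decay=0.1)

print("Adam + L2:", l2_small, l2_large)
print("AdamW    :", wd_small, wd_large)
```

In this toy setup the L2 variant shrinks the small-gradient weight far more than the large-gradient one, because the penalty is divided by each weight's own gradient scale. The decoupled variant shrinks both weights by essentially the same amount.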

The episode is a useful reminder for any technical reader that widely repeated “common knowledge” can be quietly wrong, and that a careful look at the math behind a default setting can yield a real, widely adopted improvement.
