Explaining and Harnessing Adversarial Examples

“Explaining and Harnessing Adversarial Examples” was submitted to arXiv on December 20, 2014 by Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, all then at Google. It is the source of the field’s signature illustration: a photo of a panda that a classifier labels correctly is combined with a faint perturbation, producing an image that looks identical to a human but that the network labels “gibbon” with high confidence (99.3% in the paper’s figure).

The 2013 discovery paper (Szegedy et al., “Intriguing properties of neural networks”) had left the cause of adversarial examples open, speculating about the extreme nonlinearity of deep networks. This paper argued the opposite. Its central claim was that “the primary cause of neural networks’ vulnerability to adversarial perturbation is their linear nature.” In high-dimensional input spaces, many changes that are each individually negligible can add up, through a linear function, to a large change in the network’s output. That linear view also explained why the same adversarial example often transfers across models with different architectures or training sets: they learn similar linear behaviors.
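The linear argument can be written out in a few lines. The sketch below follows the paper's reasoning for a single linear unit; the symbols w (weight vector), x (input), η (perturbation), ε (the max-norm bound on the perturbation), n (input dimension), and m (average magnitude of a weight) are the notation assumed here.

```latex
% Response of a linear unit to a perturbed input \tilde{x} = x + \eta:
w^\top \tilde{x} = w^\top x + w^\top \eta,
\qquad \|\eta\|_\infty \le \epsilon .

% Under the max-norm constraint, the increase is largest when each
% component of \eta takes the sign of the matching weight:
\eta = \epsilon \,\operatorname{sign}(w)
\quad\Longrightarrow\quad
w^\top \eta = \epsilon \,\|w\|_1 = \epsilon\, m\, n .
```

The bound ε on each component does not depend on n, but the change in activation grows linearly with n, so a perturbation too small to notice in any one pixel can still dominate the output when the input has many dimensions.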

To make the argument concrete, the authors introduced the Fast Gradient Sign Method (FGSM), a one-step way to generate an adversarial example: compute the gradient of the loss with respect to the input, then move every input pixel by the same small amount ε in the direction of that gradient’s sign, giving the perturbation η = ε · sign(∇x J(θ, x, y)). FGSM is cheap, easy to implement, and became the standard baseline attack and a common ingredient in adversarial training.
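As a minimal sketch of the idea, the following applies FGSM to a toy logistic-regression classifier using only the Python standard library. The model, its weights, the input, and the helper names (`predict`, `loss_grad_wrt_input`, `fgsm`) are illustrative assumptions, not code from the paper; for logistic regression the input gradient of the cross-entropy loss has the closed form (p − y)·w, so no autodiff library is needed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.
    For logistic regression this is (p - y) * w."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One FGSM step: move each input component by epsilon in the
    direction of the sign of the loss gradient."""
    grad = loss_grad_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy setup: 1000 input dimensions, weights of +/-0.1.
n = 1000
w = [0.1 if i % 2 == 0 else -0.1 for i in range(n)]
b = 0.0
# An input the model assigns to class 1 (true label y = 1).
x = [0.51 if wi > 0 else 0.49 for wi in w]
y = 1
epsilon = 0.02

x_adv = fgsm(w, b, x, y, epsilon)
p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))

print("clean probability of class 1:", p_clean)
print("adversarial probability of class 1:", p_adv)
print("largest change to any input component:", max_change)
```

In this toy setup no input component moves by more than ε, yet the predicted class flips, because the per-component changes accumulate across all 1000 dimensions. For a deep network the gradient would come from backpropagation rather than a closed form.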

The paper also showed that training on adversarial examples could improve robustness and even act as a regularizer, establishing adversarial training as a practical defense. Together with the 2013 paper, it framed the attack-and-defense dynamic that has defined adversarial machine learning ever since.
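The adversarial training objective from the paper can be stated compactly. In the form below, J is the original loss, θ the model parameters, (x, y) a training example, ε the perturbation size, and α a mixing weight; the paper reports using α = 0.5.

```latex
\tilde{J}(\theta, x, y)
  = \alpha\, J(\theta, x, y)
  + (1 - \alpha)\, J\bigl(\theta,\; x + \epsilon \operatorname{sign}\bigl(\nabla_x J(\theta, x, y)\bigr),\; y\bigr)
```

Because the adversarial term is recomputed from the current parameters at every step, the model is trained against a continually updated supply of perturbed inputs instead of a fixed augmented dataset.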

Sources

Last verified June 7, 2026