Intriguing Properties of Neural Networks

“Intriguing properties of neural networks” was submitted to arXiv on December 21, 2013 by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, who were then affiliated with Google, New York University, the University of Montreal, and Facebook. It is the paper that introduced what the field now calls adversarial examples, and it remains one of the most cited results in machine learning security.

The authors reported two surprising findings about deep neural networks. The first concerned interpretation: individual high-level neurons carried no more semantic meaning than random linear combinations of neurons, suggesting that “it is the space, rather than the individual units, that contains the semantic information.” The second finding became the lasting one. The networks learned input-output mappings that were, in the authors’ words, “fairly discontinuous.” By searching for a small perturbation that maximized the network’s prediction error, they could take an image the network classified correctly, add a change too small for a human to notice, and cause confident misclassification.
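The perturbation search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper used box-constrained L-BFGS on deep image classifiers, whereas this toy swaps in plain projected gradient descent on a two-class linear softmax model, with a squared L2 penalty on the perturbation. The function name `find_perturbation`, the weights `W`, the input `x`, and the constants `c`, `lr`, and `steps` are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def find_perturbation(W, b, x, target, c=10.0, lr=0.005, steps=200):
    """Hypothetical helper: minimize c * ||r||^2 + cross_entropy(f(x + r), target)
    by gradient descent, keeping x + r inside the box [0, 1]."""
    r = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(steps):
        p = softmax(W @ (x + r) + b)
        # Input gradient of cross-entropy for a linear softmax model,
        # plus the gradient of the squared-norm penalty on r.
        grad = W.T @ (p - onehot) + 2.0 * c * r
        r = r - lr * grad
        # Project back so the perturbed input stays a valid "image".
        r = np.clip(x + r, 0.0, 1.0) - x
    return r

# Toy data (assumed): 100 "pixels" and an alternating-sign weight pattern.
d = 100
s = np.where(np.arange(d) % 2 == 0, 1.0, -1.0)
W = np.stack([s, -s])      # class 0 scores +s.x, class 1 scores -s.x
b = np.zeros(2)
x = 0.5 + 0.02 * s         # classified as class 0 with high confidence

r = find_perturbation(W, b, x, target=1)
clean_label = int(np.argmax(W @ x + b))
adv_label = int(np.argmax(W @ (x + r) + b))
max_change = float(np.max(np.abs(r)))

print("clean label:", clean_label)
print("perturbed label:", adv_label)
print("largest per-coordinate change:", max_change)
```

In this toy setting the predicted label flips from class 0 to class 1 even though no single coordinate moves by more than 0.03 on a [0, 1] scale. The penalty weight `c` controls the trade-off: a larger value keeps the perturbation smaller but may fail to change the prediction.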

The most unsettling detail was transferability. The same crafted perturbation that fooled one network often fooled a different network trained on a different subset of the data. This implied the vulnerability was not a quirk of one model’s training run but a structural property of the models themselves, which made it both scientifically interesting and practically dangerous.

The paper opened an entire research area. It set up the central question that the follow-up work “Explaining and Harnessing Adversarial Examples” tried to answer the next year, and it seeded a now-vast literature on attacks, defenses, and the robustness of machine learning systems.
