Adversarial Examples

An adversarial example is an input that has been deliberately modified, usually by an amount too small for a person to notice, so that a machine learning model makes a confident mistake. The classic demonstration is an image of a panda that a network classifies correctly, to which an attacker adds a faint pattern; the result looks identical to a human, but the network now labels it a gibbon. The phenomenon was first reported in the 2013 paper “Intriguing properties of neural networks.” The following year, “Explaining and Harnessing Adversarial Examples” offered an explanation, tracing it to the largely linear behavior of models in high-dimensional input spaces, and introduced a simple recipe for producing such inputs, the fast gradient sign method (FGSM).
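The linearity argument can be written in a few lines. The following is a sketch under stated assumptions rather than a full derivation: the symbols are introduced here for illustration, with w a weight vector, x the input, eta a perturbation whose every coordinate is at most epsilon in size, n the input dimension, m the average magnitude of the weights, and J the training loss with parameters theta and label y.

```latex
\begin{align}
  w^\top (x + \eta) &= w^\top x + w^\top \eta, \\
  \eta = \epsilon \,\operatorname{sign}(w)
    \;&\Rightarrow\; w^\top \eta = \epsilon \,\lVert w \rVert_1 = \epsilon\, m\, n, \\
  x_{\mathrm{adv}} &= x + \epsilon \,\operatorname{sign}\!\bigl(\nabla_x J(\theta, x, y)\bigr).
\end{align}
```

The second line shows why dimension matters: each coordinate moves by only epsilon, yet the change in the model's score grows in proportion to n. The third line is the FGSM update, which applies the same idea to a nonlinear model by using the sign of the loss gradient in place of the sign of w.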

The perturbations are typically found by following the gradient of the model’s loss with respect to the input, moving in the direction that most increases its error. Two properties make this more than a curiosity. First, the changes can be made imperceptibly small while still flipping the prediction. Second, an example crafted to fool one model often fools other models trained separately, a property called transferability, which means an attacker does not always need access to the exact system being targeted.
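The gradient-following step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a toy logistic-regression model, and the function names (`predict_proba`, `fgsm`), the weights, the input, and the step size `eps` are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(w @ x + b)

def fgsm(x, y, w, b, eps):
    """One fast-gradient-sign step: move every input coordinate by eps
    in the direction that increases the cross-entropy loss for label y."""
    p = predict_proba(x, w, b)
    grad_x = (p - y) * w          # gradient of the loss with respect to x
    return x + eps * np.sign(grad_x)

# Toy setup; every value here is an illustrative assumption.
n = 1000                                          # number of input "pixels"
w = 0.05 * np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
b = 0.0
x = 0.5 + 0.01 * np.sign(w)                       # clean input on a [0, 1] scale
y = 1.0                                           # true label

eps = 0.02                                        # largest change per coordinate
x_adv = fgsm(x, y, w, b, eps)

p_clean = predict_proba(x, w, b)
p_adv = predict_proba(x_adv, w, b)
max_change = np.max(np.abs(x_adv - x))

print(f"clean p(class 1) = {p_clean:.3f}")
print(f"adv   p(class 1) = {p_adv:.3f}")
print(f"largest per-coordinate change = {max_change:.3f}")
```

In this toy setup the class-1 probability falls from about 0.622 to about 0.378, so the predicted class flips even though no coordinate moves by more than 0.02. The shift in the model's score equals `eps` times the sum of the absolute weights, which is why a small per-coordinate change can add up to a large effect when the input has many dimensions.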

Adversarial examples are not confined to clean digital images. Researchers have produced physical attacks, such as stickers on a stop sign that cause a vision system to misread it; universal perturbations, where a single pattern fools a model on most images; and, more recently, adversarial suffixes that jailbreak large language models. The same gradient-based logic carries across modalities. Many proposed defenses have been broken by later, stronger attacks, and no known defense keeps accuracy under attack close to accuracy on clean inputs.

Why business readers should care: any AI system that makes decisions about untrusted inputs (content moderation, fraud detection, biometric access, autonomous driving) can in principle be probed for adversarial inputs by a motivated attacker. The practical lesson is to assume the model can be fooled, keep a human or a second control in the loop for high-stakes decisions, and treat robustness as a property to be tested adversarially rather than assumed.

Sources

Last verified June 7, 2026