Towards Evaluating the Robustness of Neural Networks (C&W Attack)

“Towards Evaluating the Robustness of Neural Networks” was submitted to arXiv on August 16, 2016 by Nicholas Carlini and David Wagner at UC Berkeley, and presented at IEEE Security and Privacy 2017. It is the origin of what the field calls the C&W attacks, a family of optimization-based methods for crafting adversarial examples.

The authors framed adversarial example generation as an optimization problem: find the smallest change to an input, measured in the L0, L2, or L-infinity norm, that causes a misclassification. To make the search efficient, they used an objective defined on the network's logits and a change-of-variables trick that keeps the perturbed input within valid bounds. Their attacks found adversarial examples with 100 percent success against the networks they tested. Crucially, they used these attacks to show that defensive distillation, a defense that had been proposed to harden networks, provided little real robustness once a strong attack was applied.
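The L2 version of this optimization can be sketched in a few lines. What follows is a minimal illustration, not the authors' implementation. It assumes a toy linear classifier with logits Z(x) = Wx + b so the gradient can be written by hand, inputs scaled to [0, 1], and plain gradient descent with a fixed trade-off constant c (the paper uses the Adam optimizer and a binary search over c). The function name cw_l2_sketch and all default parameter values are hypothetical.

```python
import numpy as np


def cw_l2_sketch(x, target, W, b, c=10.0, kappa=0.0, lr=0.05, steps=2000):
    """Targeted C&W-style L2 attack on a toy linear model (illustrative only).

    Minimizes ||x_adv - x||^2 + c * max(max_{i != t} Z_i - Z_t, -kappa),
    where Z(x) = W @ x + b and x_adv = 0.5 * (tanh(w) + 1).
    Returns the closest adversarial input found, or None if none was found.
    """
    # Change of variables: optimizing over w keeps x_adv inside [0, 1]
    # without any explicit clipping step.
    w = np.arctanh(np.clip(2.0 * x - 1.0, -0.999999, 0.999999))
    best_dist, best_adv = np.inf, None

    for _ in range(steps):
        x_adv = 0.5 * (np.tanh(w) + 1.0)
        logits = W @ x_adv + b

        # Strongest competing class, with the target class masked out.
        masked = logits.copy()
        masked[target] = -np.inf
        other = int(np.argmax(masked))
        margin = logits[other] - logits[target]

        # Keep the smallest perturbation that is classified as the target.
        dist = float(np.sum((x_adv - x) ** 2))
        if margin < 0 and dist < best_dist:
            best_dist, best_adv = dist, x_adv.copy()

        # Gradient of the objective with respect to x_adv.
        grad_x = 2.0 * (x_adv - x)
        if margin > -kappa:  # hinge term is active
            grad_x = grad_x + c * (W[other] - W[target])

        # Chain rule through the tanh change of variables.
        w = w - lr * grad_x * 0.5 * (1.0 - np.tanh(w) ** 2)

    return best_adv


# Assumed toy setup: two features, two classes, identity weights.
x = np.array([0.7, 0.3])          # originally classified as class 0
W, b = np.eye(2), np.zeros(2)
x_adv = cw_l2_sketch(x, target=1, W=W, b=b)
```

In this toy setup the decision boundary is the line where both features are equal, so the closest adversarial input lies just past (0.5, 0.5). The confidence parameter kappa corresponds to the margin the paper uses to request more strongly misclassified examples; leaving it at zero asks only for the example to cross the boundary.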

The paper’s lasting contribution is methodological. It argued that defenses must be evaluated against strong, adaptive attacks rather than weak ones, and the C&W attacks became a standard yardstick for that evaluation. Many defenses published afterward were broken when measured against C&W or similar adaptive attacks.

For a business reader, the takeaway is that a security claim about an AI model is only as good as the strength of the attack used to test it. A defense that stops a weak attack can collapse against a determined adversary, so rigorous, worst-case evaluation is essential before trusting a model in an adversarial setting.
