“Practical Black-Box Attacks against Machine Learning” was submitted to arXiv on February 8, 2016 by Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami, and presented at ASIA CCS 2017. It showed that you do not need to know a model’s internals to attack it.
Earlier adversarial-example work mostly assumed white-box access, where the attacker can see the model’s parameters and gradients. This paper attacked the black-box case, where the attacker can only send inputs and observe the labels the system returns. The method trains a local “substitute” model: the attacker queries the target with synthetically generated inputs, uses the target’s answers as labels, and trains its own model to imitate the target. The attacker then crafts adversarial examples against the substitute. Because adversarial examples transfer between models that solve the same task, those examples also fool the original target.
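The loop described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the "target" is a hypothetical hidden linear classifier standing in for a remote prediction API, the substitute is a plain logistic regression, the dataset-growing step is a simplified version of the gradient-guided augmentation idea, and the adversarial step is a fast-gradient-sign perturbation. The names `run_attack`, `oracle`, and `train_substitute`, and all parameter values, are made up for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_substitute(x, y, epochs=300, lr=0.5):
    """Fit a logistic-regression substitute to oracle-labelled points."""
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        # Gradient of the mean cross-entropy loss with respect to w.
        w -= lr * x.T @ (sigmoid(x @ w) - y) / len(x)
    return w


def run_attack(dim=20, seed_size=100, rounds=4, lam=0.1, eps=0.5, seed=0):
    rng = np.random.default_rng(seed)

    # Hypothetical black-box target: a hidden linear classifier.
    # The attacker only ever calls oracle(), which returns hard 0/1 labels.
    hidden_w = rng.normal(size=dim)

    def oracle(x):
        return (x @ hidden_w > 0).astype(float)

    # 1. Label a small seed set by querying the target.
    x = rng.normal(size=(seed_size, dim))
    y = oracle(x)

    # 2. Alternate between training the substitute and growing the dataset.
    for _ in range(rounds):
        w = train_substitute(x, y)
        # Step each point along the sign of the substitute's gradient for
        # its oracle-assigned class (+w for class 1, -w for class 0),
        # then ask the target to label the new points.
        class_sign = np.where(y == 1.0, 1.0, -1.0)[:, None]
        x_new = x + lam * class_sign * np.sign(w)
        x = np.vstack([x, x_new])
        y = np.concatenate([y, oracle(x_new)])
    w = train_substitute(x, y)

    # 3. Craft adversarial examples against the substitute only.
    #    The cross-entropy gradient with respect to x is (p - y) * w, so its
    #    sign is -sign(w) for class 1 and +sign(w) for class 0.
    x_test = rng.normal(size=(500, dim))
    y_test = oracle(x_test)  # labels the target gives the clean inputs
    push = np.where(y_test == 1.0, -1.0, 1.0)[:, None]
    x_adv = x_test + eps * push * np.sign(w)

    # 4. Transfer: fraction of adversarial inputs the target now mislabels.
    transfer_rate = float(np.mean(oracle(x_adv) != y_test))
    return transfer_rate, len(x)


if __name__ == "__main__":
    rate, queries = run_attack()
    print(f"training queries: {queries}, transfer rate: {rate:.2%}")
```

Note that `hidden_w` is never read outside `oracle`, which mirrors the black-box constraint: everything the attacker learns comes from labels returned for its own queries. A real attack would replace `oracle` with calls to the remote service and would typically use a neural-network substitute.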
The authors demonstrated this against real deployed systems: a classifier hosted by MetaMind misclassified 84.24 percent of the adversarial examples crafted on the substitute, and models hosted on Amazon’s and Google’s machine-learning services misclassified 96.19 percent and 88.94 percent, respectively, all without any knowledge of the targets’ architecture or weights. This transferability is what makes adversarial examples a practical, not just theoretical, threat.
For a business reader, the lesson is uncomfortable: keeping your model private does not keep it safe. An attacker who can only use your prediction API can still learn enough to build inputs that reliably fool it, so security through obscurity is not a defense against adversarial inputs.