“Universal and Transferable Adversarial Attacks on Aligned Language Models” was submitted to arXiv on July 27, 2023. Its authors, Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson, were affiliated with Carnegie Mellon University, the Center for AI Safety, Google DeepMind, and the Bosch Center for AI. The paper carried the gradient-based adversarial-example idea from computer vision over to the safety guardrails of large language models, and it generated attacks automatically rather than relying on hand-written jailbreaks.
The method, which the authors call Greedy Coordinate Gradient (GCG), combines greedy and gradient-based search to find an adversarial suffix: a short string of seemingly nonsensical tokens that is appended to a harmful request. The suffix is optimized to maximize the probability that the model begins its reply with an affirmative response (for example, “Sure, here is ...”) rather than a refusal. With the suffix appended, an otherwise safety-trained model often complies with requests it would normally decline.
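The greedy, gradient-guided search described above can be illustrated on a toy problem. The sketch below is not the authors' implementation and does not touch a real language model: the "model" is a hypothetical hand-built scoring function over token ids, and names such as `make_toy_model`, `onehot_gradient`, and `greedy_gradient_step` are assumptions introduced purely for illustration. It shows the general loop only: take the gradient of the loss with respect to a one-hot encoding of each suffix token, shortlist the most promising replacements per position, evaluate a random batch of single-token swaps exactly, and keep the best one. As a simplification, a swap is kept only when it lowers the loss.

```python
import math
import random

VOCAB, SUFFIX_LEN = 30, 6  # toy sizes, chosen arbitrarily for illustration


def make_toy_model(seed=0):
    """Hypothetical stand-in for a model: random per-position and pairwise weights."""
    rng = random.Random(seed)
    w = [[rng.gauss(0, 1) for _ in range(VOCAB)] for _ in range(SUFFIX_LEN)]
    u = [rng.gauss(0, 1) for _ in range(VOCAB)]
    return w, u


def score(model, suffix):
    """Toy 'affirmative response' logit: per-token terms plus adjacent-token interactions."""
    w, u = model
    s = sum(w[i][t] for i, t in enumerate(suffix))
    s += sum(u[a] * u[b] for a, b in zip(suffix, suffix[1:]))
    return s


def loss(model, suffix):
    """Negative log-probability of the affirmative outcome, -log(sigmoid(score))."""
    s = score(model, suffix)
    return math.log1p(math.exp(-s)) if s > 0 else -s + math.log1p(math.exp(s))


def onehot_gradient(model, suffix):
    """Gradient of the loss w.r.t. a one-hot relaxation of each suffix position."""
    w, u = model
    s = score(model, suffix)
    dloss_dscore = -1.0 / (1.0 + math.exp(min(s, 700.0)))  # -(1 - sigmoid(s))
    grad = []
    for i in range(len(suffix)):
        left = u[suffix[i - 1]] if i > 0 else 0.0
        right = u[suffix[i + 1]] if i + 1 < len(suffix) else 0.0
        grad.append([dloss_dscore * (w[i][v] + u[v] * (left + right))
                     for v in range(VOCAB)])
    return grad


def greedy_gradient_step(model, suffix, rng, top_k=8, batch=32):
    """One search step: gradient shortlist, random single-token swaps, exact re-scoring."""
    grad = onehot_gradient(model, suffix)
    # Per position, the top_k tokens whose gradient predicts the largest loss decrease.
    shortlist = [sorted(range(VOCAB), key=g.__getitem__)[:top_k] for g in grad]
    best, best_loss = list(suffix), loss(model, suffix)
    for _ in range(batch):
        i = rng.randrange(len(suffix))
        trial = list(suffix)
        trial[i] = rng.choice(shortlist[i])
        trial_loss = loss(model, trial)  # the gradient is only a guide; score exactly
        if trial_loss < best_loss:       # simplification: accept only improvements
            best, best_loss = trial, trial_loss
    return best, best_loss


def optimize_suffix(model, steps=50, seed=1):
    """Run the search from a random suffix and record the loss after every step."""
    rng = random.Random(seed)
    suffix = [rng.randrange(VOCAB) for _ in range(SUFFIX_LEN)]
    history = [loss(model, suffix)]
    for _ in range(steps):
        suffix, current = greedy_gradient_step(model, suffix, rng)
        history.append(current)
    return suffix, history


if __name__ == "__main__":
    final_suffix, losses = optimize_suffix(make_toy_model())
    print("suffix token ids:", final_suffix)
    print("loss before/after:", losses[0], losses[-1])
```

The gradient is used only to shortlist candidates; every candidate is then re-scored exactly. Discrete token swaps are large moves, so a first-order estimate can be misleading, which is why the sketch never trusts it for the final choice.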
The most striking result was transferability. Suffixes optimized against open-weight models such as Llama-2-Chat and Vicuna also induced harmful outputs from commercial closed systems to which the authors had no gradient access, including ChatGPT, Bard, and Claude. This showed that jailbreaks can be generated systematically and at scale and can generalize across models, rather than each one having to be discovered by a human through trial and error.
The work, published with an accompanying website at llm-attacks.org, reframed LLM safety as an adversarial robustness problem, closely analogous to the one studied in a decade of prior work on adversarial examples in vision. It also underscored that current alignment training does not provide a reliable defense against optimized attacks.