Many-Shot Jailbreaking

Many-shot jailbreaking is a technique for getting a safety-trained language model to answer requests it is supposed to refuse. Anthropic described it in research published on April 2, 2024. The attack exploits the very long context windows that recent models support, turning a capability into a vulnerability.

The method works by filling the prompt with a large number of fabricated dialogue turns, each one a “shot” in which a fictional assistant readily answers a harmful question. After many such examples, the attacker appends the real harmful request. Anthropic found that with enough faux dialogues (up to 256 in their tests), the model’s safety training is overridden and it tends to comply with the final request. A handful of examples does not trigger the behavior, but the attack becomes more effective as the number of shots increases, following a power law. In effect, the model’s strong in-context learning, its ability to pick up patterns from examples in the prompt, is used against its alignment training.

Anthropic published the work deliberately, reasoning that the vulnerability was likely to be discovered independently as long-context models proliferated and that disclosure would speed up defenses across the industry. They reported that mitigation techniques based on classifying and modifying prompts reduced one attack’s success rate from 61 percent to 2 percent.
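To make the idea of classifying prompts concrete, here is a minimal Python sketch of a crude screening heuristic: it counts lines that look like the start of a dialogue turn and flags prompts that embed an unusually large number of them. This is not Anthropic's mitigation, whose details are not given here. The role labels, the function names, and the threshold of 32 turns are all assumptions chosen for illustration.

```python
import re

# Assumed role labels; real prompts may mark turns differently.
ROLE_PATTERN = re.compile(
    r"^[ \t]*(user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(prompt: str) -> int:
    """Count lines that begin with something resembling a dialogue role label."""
    return len(ROLE_PATTERN.findall(prompt))


def looks_many_shot(prompt: str, max_turns: int = 32) -> bool:
    """Flag a prompt whose embedded turn count exceeds max_turns.

    max_turns is an arbitrary illustrative threshold, not a published value.
    """
    return count_embedded_turns(prompt) > max_turns


if __name__ == "__main__":
    # Benign placeholder dialogue, used only to exercise the counter.
    long_prompt = "\n".join(
        f"User: question {i}\nAssistant: answer {i}" for i in range(40)
    )
    print(count_embedded_turns(long_prompt), looks_many_shot(long_prompt))
    print(looks_many_shot("User: What is the capital of France?"))
```

A rule this simple would also flag legitimate long transcripts and would miss turns written with other markers, so it is best read as an illustration of where a prompt-level check sits in the pipeline, not as a workable defense.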

For a business reader, many-shot jailbreaking illustrates that scaling up a useful capability can open new attack surfaces, and that safety measures must be re-evaluated as models gain features such as longer context rather than assumed to carry over unchanged.
