Scalable Extraction of Training Data from Production Language Models

“Scalable Extraction of Training Data from (Production) Language Models” was submitted to arXiv on November 28, 2023 by Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee, with affiliations spanning Google DeepMind, the University of Washington, Cornell, Carnegie Mellon, UC Berkeley, and ETH Zurich. It scaled up the training-data extraction work of 2020, which had targeted GPT-2, to far larger and, critically, aligned production systems.

The study measured “extractable memorization”: training data an adversary can recover by querying a model without prior knowledge of its training set. Across open models (Pythia, GPT-Neo), semi-open models (Llama, Falcon), and the closed ChatGPT, the team extracted large volumes of memorized text. The headline result was a simple “divergence attack” on ChatGPT: prompting the chat model to repeat a single word, such as “poem,” forever caused it to break from its aligned chat behavior and begin emitting verbatim chunks of its training data, including personal information.
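Deciding whether a model output counts as memorized comes down to checking it for long verbatim overlaps with known text. The sketch below illustrates that idea only and is not the authors' pipeline. It assumes whitespace tokenization in place of a real tokenizer, a small in-memory reference corpus, and an n-gram set in place of an index built for web-scale data. The names build_ngram_index and memorized_spans, and the window length k, are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace splitting stands in for a real model tokenizer.
    return text.split()


def build_ngram_index(corpus: Iterable[str], k: int) -> Set[NGram]:
    """Collect every contiguous k-token window found in the reference corpus."""
    index: Set[NGram] = set()
    for doc in corpus:
        toks = tokenize(doc)
        for i in range(len(toks) - k + 1):
            index.add(tuple(toks[i:i + k]))
    return index


def memorized_spans(output: str, index: Set[NGram], k: int) -> List[str]:
    """Return each k-token window of `output` that appears verbatim in the index."""
    toks = tokenize(output)
    hits: List[str] = []
    for i in range(len(toks) - k + 1):
        window = tuple(toks[i:i + k])
        if window in index:
            hits.append(" ".join(window))
    return hits


if __name__ == "__main__":
    reference = ["the quick brown fox jumps over the lazy dog"]
    idx = build_ngram_index(reference, k=4)
    # A toy output that starts with repetition and then drifts into reference text.
    sample = "poem poem poem quick brown fox jumps over and then nothing"
    for span in memorized_spans(sample, idx, k=4):
        print(span)
```

A small k such as 4 is used here only to keep the toy example readable. A realistic check would need a much longer window, so that common phrases are not mistaken for memorization, and a far larger reference corpus.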

The finding mattered because ChatGPT is alignment-tuned and was widely assumed to be resistant to this kind of leakage. The result showed that alignment training masks but does not eliminate underlying memorization, and that a cheap, almost trivial prompt could bypass the protective behavior. The authors disclosed the issue to OpenAI before publication.

The paper underscored a recurring theme in AI security: safety behavior layered on top of a model can often be circumvented to reach the raw capabilities and raw memorized data underneath, which has direct implications for privacy, copyright, and the deployment of models trained on sensitive corpora.

