Extracting Training Data from Large Language Models

“Extracting Training Data from Large Language Models” was submitted to arXiv on December 14, 2020 by a large team led by Nicholas Carlini and Florian Tramèr, spanning Google, Stanford, UC Berkeley, Northeastern University, OpenAI, Harvard, and Apple. It demonstrated a training data extraction attack: an adversary with only query access to a language model can recover individual training examples verbatim.

Working against GPT-2, the authors extracted hundreds of memorized sequences, including names, phone numbers, email addresses, IRC conversations, code, and 128-bit UUIDs. Crucially, many of the recovered sequences appeared in just a single document in the training corpus, showing that the model memorized them rather than learning a general pattern. The team verified that extracted strings were genuine training data by cross-checking against the original sources.
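
The single-document observation can be illustrated with a toy check. The sketch below is not the authors' tooling: the function names, the corpus, and the candidate strings are all made up for illustration, and it assumes the corpus fits in memory as a list of strings. It counts how many documents contain each candidate verbatim, so a count of 1 corresponds to the single-document case described above.

```python
def documents_containing(candidate: str, corpus: list[str]) -> list[int]:
    """Return the indices of documents that contain `candidate` verbatim (case-sensitive)."""
    return [i for i, doc in enumerate(corpus) if candidate in doc]


def document_counts(candidates: list[str], corpus: list[str]) -> dict[str, int]:
    """Map each candidate string to the number of documents it appears in."""
    return {c: len(documents_containing(c, corpus)) for c in candidates}


# Hypothetical toy corpus and model outputs, not data from the paper.
corpus = [
    "Contact the help desk at 555-0100 between nine and five.",
    "The quick brown fox jumps over the lazy dog.",
    "A pangram: the quick brown fox jumps over the lazy dog.",
]
candidates = ["555-0100", "quick brown fox", "never in corpus"]

counts = document_counts(candidates, corpus)
for text, n in counts.items():
    # A count of 0 means the string was not found; 1 means it occurs in a single document.
    print(f"{text!r}: found in {n} document(s)")
```

A string found in exactly one document is stronger evidence of memorization than a phrase repeated across many documents, which a model could reproduce simply because it is common.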

A central finding was that larger models memorize more and are therefore more vulnerable to extraction than smaller ones. Because the trend toward ever-larger models was already clear, this implied the privacy problem would grow rather than shrink with scale, an uncomfortable conclusion for the field.

The paper established that memorization in language models is a measurable privacy risk, not a theoretical worry. It set the template for later, larger studies, including the 2023 follow-up that extracted data from production systems such as ChatGPT, and it informed how organizations think about training on private or copyrighted text.
