Stealing Part of a Production Language Model

“Stealing Part of a Production Language Model” was submitted to arXiv on March 11, 2024. Its authors are Nicholas Carlini, Daniel Paleka, Krishnamurthy (Dj) Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tramèr, a team spanning Google DeepMind, ETH Zurich, the University of Washington, OpenAI, and McGill University. The paper showed that even fully closed, commercial language models leak some of their internal structure through their ordinary APIs.

The attack targets the final layer of a transformer: the embedding projection matrix, which maps the model’s hidden state to a logit for every token in the vocabulary. Because the hidden state is much smaller than the vocabulary, every logit vector the model can produce lies in a subspace whose dimension equals the hidden size. By querying the API and analyzing the returned probability information, the authors recovered the hidden dimension of several production models and, for smaller ones, the full projection matrix (up to symmetries). They reported extracting that matrix from OpenAI’s Ada and Babbage models for under 20 US dollars, revealing hidden dimensions of 1024 and 2048 respectively. They also recovered the hidden dimension of gpt-3.5-turbo and estimated that extracting its full projection matrix would cost under 2,000 dollars. The work was done in coordination with the affected providers, who adjusted their APIs in response.
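The core observation behind the hidden-dimension recovery can be illustrated on a toy model. Logits are a linear function of the hidden state, so a matrix of logit vectors collected from many queries has rank at most the hidden size, and a singular value decomposition reveals that rank. The sketch below is a simulation under stated assumptions, not the authors' code: the projection matrix and hidden states are random, no real API is queried, the sizes are arbitrary, and the function name `estimate_hidden_dim` and its `rel_tol` threshold are hypothetical choices made for illustration (a simple relative threshold stands in for inspecting the drop in singular values).

```python
import numpy as np


def estimate_hidden_dim(logits: np.ndarray, rel_tol: float = 1e-8) -> int:
    """Estimate the numerical rank of a (num_queries x vocab_size) logit matrix.

    Singular values below rel_tol times the largest one are treated as
    numerical noise. rel_tol is an illustrative assumption, not a value
    taken from the paper.
    """
    singular_values = np.linalg.svd(logits, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


# Toy stand-in for a language model's final layer (all sizes are assumptions).
rng = np.random.default_rng(0)
hidden_dim, vocab_size, num_queries = 64, 500, 128

# W plays the role of the projection matrix; each row of H is the hidden
# state produced by one query. More queries than hidden_dim are needed.
W = rng.normal(size=(vocab_size, hidden_dim))
H = rng.normal(size=(num_queries, hidden_dim))

# Each row is the full logit vector an idealized API would return.
logits = H @ W.T

recovered = estimate_hidden_dim(logits)
print(f"true hidden dim: {hidden_dim}, recovered: {recovered}")
```

Against a real API the rows of `logits` are not handed over directly. They would have to be reconstructed from whatever the API returns, and that step is not shown here.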

This is a narrow, surgical form of model extraction. It recovers one layer rather than the whole model, but it shows that the boundary between “closed” and “open” models is softer than commonly assumed, and it provides a method that could potentially be extended to recover more.

For a business reader, the result is a caution about what an API reveals. Exposing rich outputs such as full probability distributions can leak proprietary details about the model behind the API, so the design of what an API returns is itself a security decision.

Last verified June 7, 2026