Memorization and Regurgitation as Infringement

Large language models learn by adjusting billions of parameters to predict text, and in the process they sometimes store specific passages closely enough to reproduce them word for word. This phenomenon, called memorization, becomes regurgitation when a model outputs those memorized passages in response to a prompt. Regurgitation sits at the heart of the copyright debate over generative AI: if a model can be coaxed into emitting verbatim copies of copyrighted articles, books, or code, then the output, not just the training, may itself be infringing, and the developer’s fair-use defense weakens considerably.

The issue became concrete in the New York Times v. OpenAI lawsuit, where the Times included an exhibit of roughly one hundred examples in which GPT-4, given the opening of a Times article, continued it with long stretches of near-identical text. OpenAI countered that the Times had “intentionally manipulated prompts” to force this behavior, framing regurgitation as an edge case rather than normal operation. Academic research analyzing the lawsuit has clarified the dynamics: it found that memorization capacity grows sharply with model size, that systems above roughly 100 billion parameters memorize substantially more than smaller ones, and that frontier providers increasingly deploy refusal training and output filters specifically to block verbatim reproduction. The legal stakes track the technical ones: the more readily a model can be pushed to regurgitate, the stronger the infringement argument; the more it can be shown to learn abstract patterns rather than store copies, the stronger the transformative-use defense.
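The kind of test described above (feed a model the opening of an article, then compare its continuation with the real text) can be approximated with a simple overlap measure. The following is a minimal sketch, not the method used in the lawsuit or in any cited study: the function names and the 50-word threshold are illustrative assumptions, and the model outputs are assumed to have been collected separately.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, output: str) -> int:
    """Length, in words, of the longest word sequence found in both texts."""
    ref_words = reference.lower().split()
    out_words = output.lower().split()
    # autojunk=False disables difflib's heuristic that ignores very common items.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(out_words))
    return match.size


def regurgitation_rate(pairs, min_run=50):
    """Fraction of (reference, output) pairs sharing a run of at least min_run words.

    min_run=50 is an arbitrary illustrative threshold, not a legal standard.
    """
    if not pairs:
        return 0.0
    flagged = sum(1 for ref, out in pairs if longest_shared_run(ref, out) >= min_run)
    return flagged / len(pairs)
```

A higher rate across a sample of prompts suggests the model reproduces stored text rather than paraphrasing it; what threshold counts as legally significant is a question this sketch does not answer.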

Why business readers should care: memorization is both a legal and a technical risk. Deploying a model that can be induced to reproduce protected content exposes a company to infringement claims, which is why output filtering and careful evaluation of regurgitation are becoming standard parts of responsible AI deployment.
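As a rough illustration of what output filtering can look like, the sketch below blocks any response that shares a sequence of n consecutive words with a set of protected texts. It is a simplified assumption about how such a filter might work, not a description of any provider's actual system; the class name `VerbatimFilter` and the default of 8 words are hypothetical.

```python
def word_ngrams(text, n):
    """Set of all n-word sequences in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


class VerbatimFilter:
    """Hypothetical filter: flags output that repeats n consecutive protected words."""

    def __init__(self, protected_texts, n=8):
        self.n = n  # illustrative default, not an industry standard
        self.index = set()
        for text in protected_texts:
            self.index |= word_ngrams(text, n)

    def is_blocked(self, output):
        # Blocked if any n-gram of the output also appears in a protected text.
        return not self.index.isdisjoint(word_ngrams(output, self.n))


# Usage with a toy protected corpus and a short window for demonstration.
guard = VerbatimFilter(["one two three four five six"], n=4)
print(guard.is_blocked("zero one two three four nine"))
print(guard.is_blocked("one two three nine four five"))
```

Exact matching of this kind misses lightly paraphrased copies, so a production filter would likely need fuzzier comparison; the sketch only shows the basic idea.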
