OpenAI's GPTBot and the robots.txt fight over training data

In August 2023, OpenAI introduced GPTBot, a dedicated web crawler that gathers content for training its models, and, crucially, published a way for websites to opt out. OpenAI’s documentation states plainly that “GPTBot is used to crawl content that may be used in training our generative AI foundation models,” and that disallowing it “signals that a site’s content should not be used in training generative AI foundation models.” The crawler identifies itself with a specific user-agent string (GPTBot, with a link to openai.com/gptbot), and OpenAI also publishes the IP ranges it crawls from.
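A site operator could combine those two signals, the user-agent string and the published IP ranges, to tell a genuine GPTBot request from an impostor. The sketch below is illustrative only: the function name, the sample user-agent string, and the CIDR ranges are assumptions (the ranges are reserved documentation addresses, not OpenAI's actual ranges, which would need to be taken from OpenAI's own published list).

```python
import ipaddress

# Placeholder CIDR blocks reserved for documentation (RFC 5737).
# These are NOT OpenAI's real ranges; substitute the published list.
HYPOTHETICAL_GPTBOT_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]


def looks_like_gptbot(user_agent, remote_ip, ranges=HYPOTHETICAL_GPTBOT_RANGES):
    """Return True only if the request claims to be GPTBot AND
    originates from one of the supplied IP ranges."""
    # Anyone can send a "GPTBot" user-agent, so the string alone proves nothing.
    if "gptbot" not in user_agent.lower():
        return False
    address = ipaddress.ip_address(remote_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)


# Illustrative user-agent string; the exact format is an assumption.
SAMPLE_UA = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

claimed_and_in_range = looks_like_gptbot(SAMPLE_UA, "192.0.2.10")
claimed_but_outside = looks_like_gptbot(SAMPLE_UA, "203.0.113.5")
different_crawler = looks_like_gptbot("SomeOtherBot/2.0", "192.0.2.10")
```

Checking both signals matters because a user-agent string is trivially spoofed, while the source address of a completed connection is much harder to fake.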

The opt-out mechanism is the decades-old robots.txt convention. A site adds two lines:

User-agent: GPTBot
Disallow: /

That instruction is voluntary: robots.txt is an honor system, not a technical block. OpenAI, however, committed to honoring it. Within weeks of the announcement, a large share of major news sites, including The New York Times, Reuters, and CNN, had added GPTBot to their robots.txt to keep their archives out of future training runs.
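To see how a compliant crawler would read the two-line rule, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URL are assumptions made for illustration. It shows what a well-behaved crawler would conclude, not how OpenAI's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# The same two-line opt-out shown above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL on a site that has opted out of GPTBot crawling.
ARTICLE_URL = "https://example.com/archive/story.html"

gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", ARTICLE_URL)
search_bot_allowed = is_allowed(ROBOTS_TXT, "Googlebot", ARTICLE_URL)
print(gptbot_allowed, search_bot_allowed)  # prints: False True
```

The rule names only GPTBot, so other crawlers, including search-engine bots, remain unaffected. This is what lets a publisher refuse training crawls while staying indexed for search.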

The episode crystallized a shift. For most of the web’s history, robots.txt governed search-engine crawling, where being indexed brought traffic back to the publisher. AI training crawling offers no such return, so publishers began treating “may I train on your content?” as a separate decision, and increasingly answered no.

Why business readers should care: GPTBot turned training-data access into an explicit, site-by-site negotiation. If you run a content business, whether and how you allow AI crawlers is now a real strategic choice, one tied to licensing revenue, competitive risk, and the broader copyright disputes over scraped data.

