The Text-and-Data-Mining Exception

Text and data mining (TDM) is the automated analysis of large volumes of text, images, or other content to detect patterns, trends, and correlations, which is precisely the process by which AI training corpora are assembled and consumed. Because mining almost always involves making copies of the underlying works, it would generally infringe copyright unless a license or an exception applies. While the United States relies on the flexible, case-by-case fair-use doctrine, a number of other jurisdictions have enacted specific statutory TDM exceptions, and these have become the legal backbone of AI training outside the US.

The European Union’s approach, set out in Articles 3 and 4 of the 2019 Directive on Copyright in the Digital Single Market, is the most influential. Article 3 gives research organizations and cultural heritage institutions a mandatory exception, one that cannot be overridden by contract, to mine lawfully accessible works for scientific research. Article 4 extends a broader exception to everyone, including commercial AI developers, but with an opt-out: rightsholders may “expressly reserve” their works from mining, and for content made publicly available online that reservation must be expressed by machine-readable means. The practical effect of Article 4 is a default-yes-unless-you-object regime, which puts the burden on creators to signal refusal. Article 30-4 of Japan’s Copyright Act goes further still, broadly permitting use of works for “non-enjoyment” purposes like data analysis with no opt-out, subject only to a proviso against unreasonably harming rightsholders. The United Kingdom, Singapore, and others have their own variants, and the UK’s narrow exception has been the subject of contentious reform debate.
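To make the idea of a machine-readable reservation concrete, the sketch below shows how a crawler might check one commonly used signal, a site's robots.txt file, before mining a page. This is an illustration under stated assumptions, not a statement of what Article 4 requires: the Directive does not name a particular format, and whether a given signal counts as a valid reservation is a legal question. The crawler name ExampleAIBot, the function is_mining_reserved, and the sample file are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_mining_reserved(robots_txt: str, crawler_name: str, url: str) -> bool:
    """Return True if the robots.txt text disallows this crawler for this URL.

    Assumption: a robots.txt disallow rule is treated here as a stand-in
    for a machine-readable opt-out. Real compliance may need to honor
    other signals as well.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(crawler_name, url)


# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

page = "https://example.com/articles/1"
print(is_mining_reserved(EXAMPLE_ROBOTS, "ExampleAIBot", page))  # True
print(is_mining_reserved(EXAMPLE_ROBOTS, "OtherBot", page))      # False
```

The point for a business reader is that the opt-out is something software can test for automatically, so a training pipeline operating under the EU rule would plausibly need a check of this kind at collection time and a record of the result.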

Why business readers should care: the legal right to train on copyrighted data is not uniform across borders. The EU opt-out, Japan’s permissive rule, and US fair use each impose different obligations, which makes the choice of where data is mined and where a model is trained a genuine compliance and strategy decision.