Infringement risk relating to training a generative AI system
Global | Publication | July 2024
Generative AI systems are trained on vast amounts of data, often taken from publicly accessible sources. That data may be protected by copyright or other intellectual property rights, such as a database right in the UK and EU.
Could training a generative AI system using publicly accessible copyright work constitute an infringement?
Where the system is trained using a copyright-protected work without the copyright owner’s consent, and assuming that the training involves an act of copying1 of the whole or a substantial part of the work, this would in many jurisdictions be an infringement, unless a relevant defense or exception applies.
Whether a defense or exception applies depends on the jurisdiction in which the training occurs. For example:
Do relevant defenses/exceptions exist (assuming the system is used for commercial purposes)?
Australia
Not likely. The ‘fair dealing’ defenses to copyright infringement under the Copyright Act 1968 (Cth) (CA 1968) cover dealings for the purposes of research or study, criticism or review, news reporting, and parody or satire.2 However, the analysis in Australia also turns on whether the dealing is ‘fair,’ and a dealing driven by commercial objectives is typically not considered fair. In addition, these defenses are rarely applied in Australia and are generally considered narrower than their equivalents in other common law jurisdictions.
Canada
There is no text and data mining (TDM) exception under the Canadian Copyright Act (R.S.C., 1985, c. C-42), but two general exceptions may apply to training a generative AI system: (1) the temporary reproduction for technological processes exception; and (2) the fair dealing exception.
To qualify for the temporary reproduction for technological processes exception, three requirements must be met:3
A generative AI program that processes large datasets may need to make temporary reproductions of copyright material that are essential to its technological process. If reproductions are temporary and only exist for the duration of the dataset analysis, they may be covered by the exception of temporary reproduction for technological processes.
The Canadian Copyright Act contains a fair dealing exception, comparable to the US ‘fair use’ doctrine. It allows the use of copyright works for the purpose of research, private study, education, satire, parody, criticism, review or news reporting, provided that the use of the work is ‘fair’.4
If the use is for criticism, review or news reporting, then the source and author of the work must be cited. Whether a use is ‘fair’ will depend on the circumstances, and several factors will be considered in the analysis:5
When considering use in training generative AI systems, ‘research’ may be the relevant fair dealing purpose. The Supreme Court of Canada has held that ‘research is not limited to non-commercial or private contexts’ and should otherwise be given a liberal interpretation.6 One Supreme Court case,7 for example, found that listening to 30- to 90-second music previews to determine a user’s musical preferences constituted research for the purposes of the fair dealing exception. However, no Canadian court has yet considered whether training generative AI systems using copyrighted material falls within the scope of the ‘research’ exception.
China
No. There is currently no TDM exception in PRC copyright law. The relevant exceptions to an infringement claim generally apply only to non-commercial use for personal study, research or appreciation, or to copying a small quantity of a work for teaching or scientific research purposes. They will not apply to a system used for large-scale commercial purposes.
EU
Yes. TDM (that is, reproduction and extraction) of lawfully accessible works is permitted for any purpose provided that the rights holder has not ‘expressly reserved’ its rights in an appropriate manner, which may be in a machine-readable way where the content is available online.8 For information on the regulation of AI in the EU, see our blog, The EU AI Act – the countdown begins.
France
Same as EU position.9
Germany
Same as EU position.
Hong Kong
No. There is no TDM exception under the Hong Kong Copyright Ordinance, and the use of copyright works to train a generative AI system is unlikely to fall within any of the fair dealing exceptions under that Ordinance.
The Netherlands
Same as EU position.
Singapore
Yes. There is a statutory exception permitting the copying of copyright works for the purpose of ‘computational data analysis’, which includes:
The Intellectual Property Office of Singapore has clarified that ‘computational data analysis’ includes sentiment analysis, TDM and training machine learning.11 However, the exception is subject to certain conditions and safeguards to protect the commercial interests of copyright owners:
While this statutory exception allows a generative AI system to be trained without infringing copyright (as long as the above conditions are met), there is still a risk that the output of the generative AI system will infringe copyright.
South Africa
No. Provided that the use constitutes a substantive reproduction or adaptation of the original work (and authorship and ownership of that work can be proven), there would be no defense to copyright infringement.
UK
No. A statutory exception for TDM exists, but is only available for non-commercial research purposes.12
USA
The fair use doctrine may apply to protect the challenged activity; however, it has not yet been tested, and it is not clear to what extent it would apply. It is highly likely that the training process will involve the reproduction of entire works or substantial portions of them. OpenAI, for example, acknowledges that its programs are trained on large, publicly available datasets that include copyright works, and that copies of such works are made as part of the process. The copying of copyright works without consent (express or implied) from the copyright owner may result in liability for copyright infringement. It is expected that AI companies will argue that their training processes constitute fair use and, therefore, do not infringe any work copied. Whether or not copying constitutes fair use depends on four statutory factors under 17 U.S.C. § 107: (1) the purpose and character of the use, including whether it is of a commercial nature or for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use on the potential market for or value of the copyrighted work.
AI advocates are likely to argue that these factors support a finding of fair use. For example, under the first factor, AI companies may argue that their purpose is ‘transformative’ because the training process creates a useful generative AI system, rather than an expressive work. Under the third factor, they may note that the copies are not made available to the public but are used only to train the program, an argument that a court may weigh in favor of fair use. In contrast, some generative AI applications have raised concern that training AI programs on copyright works allows them to generate works that compete with the original works. Such evidence would be considered under the fourth fair use factor and would likely weigh against a conclusion of fair use. For information on the regulation of AI in the US, see our blog, President Biden issues sweeping artificial intelligence directives targeting safety, security and trust.
Footnotes
A computer scientist’s view of training is that it does not strictly involve the creation of a copy of the training data per se. Rather, the training data is transformed into a mathematical model that, in the case of a written source, converts the words into tokens and ‘learns’ the correlations between tokens. Nevertheless, the assumption is that the model can reproduce the training data (and for example ChatGPT can quote verbatim certain texts that it has apparently been trained on) and, if so, it may ultimately not matter to the copyright analysis in what form the data is stored within the model.
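The point made in this footnote can be illustrated with a deliberately simplified sketch. It assumes a toy whitespace tokenizer and a table of token-to-next-token counts in place of a real neural network; the function names (`tokenize`, `train`, `generate`) and the sample sentence are hypothetical and are not drawn from any actual generative AI system. The 'model' below stores only correlations between tokens, not the training text itself, yet it can still regenerate that text word for word.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    # Production systems typically use subword tokenizers instead.
    return text.lower().split()


def train(tokens):
    # "Learn" correlations: count how often each token is followed
    # by each other token. No copy of the text as such is stored.
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def generate(model, start, max_tokens=50):
    # Repeatedly emit the most frequent successor of the last token,
    # stopping when the last token has no recorded successor.
    out = [start]
    while len(out) < max_tokens and out[-1] in model:
        nxt, _count = model[out[-1]].most_common(1)[0]
        out.append(nxt)
    return " ".join(out)


# Hypothetical training text in which every token appears only once.
training_text = "the quick brown fox jumps over a lazy dog"
model = train(tokenize(training_text))
reproduced = generate(model, "the")
```

Here `reproduced` is identical to `training_text`, even though `model` holds nothing but successor counts. Real systems are vastly larger and statistical rather than deterministic, so this is only an analogy for why the internal form in which the data is stored may not be decisive.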