Infringement risk relating to training a generative AI system

Global | Publication | July 2024

Generative AI systems are trained using vast amounts of data, often taken from publicly accessible sources, that may be protected by copyright or other intellectual property rights, such as, in the UK and EU, a database right.

Could training a generative AI system using publicly accessible copyright work constitute an infringement?

Where the system is trained using a copyright-protected work without the copyright owner’s consent, and assuming that the training involves an act of copying¹ of the whole or a substantial part of the work, this would in many jurisdictions be an infringement, unless a relevant defense or exception applies.

Case study: Germany

Whether a defense or exception applies depends on the jurisdiction in which the training occurs. For example, in Germany:

Under German copyright law, there are several exceptions upon which Providers can rely when using publicly available copyright works to train their generative AI system: Section 44b of the German Copyright Act permits reproductions of works that are necessary for the automated analysis of these works, unless the author has expressly reserved this type of use. There is, therefore, legal permission to collect works on a large scale and to create a training corpus from them. The exception also requires that the work has been lawfully accessed and is deleted once the training is completed and storage is no longer necessary.
If the author of the copyright works has expressly reserved its rights in the way described above, the Provider may rely on Section 44a of the German Copyright Act. This provision permits temporary reproductions of works that have no individual economic value; data mining is expressly named as an act falling under this exception. However, AI training can only be based on this exception if no training corpus is created and the AI instead learns from works held in working memory, which are deleted immediately afterward.
Finally, Section 60d of the German Copyright Act allows privileged research organizations to reproduce works even if the author of the work has opted out.

Do relevant defenses/exceptions exist (assuming the system is used for commercial purposes)?

Australia

Not likely. The Copyright Act 1968 (Cth) (CA 1968) ‘fair dealing’ defenses for copyright infringement include the following dealings: research or study, criticism and review, news reporting, or parody and satire.² However, the focus in Australia is also on what is considered to be ‘fair,’ and commercial objectives being the driving force behind the infringement are typically not considered to be fair. In addition, these defenses are rarely applied in Australia and are generally considered narrower than in other common law jurisdictions.

Canada

There is no text and data mining (TDM) exception under the Canadian Copyright Act (R.S.C., 1985, c. C-42), but two general exceptions may apply to training a generative AI system: (1) the temporary reproduction for technological processes exception; and (2) the fair dealing exception.

To qualify for the temporary reproduction for technological processes exception, three requirements must be met:³

The reproduction must form an essential part of the technological process.
The reproduction should only be used to facilitate a use that is not an infringement of copyright.
The reproduction must exist only for the duration of the technological process.

A generative AI program that processes large datasets may need to make temporary reproductions of copyright material that are essential to its technological process. If reproductions are temporary and only exist for the duration of the dataset analysis, they may be covered by the exception of temporary reproduction for technological processes.

Similar to the US ‘fair use’ exception, the Canadian Copyright Act has a fair dealing exception. This allows use of copyright works for the purpose of research, private study, education, satire, parody, criticism, review or news reporting, provided that the use of the work is ‘fair’.⁴

If the purpose of use is for criticism, review or news reporting, then the source and author of the work must be cited.

Whether something is ‘fair’ will depend on the circumstances, and several factors will be considered in the analysis:⁵

The purpose of the dealing (Is it commercial or research/educational?)
The character of the dealing (What was done with the work? Was it an isolated use or ongoing, repetitive use? How widely was it distributed?)
The amount of the dealing (How much was copied?)
Alternatives to the dealing (Was the work necessary for the result? Could a different work have been used instead?)
The nature of the work (Is there a public interest in its dissemination? Was it previously unpublished?)
The effect of the dealing on the original work (Does the use compete with the market of the original work?)

When considering use in training generative AI systems, ‘research’ may be a relevant fair dealing purpose. The Supreme Court of Canada has held that ‘research is not limited to non-commercial or private contexts’ and should otherwise be afforded a liberal interpretation.⁶

One Supreme Court case,⁷ for example, found that listening to 30- to 90-second music previews to determine a user’s musical preferences constituted research for the purposes of the fair dealing exception. However, a Canadian court has not considered whether training generative AI systems using copyrighted material is within the scope of the ‘research’ exception.

China

No. There is currently no TDM exception in PRC copyright law. Generally, the relevant exception to an infringement claim in PRC law applies only to non-commercial use for personal study, research or appreciation, or to copying a small quantity for teaching or scientific research purposes. It will not apply to a system used for substantial or large-scale commercial purposes.

EU

Yes. TDM (that is, reproduction and extraction) of lawfully accessible works is permitted for any purpose provided that the rights holder has not ‘expressly reserved’ its rights in an appropriate manner, which may be in a machine-readable way where the content is available online.⁸
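As a rough illustration of what a ‘machine-readable’ reservation check might look like in practice, the sketch below tests whether a site’s robots.txt disallows a given crawler. It rests on assumptions that are not drawn from this publication: that robots.txt is the signal being used (it is only one of several mechanisms that have been discussed, and whether it amounts to a valid reservation under Article 4 is a legal question this code does not answer), and that the function name and the crawler name ‘ExampleTrainingBot’ are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawler_is_disallowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text disallows `user_agent` from fetching `url`.

    Assumption: a robots.txt 'Disallow' rule is treated here as a possible
    machine-readable opt-out signal. This is illustrative only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named crawler is blocked, all others are allowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(crawler_is_disallowed(EXAMPLE_ROBOTS_TXT, "ExampleTrainingBot", page))
    print(crawler_is_disallowed(EXAMPLE_ROBOTS_TXT, "OtherBot", page))
```

A Provider assembling a training corpus could run a check of this kind before each fetch and keep a record of the result, although a reservation may also be expressed in other ways (for example, in website terms or metadata) that such a check would not detect.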

For information on the regulation of AI in the EU, see our blog, The EU AI Act – the countdown begins.

France

Same as EU position.⁹

Germany

Same as EU position.

Hong Kong

No. There is no TDM exception under the Hong Kong Copyright Ordinance, and the use in training of a generative AI system is unlikely to fall within any of the fair dealing exceptions under that Ordinance.

The Netherlands

Same as EU position.

Singapore

Yes. There is a statutory exception permitting the copying of copyright works for the purpose of ‘computational data analysis’, which includes:

Using a computer program to identify, extract and analyze information or data from the work or recording; and
Using the work or recording as an example of a type of information or data to improve the functioning of a computer program relating to that type of information or data.¹⁰

The Intellectual Property Office of Singapore has clarified that ‘computational data analysis’ includes sentiment analysis, TDM and training machine learning.¹¹

However, the exception is subject to certain conditions and safeguards to protect the commercial interests of copyright owners:

The user cannot share copies of the works with others, except for verifying the results of the computational data analysis or for collaborative research or study relating to the purpose of such analysis.
The user must not use copies of the works made under this exception for any other purpose.
The user must have lawful access to the works to be copied.
The work from which copies are made must not itself be an infringing copy (unless the use of infringing copies is necessary for a prescribed analysis) or, if it is an infringing copy: (i) the user must not know this; and (ii) if that copy was obtained from a flagrantly infringing online location, the user must not know (or reasonably have known) that.

While this statutory exception allows a generative AI system to be trained without infringing copyright (as long as the above conditions are met), there is still a risk that the Output of the generative AI system will infringe copyright.

For more information on:

TDM in Singapore, see our blog, New Singapore Copyright Exception will propel AI revolution.
AI governance in Singapore, see our blog, Singapore proposes Governance Framework for Generative AI.

South Africa

No. Provided that the use constitutes a substantive reproduction or adaptation of the original work (and authorship and ownership of that work can be proven), there would be no defense to copyright infringement in such circumstances.

UK

No. A statutory exception for TDM exists, but is only available for non-commercial research purposes.¹²

USA

The fair use doctrine may apply to protect the challenged activity; however, it has not been tested yet, and it is not clear to what extent it would apply.

It is highly likely that the training process will involve the reproduction of entire works or substantial portions. OpenAI, for example, acknowledges that its programs are trained on large, publicly available datasets that include copyright works, and that copies of such works are made as part of the process. The copying of copyright works without consent (express or implied) from the copyright owner may result in liability for copyright infringement.

It is expected that AI companies will argue that their training processes constitute fair use and, therefore, do not infringe any work copied. Whether or not copying constitutes fair use depends on four statutory factors under 17 U.S.C. § 107:

The purpose and character of the use, including whether such use is commercial or is for nonprofit educational purposes.
The nature of the copyright work.
The amount and substantiality of the portion used in relation to the copyright work as a whole.
The effect of the use upon the potential market for or value of the copyright work.

AI advocates are likely to argue that consideration of these factors requires a conclusion of fair use. For example, under the first factor, AI companies may argue that their purpose is ‘transformative’ because the training process creates a useful generative AI system, rather than an expressive work.

Under the third factor, they may note that the copies are not made available to the public but are used only to train the program, an argument that a court may weigh in favor of a fair use conclusion.

In contrast, some generative AI applications have raised concern that training AI programs on copyright works allows them to generate works that compete with the original works. Such evidence would be considered under the fourth fair use factor and would likely weigh against a conclusion of fair use.

For information on the regulation of AI in the US, see our blog, President Biden issues sweeping artificial intelligence directives targeting safety, security and trust.

Footnotes

¹ A computer scientist’s view of training is that it does not strictly involve the creation of a copy of the training data per se. Rather, the training data is transformed into a mathematical model that, in the case of a written source, converts the words into tokens and ‘learns’ the correlations between tokens. Nevertheless, the assumption is that the model can reproduce the training data (for example, ChatGPT can quote verbatim certain texts on which it has apparently been trained) and, if so, it may ultimately not matter to the copyright analysis in what form the data is stored within the model.

² Ss 40-42 CA 1968.

³ s. 30.71 of the Canadian Copyright Act.

⁴ s. 29 of the Canadian Copyright Act.

⁵ CCH Canadian Ltd. v. Law Society of Upper Canada, 2004 SCC 13.

⁶ CCH Canadian Ltd. v. Law Society of Upper Canada, 2004 SCC 13.

⁷ Society of Composers, Authors and Music Publishers of Canada v. Bell Canada, 2012 SCC 36.

⁸ Article 4 of the Digital Copyright Directive.

⁹ The exception provided under Article 4.3 of the EU Digital Copyright Directive is reflected in Article L122-5-3 III of the French Intellectual Property Code.
