Introduction

Developing high-performance generative AI systems, and other AI systems based on machine learning, often requires access to vast amounts of training data to improve accuracy and performance. Data scraping is one approach used to generate sufficiently large datasets. For example, crawler tools can compile web archive datasets that include both copyrighted and open-source works, and the resulting datasets can be very large (petabytes of data).

These datasets have been used by researchers, developers, and analysts to advance fields such as language processing, search engine optimization, and web analytics, among others. A challenge is that it is difficult for crawlers to distinguish copyrighted or licensed materials from open-source materials, especially because many broadly accessible data sources do not accurately identify their licensing terms and labelling accurately at scale is very challenging. Accordingly, there are ongoing concerns about the inclusion of copyrighted materials in these large datasets and their derivatives.

Scraping can directly affect creators and owners of IP-protected works, especially when conducted without consent or payment to rights holders. The OECD’s white paper contrasts data scraping with similar and related activities, such as “data mining” and “web crawling.” It refers to the OECD AI Principles and the OECD Recommendation on Enhancing Access and Sharing of Data, noting the need to balance concerns regarding privacy, data protection, and intellectual property against maximizing the benefits of data access and data sharing.

Data scraping can also be conducted by researchers using web scripts or other smaller-scale automation approaches against various data sources. Terms of service or terms of use may prohibit this behaviour, and they may operate alongside robot directives such as robots.txt files or HTTP headers that implement the Robots Exclusion Protocol. However, compliance with these prohibitions and directives is often not technologically enforced: the protocol effectively trusts crawler processes to respect the specific directives.
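To illustrate why compliance is voluntary, the following is a minimal sketch, in Python, of how a well-behaved crawler might consult robots.txt directives before fetching a page. The robots.txt content, the crawler names (ExampleAIBot, OtherBot), and the example.com URLs are hypothetical and are used for illustration only; the helper function is_allowed is likewise an assumption and does not come from the white paper. The check runs entirely on the crawler's side, so a crawler that skips it is not technically prevented from fetching the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an illustrative website.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the directives from text rather than fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "OtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/report"):
            decision = "fetch" if is_allowed(ROBOTS_TXT, agent, url) else "skip"
            print(f"{agent} -> {url}: {decision}")
```

In this sketch, the directive only takes effect because the crawler chooses to call is_allowed and act on the result, which reflects the trust-based nature of the protocol described above.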

Some data sources implement technical protection measures (TPMs) to restrict the activities of automated crawler processes, such as by technologically enforcing digital rights management (DRM) policies, but these can be challenging and expensive to implement without significantly impacting the functioning of a tool or a website. 

As an alternative to data scraping, curated collections of data are available from dataset providers, including open-source data from academic preprint servers and paid repositories (e.g., of stock images). Licences can also be obtained for copyrighted materials.

Consultations

The OECD’s white paper on IP issues in AI trained on scraped data is based on discussions by the OECD Working Party on AI Governance at its November 2023 and June 2024 meetings. These considerations are at the forefront of several ongoing disputes, in which allegations of both IP infringement and breach of contract have been raised.

Data scraping – proposed working definition

The white paper proposes a broad working definition of scraping as the automated extraction of information from third-party websites, databases, or social media platforms. The automated processes can include web scraping, web crawling, and screen scraping, among others.

There are three general characteristics of scraping noted in the white paper:

  • Automation: Data scraping typically involves using software tools or scripts designed to quickly and efficiently harvest or otherwise aggregate data with minimal human intervention.
  • Scalability: Data scraping is often used to collect or make accessible large amounts of data that would be impractical to aggregate manually. In addition, the tools and techniques employed can be scaled up to extract data from numerous sources simultaneously.
  • Lack of coordination: Data scraping is often done without coordination between the data scraper and the entity hosting the data.

Suggested policy approaches

The white paper discusses different approaches to address issues posed by data scraping. These include a code of conduct, technical tools, standard contract terms, and raising awareness. 

  • For a code of conduct, the white paper notes a number of suggested preliminary terms. These include potential principles that place expectations and commitments on AI operators (LLM developers and other users of scraped data), such as obligations around data acquisition, technical safeguards, and end user agreements and education about acceptable behaviours.
  • For technical tools, the white paper notes a disconnect between the restrictions stated in website terms of service and the actual technical measures in place, as many websites do not correctly configure their robots.txt files to reflect contractual restrictions. It proposes mechanisms such as rights management safeguards, standardized opt-out tools, and tracking mechanisms.
  • Standard contract terms are proposed as a mechanism to help address cross-border issues, as terms and their meanings can vary between jurisdictions. Standard terminology can help accommodate different needs and bargaining positions, setting an optional starting point while allowing organizations to negotiate bespoke arrangements when appropriate. A Statement on AI Training, signed by over 6,500 creators, is highlighted in respect of the importance of standard terms.
  • Finally, the white paper also describes stakeholder awareness and education, including user directions on how to avoid prompts that are likely to circumvent technical safeguards or violate rights.
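The disconnect between terms of service and robots.txt configuration noted under technical tools can be illustrated with a short sketch. It assumes a hypothetical website whose terms of service prohibit all automated scraping, and it checks which crawlers the site's robots.txt file would nonetheless permit. The function robots_gaps, the crawler names, and the probe URL are illustrative assumptions, not tools or identifiers described in the white paper.

```python
from urllib.robotparser import RobotFileParser


def robots_gaps(robots_txt, crawler_agents, probe_url="https://example.com/"):
    """Return the crawler user agents that robots.txt still permits to fetch probe_url.

    Assumes the site's terms of service prohibit scraping by every listed
    crawler, so any agent returned marks a gap between the contractual
    restriction and the technical directive.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in crawler_agents if parser.can_fetch(agent, probe_url)]


# Hypothetical robots.txt that blocks only one named crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /
"""

if __name__ == "__main__":
    gaps = robots_gaps(ROBOTS_TXT, ["ExampleAIBot", "OtherAIBot"])
    print("Crawlers not covered by robots.txt:", gaps)
```

A check of this kind only reports whether the directives match the stated restriction; it does not address whether a given crawler would honour them.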

 


For more information, please contact your IP professional at Norton Rose Fulbright Canada LLP.
