Original source: Tencent Technology
Image source: Generated by Unbounded AI
On June 28, 2023, the first representative copyright infringement lawsuit over ChatGPT came into public view. Two writers filed a copyright class action against OpenAI in the U.S. District Court for the Northern District of California, accusing the company of using their copyrighted books to train ChatGPT without authorization and for commercial gain.
The plaintiffs, Paul Tremblay and Mona Awad, live in Massachusetts and own the copyrights in the works at issue: Tremblay in “The Cabin at the End of the World”, and Awad in “13 Ways of Looking at a Fat Girl” and “Bunny”. The defendant, OpenAI, created and operates the generative artificial intelligence product ChatGPT, which is currently driven mainly by two underlying large language models, GPT-3.5 and GPT-4.
The complaint points out that although the plaintiffs never authorized OpenAI to use their copyrighted books for model training, ChatGPT could output summaries of the books when prompted, which could only happen if the defendant had included the books in its training corpus.
The plaintiffs state that much of the content in OpenAI’s training dataset consists of copyrighted works, including books in which the plaintiffs hold copyright. However, OpenAI neither obtained the plaintiffs’ consent, nor credited the source of the content, nor paid the necessary fees. The books published by the plaintiffs carry clear copyright management information, including the publication number, copyright number, name of the copyright owner, and terms of use.
**From the facts and information available, the plaintiffs infer that the only plausible explanation for ChatGPT’s ability to accurately summarize a specific book is that OpenAI obtained and copied the book and used it to train its large language models (GPT-3.5 or GPT-4).**
The plaintiffs’ tests found that when ChatGPT was prompted to summarize the books at issue, it generated fairly accurate summaries (although with a small amount of incorrect content). According to the plaintiffs, this shows that ChatGPT retains the content of specific works from its training dataset and can output corresponding text. At the same time, because of the way a large language model generates content, ChatGPT’s output does not contain the original copyright management information.
**An interesting feature of this case is that, in setting out to prove OpenAI’s infringement, the plaintiffs based their introduction to ChatGPT’s basic principles on a dialogue with ChatGPT in which they asked it to “introduce itself”. The content is summarized as follows.**
OpenAI has released a series of large language models, including GPT-1 (June 2018), GPT-2 (February 2019), GPT-3 (May 2020), GPT-3.5 (March 2022), and the latest, GPT-4 (March 2023). Generally speaking, artificial intelligence software aims to use statistical methods to simulate human logic and reasoning through algorithms. A large language model is a specialized type of artificial intelligence software used to parse and output natural language.
**On the one hand, OpenAI offers ChatGPT to users through a web page at a price of $20 per month.** Users can choose between two versions of ChatGPT, the GPT-3.5 model or the newer GPT-4 model. **On the other hand, ChatGPT is also offered to software developers in the form of an API.** The API allows developers to write programs that exchange data with ChatGPT; in this case, access is billed according to usage.
**Whether the service is provided through the web page or the API, ChatGPT responds to the user’s requests.** If the user asks ChatGPT a question, it gives an answer; if the user gives ChatGPT an instruction, it carries it out; if the user asks ChatGPT to summarize a book, it does that too.
The plaintiffs’ position is that, unlike traditional software, which is written by engineers, a large language model is developed through “training”: massive corpora of content are collected from different sources and “fed” to the model. This material is known as the training dataset.
During training, the large language model continually adjusts its output to match as closely as possible the sequences of text in the works it is trained on. **It is worth noting that although many kinds of content are used to train large language models, books have always been core material in training datasets because they provide the best examples of high-quality long-form writing.**
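The idea of a model adjusting its output to match the word sequences in its training text can be illustrated with a deliberately tiny sketch. This is not how GPT models are trained (they use neural networks, not word counts); it is a toy bigram counter, and the function names and the sample sentence are invented for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(model: dict, word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training text, made up for this example.
corpus = "the model reads the book and the model learns the pattern"
model = train_bigram(corpus)

print(predict_next(model, "the"))    # the word that most often followed "the"
print(predict_next(model, "zebra"))  # None: the word never appeared in training
```

Even in this toy, the output depends entirely on what was in the training text, which is the intuition behind the plaintiffs’ argument about summaries.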
In the paper “Improving Language Understanding by Generative Pre-Training”, published in June 2018, OpenAI disclosed that the training of GPT-1 relied on the “BookCorpus” dataset. “BookCorpus” contains about 7,000 books in genres such as adventure, fantasy, and romance. **OpenAI pointed out that books are particularly important as a training corpus because they contain long stretches of continuous text, which allows generative models to learn how to process long-range information.**
**Many artificial intelligence developers, including OpenAI, Google, and Amazon, have used “BookCorpus” for model training.** An artificial intelligence research team created the dataset in 2015 from books on the Smashwords.com website, but did not obtain authorization from the copyright owners when including these books.
By citing information that OpenAI itself disclosed in its published papers, the plaintiffs hope to demonstrate that the GPT series of models was trained on the unauthorized use of massive amounts of book content. **In the paper “Language Models are Few-Shot Learners”, published in July 2020, OpenAI disclosed that 15% of the GPT-3 training dataset came from two internet-based book corpora named “Books1” and “Books2”.**
Although OpenAI did not explain what “Books1” and “Books2” contain, some inferences can be drawn from the available clues: first, both corpora come from the internet; second, both are significantly larger than “BookCorpus”. According to OpenAI’s disclosures, “Books1” is about 9 times the size of “BookCorpus” (about 63,000 books) and “Books2” is about 42 times its size (about 294,000 books).
**In reality, only a very small number of databases can provide a book corpus on this scale. On the one hand, “Books1” probably comes from Project Gutenberg or the “Standardized Project Gutenberg Corpus”.** Project Gutenberg is an online library of e-books whose copyright terms have expired. In September 2020, Project Gutenberg announced that it had included more than 60,000 books. Because its books are no longer protected by copyright, Project Gutenberg has been widely used for training artificial intelligence models. In 2018, an artificial intelligence research team built the “Standardized Project Gutenberg Corpus” of more than 50,000 books from Project Gutenberg.
**On the other hand, “Books2” most likely derives from “shadow libraries” on the internet.** The “Books2” dataset contains approximately 294,000 books, and only the much-criticized shadow libraries, such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik, can provide a book corpus on this scale. The term “shadow library” was coined by the U.S. Social Science Research Council in its 2011 report “Media Piracy in Emerging Economies”.
In March 2023, OpenAI released its GPT-4 paper but stated that, in view of the competitive landscape and safety considerations, it would no longer disclose the structure and content of the training dataset.
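The book counts attributed to “Books1” and “Books2” follow from multiplying the approximate size of “BookCorpus” by the ratios cited from OpenAI’s disclosures. The short sketch below simply reproduces that arithmetic; the constant and function names are illustrative, and the 7,000 figure is the rounded “BookCorpus” size given earlier in this article, so the results are rough estimates rather than disclosed figures.

```python
# Approximate number of books in BookCorpus, as stated in the article (rounded).
BOOKCORPUS_BOOKS = 7_000

def estimate_books(multiplier: int, base: int = BOOKCORPUS_BOOKS) -> int:
    """Estimate a corpus size as a multiple of the BookCorpus size."""
    return multiplier * base

books1 = estimate_books(9)    # "Books1": about 9 times BookCorpus
books2 = estimate_books(42)   # "Books2": about 42 times BookCorpus

print(f"Books1: about {books1:,} books")
print(f"Books2: about {books2:,} books")
```

On these assumptions, “Books1” is close to the size Project Gutenberg reported (more than 60,000 books), while “Books2” is several times larger, which is the basis for the inference about its source.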
**The plaintiffs brought six claims against OpenAI: the first three concern copyright infringement, the fourth concerns unfair competition, and the fifth and sixth concern two basic types of civil liability, negligence (breach of the duty of care) and unjust enrichment.**
**First, direct copyright infringement.** The plaintiffs did not authorize OpenAI to reproduce their books or make derivative works of them, nor did they authorize OpenAI to publicly display or distribute such reproductions or derivative works.
In addition, the plaintiffs emphasize that because OpenAI’s large language models cannot operate without the expressive information extracted and retained from the plaintiffs’ books, the models themselves are infringing derivative works made without the plaintiffs’ authorization.
**Second, vicarious copyright infringement.** The plaintiffs emphasize that, absent authorization, every output of the large language model constitutes an infringing derivative work. Because OpenAI has the right and ability to control the model’s output and profits from it, OpenAI is liable for vicarious copyright infringement.
Under U.S. case law, vicarious infringement, contributory infringement, and inducement of infringement together make up the system of indirect copyright infringement. Indirect infringement is the counterpart of direct infringement: the infringer does not personally engage in conduct covered by the exclusive rights of copyright (that is, direct copyright infringement), but provides conditions that make someone else’s direct infringement possible.
**Third, violation of the DMCA’s provisions on copyright management information.** By design, the content ChatGPT outputs does not retain the “copyright management information” (CMI) of the works, so the plaintiffs allege that the defendant intentionally removed the copyright management information from their works in violation of the Digital Millennium Copyright Act (DMCA). In addition, the defendant allegedly violated the DMCA by distributing, without authorization, infringing derivative works stripped of copyright management information.
“Copyright management information” is information that identifies a work, its rights owner, and its conditions of use. In both the United States and China, it is unlawful to remove or alter copyright management information, or to make available to the public works whose copyright management information has been removed or altered.
**Fourth, unfair competition.** According to the plaintiffs, OpenAI’s unauthorized use of their copyrighted works for model training violates the California Business and Professions Code because it is improper, immoral, oppressive, and harmful to consumers’ interests.
The plaintiffs allege that the defendant deliberately designed ChatGPT to output excerpts and summaries of their works without crediting the source. By concealing the authors and copying the content and ideas of the infringed works, OpenAI developed a commercial product that earns it unfair profit and reputation.
**Fifth, negligence, that is, breach of the duty of care.** OpenAI is subject to the duty of care set out in the California Civil Code: everyone must act reasonably toward others. This duty is based on industry custom, business practice, the information in the defendant’s possession, and the defendant’s ability to exercise control based on that information.
Once the defendant collected the plaintiffs’ copyrighted works for the purpose of training the GPT models, it took on a duty of care: where it was foreseeable that using the works for model training without authorization would harm the plaintiffs, it should have refrained from using them in an infringing way.
**Sixth, unjust enrichment.** The plaintiffs devoted substantial time and effort to creating the books at issue. Because their works were used to train the GPT models without authorization, the plaintiffs were deprived of the right to profit from them. It is unjust for the defendant to obtain commercial benefits by using the plaintiffs’ works to train the GPT models. Unless enjoined or restrained, the defendant’s conduct will cause irreparable harm to the plaintiffs.
**Closing thoughts: three issues this case raises.**
**As the first representative copyright infringement lawsuit over ChatGPT, this case still has a long way to go before the U.S. District Court for the Northern District of California issues a formal ruling. Before then, several issues in the plaintiffs’ complaint deserve attention and reflection.**
**Concern 1: Infringement by a model is hard to detect.**
The training of a large language model is essentially an internal, non-public use of works, and copyright owners face the practical problem of discovering that their works have been used. Generally speaking, only by comparing the model’s output with their own works and finding substantial similarity can they infer that the works were used without authorization during training. In this case, the plaintiffs could allege that OpenAI’s large language models infringed their books because they found that ChatGPT had output summaries of those works.
But whether this claim holds up remains to be seen. **If the summaries output by ChatGPT were based only on publicly available descriptions of the plaintiffs’ books on the internet, rather than on copying the books themselves and training on them, the basis of the infringement allegation would be shaken.** The plaintiffs also acknowledge that the summaries contain a few factual errors, which suggests to some extent that the model may not have fully learned the books at issue.
**Concern 2: Which rights are infringed still needs to be demonstrated.**
At present, although the storage of work data can formally fall within the “right of reproduction” under copyright law, there is no consensus on whether the core act of training on work data is infringing and, if so, which right under copyright law it infringes. In this case, the plaintiffs emphasize that the normal operation and output of the large language model depend on training on a corpus of works, so training the model constitutes copyright infringement and the model itself is an infringing derivative work.
This claim also remains open to debate. **Apart from a few special requests such as the one in this case, where prompts ask the model to outline, summarize, or translate a specific copyrighted work, in most cases the model receives open-ended generation instructions (not tied to a specific work or a specific writer’s style) and will generally not output a specific work or even fragments of one, so its output does not constitute copyright infringement.**
**Concern 3: Upstream and downstream responsibilities need to be clarified.**
In the field of large-model copyright, the model developer holds the relevant rights in the model itself and therefore bears the copyright liability arising from model training. As for the model’s output, the common industry practice at present is to use contracts to make clear that the rights and responsibilities belong to the user. The “Interim Measures for the Management of Generative Artificial Intelligence Services” issued by the Cyberspace Administration of China on July 10, 2023 also expressly provide that “providers should sign service agreements with users to clarify the rights and obligations of both parties.”
**Notably, the plaintiffs’ claims follow the same division of rights and responsibilities between the two stages of model training and content output.** The plaintiffs’ claim of direct copyright infringement focuses on OpenAI’s model training stage: first, copies of the books were made during training without the plaintiffs’ authorization; second, the large language model itself constitutes an unauthorized, infringing derivative work. **As for ChatGPT’s output, the plaintiffs allege only that OpenAI is liable for indirect copyright infringement (vicarious infringement). This implies that, for the model’s output, it is the user who would be responsible for direct copyright infringement, because the user holds the corresponding rights.**