Fair Use: Training Generative AI

by Stephen Wolfson

Generated by AI: An oil painting in the style of Pieter Jansz Saenredam of a robot learning to follow a recipe in a Dutch kitchen with a large collection of tiny artworks arranged haphazardly on shelves.

“Robot Training” by Creative Commons was generated by the DALL-E 2 AI platform with the text prompt “an oil painting in the style of Pieter Jansz Saenredam of a robot learning to follow a recipe in a Dutch kitchen with a large collection of tiny artworks arranged haphazardly on shelves.” CC dedicates any rights it holds to the image to the public domain via CC0.

While generative AI as a tool for artistic expression isn’t truly new — AI has been used to create art since at least the 1970s and the art auction house Christie’s sold its first piece of AI artwork in 2018 — the past year launched this exciting and disruptive technology into public awareness. With incredible speed, the development and widespread availability of amazing tools like Stable Diffusion and Midjourney have engendered excitement, debate, and indeed fear over what the future may hold and what role generative AI should have in the production of creative works.

Perhaps unsurprisingly to anyone who has been paying attention to the conversation around generative AI, the past year also saw the first lawsuits challenging the legality of these tools. First, in November, a group of programmers sued GitHub and OpenAI over the code generation tool GitHub Copilot, alleging (among other things) that the tool improperly removes copyright management information from the code in its training data, in violation of the Digital Millennium Copyright Act, and reproduces code from its training data without following license terms such as attributing the code to its original author. Then, in January, a group of artists (represented by the same attorneys as in the GitHub lawsuit) sued Stability AI and Midjourney over their text-to-image art generation tools. In this second lawsuit, the artist-plaintiffs made several claims, all of which deserve discussion.

In this blog post, I will address one of those claims: that using the plaintiffs’ copyrighted works (and as many as 5 billion other works) to train Stable Diffusion and Midjourney constitutes copyright infringement. Like Creative Commons, which has argued this elsewhere, and others who agree, I believe that this type of use should be protected by copyright’s fair use doctrine. Below, I will discuss what fair use is, what purpose it serves, and why I believe that using copyrighted works to train generative AI models should be permitted under this law. I will address the other claims in both the GitHub and Stable Diffusion lawsuits in subsequent blog posts.

Copyright for public good

It is clear from both the history and origin of copyright law in the United States that copyright’s purpose is to serve the public good. We can see this in the Constitution itself. Article I, section 8, clause 8 of the U.S. Constitution gives Congress the power to create copyright law. This provision states that copyright law must “promote the Progress of Science and useful Arts” and that copyright protection can only last for “limited times.” As such, any copyright law that Congress passes must be designed to support the creation of new creative works, and copyrights must eventually expire so that the collection of works that are free for us all to use (the public domain) will grow and nurture further creative endeavors. However, even while the ultimate beneficiary of copyright may be the public, the law attempts to achieve these goals by giving rightsholders several specific ways to control their works, including the right to control the reproduction and distribution of copies of their works.

With this design, copyright law attempts to strike a balance between the interests of rightsholders and the public, and when that balance breaks down, copyright cannot achieve its goals. This is where fair use comes from. Shortly after the first copyright law, courts began to realize that it would frustrate copyright’s ability to benefit the public if rightsholders had an unlimited right to control the reproduction and distribution of their works. So, in 1841, Justice Joseph Story first articulated what would eventually become the modern test for fair use in Folsom v. Marsh. As part of that decision, he wrote that downstream uses of copyrighted works that do not “supersede the objects” of the original works should be permitted under the law.

Generated by AI: A chrome-skinned robot face with a blue glow behind its eyes and nose, looking out from the inside of a complex black and white machine.

“Fair Use Training Generative AI” by Creative Commons was generated by the Stable Diffusion AI platform with the text prompt “Fair Use Training Generative AI.” CC dedicates any rights it holds to the image to the public domain via CC0.

What is fair use?

Today, fair use, codified at 17 U.S.C. § 107, is unquestionably an essential part of copyright law in the United States. Courts, including the Supreme Court, have repeatedly emphasized the importance of fair use as a safeguard against the encroachment on the rights of people to use copyrighted works in ways that rightsholders might block. Unfortunately, however, fair use is a famously hard doctrine to apply. Courts repeatedly write that there are no bright lines in what is or is not fair use and each time we consider fair use we must conduct a case-by-case analysis. To that end, the law requires courts to consider four different factors, in light of the purpose and goal of copyright law. These factors are:

1. The purpose and character of the use, or what the user is doing with the original work;
2. The nature of the original work;
3. The amount and substantiality copied by the secondary use; and
4. Whether the secondary use harms the market for the original.

Even though there are no bright lines, there are some principles that courts tend to look to when weighing the four fair use factors, and these are particularly relevant for how we may think about fair use and generative AI training data. First, and perhaps most importantly, is whether the secondary use “transforms” the original in some way, or if it “merely supersede[s]” the original. Since 1994, when the Supreme Court adopted “transformativeness” as part of the inquiry about the purpose and character of the secondary use in Campbell v. Acuff-Rose Music, this question has grown increasingly important. Today, if someone can show that their secondary use transforms the original in some way, it is much more likely to be fair use than otherwise. Importantly, however, last October, the Supreme Court heard Andy Warhol Foundation v. Goldsmith, which may change how we approach transformativeness in fair use under U.S. law. Nevertheless, it still seems likely that highly transformative uses will weigh in favor of fair use, even after the decision in that case. Second, when considering the nature of the original work, we need to remember that copyright protects some works a bit more strongly than others. Works that are fiction or entirely the creative products of their authors are protected more strongly than nonfiction works because copyright does not protect facts or ideas. As such, uses of some works are less likely to be fair use than others. Third, we need to think about how much of the original work is copied in the context of the transformativeness inquiry, and whether the amount copied serves the transformative purpose. If the amount copied fits and supports the transformative purpose, then fair use can support copying entire works. Fourth, when we consider market harm, we need to think about whether the secondary use undermines the market for, or acts as a market substitute for, the original work. And finally, we need to consider whether permitting a secondary use as a fair use would serve the goals of copyright.

Is AI transformative?

Given all this background on fair use, how do we apply these principles to the use of copyrighted works as AI training data, such as in the Stable Diffusion/Midjourney case? To answer this question, we must first look at the facts of the case. Dr Andrés Guadamuz has a couple of excellent blog posts that explain the technology involved in this case and that begin to explain why this should constitute fair use. Stability AI used a dataset called LAION to train Stable Diffusion, but this dataset does not actually contain images. Instead, it contains over 5 billion weblinks to image-text pairs. Diffusion models like Stable Diffusion and Midjourney take these inputs, add “noise” to them, corrupting them, and then train neural networks to remove the corruption. The models then use another tool, called CLIP, to understand the relationship between the text and the associated images. Finally, they use what are called “latent spaces” to cluster together similar data. With these latent spaces, the models contain representations of what images are supposed to look like, based on the training data, and not copies of the images in their training data. Then, user-focused applications collect text prompts from users to generate new images based on the training data, the language model, and the latent space.
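
For readers who want a more concrete picture of the “add noise, then learn to remove it” step, here is a toy sketch in Python. It is only an illustration under stated assumptions, not how Stable Diffusion or Midjourney are actually implemented: the function names, the `keep_fraction` parameter, and the four-number stand-in for an image are all hypothetical, and the blending formula is a common simplification rather than anything taken from the lawsuit or from either company. There is no neural network here; the sketch simply shows what the network is asked to learn.

```python
import math
import random


def add_noise(pixels, keep_fraction, rng):
    """Blend pixel values with Gaussian noise (hypothetical helper).

    keep_fraction near 1.0 leaves the image almost intact; near 0.0 the
    result is almost pure noise. Returns the corrupted pixels and the
    noise that was mixed in.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    signal_scale = math.sqrt(keep_fraction)
    noise_scale = math.sqrt(1.0 - keep_fraction)
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise


def remove_noise(noisy, predicted_noise, keep_fraction):
    """Undo add_noise, given a prediction of the noise that was added."""
    signal_scale = math.sqrt(keep_fraction)
    noise_scale = math.sqrt(1.0 - keep_fraction)
    return [(x - noise_scale * n) / signal_scale
            for x, n in zip(noisy, predicted_noise)]


rng = random.Random(0)
image = [0.1, 0.5, 0.9, 0.3]  # assumed stand-in for an image's pixel values
noisy, true_noise = add_noise(image, keep_fraction=0.5, rng=rng)

# During training, a network would see `noisy` (plus the caption) and be
# scored on how well it guesses `true_noise`. Here we skip the network and
# pass in the real noise, to show what a perfect guess would recover.
recovered = remove_noise(noisy, true_noise, keep_fraction=0.5)
```

In this sketch the original numbers are used only to produce a training signal (the noise to be guessed). What a real model would retain after many such steps are the adjusted weights of its network, which is the point the paragraph above makes about representations rather than copies.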

Turning back to fair use, this method of using image-text combinations to train the AI model has an inherently transformative purpose from the original images and should support a finding of fair use. While these images were originally created for their aesthetic value, their purpose for the AI model is only as data. For the AI, these image-text pairs are only representations of how text and images relate. What the images are does not matter for the model — they are only data to teach the model about statistical relationships between elements of the images and not pieces of art.

This is similar to how Google used digital copies of print books to create Google Books, a practice that was challenged in Authors Guild v. Google (Google Books). In this case, the Second Circuit Court of Appeals found that Google’s act of digitizing and storing copies of millions of print books to create a text-searchable database was fair use. The court wrote that Google’s purpose was different from the purpose of the original authors because Google was not using the books for their content. Indeed, the content did not really matter to Google; rather, the books were like pieces of data that were necessary to build Google’s book database. Instead of using the books for their content, Google’s purpose was to create a digital tool that permitted new ways of using print books that would be impossible in the analog world. So, the books as part of Google’s database served a very different purpose from their original purpose, which supported the finding of fair use in this case.

Moreover, it is also similar to how search engine operator Arriba Soft used copies of images in its search engine, which was litigated in Kelly v. Arriba Soft. In this case, a photographer, Leslie Kelly, sued the operator of a search engine, Arriba Soft, for copying his photographs and displaying them as thumbnails to users. The court, however, disagreed that this constituted copyright infringement. Instead, the court held that this use served a different and transformative purpose from the original purpose because Arriba Soft only copied Kelly’s photographs to enable its search engine to function and not because of their aesthetic value. Like Google Books, and like AI training data, the images here served a function as data for the tool, not as works of art to be enjoyed as such.

On the nature of works as AI inputs

Turning to factor two, the nature of the original works, even though we do not know what specific images are in the LAION dataset used to train Stable Diffusion and Midjourney, it is likely that these images involve a wide range of creativity. While this could weigh against a finding of fair use for Stable Diffusion and Midjourney, given the presumably creative nature of the input works, this factor is rarely determinative. In fact, in Google Books, the court was skeptical that this factor would weigh against fair use even if the books in the database were fiction. This is because using the digitized books as part of the database provided information about the books and did not use them for their creative content. Similarly, in the litigation against Stable Diffusion and Midjourney, these generative AI tools use the works in their dataset as data. In doing so, anything they extract from their training data might only be unprotectable elements of the works, such as facts and ideas about how various concepts are visualized. As such, because this factor is rarely decisive in fair use decisions, it seems unlikely that it should weigh heavily against fair use in this case.

Is AI making copies?

Third, because of how the generative AI models work, they use no more of the original works than is necessary for training to enable the transformative purpose. The models do not store copies of the works in their datasets and they do not create collages from the images in their training data. Instead, they use the images only as long as they must for training. These are merely transitory copies that serve a transformative purpose as training data. Again, Google Books is helpful to understand this. In that decision, the court wrote that Google needed to both copy and retain copies of entire books for its database to function. But this was permissible because of Google’s transformative purpose. Furthermore, Google did not permit users to access full copies of the books in the database, but instead, it only revealed “snippets” to the users. On this point, the court wrote that the better question to answer was not how much of the works Google copied, but instead how much was available to users. Similarly, Stable Diffusion and Midjourney would not work unless they used the entire images in their training datasets. Moreover, they do not store images, they do not reproduce images in their datasets, and they do not piece together new images from bits of images from their training data. Instead, they learn what images represent and create new images based on what they learn about the associations of text and images.

AI in the marketplace

Fourth, the issue of whether Stable Diffusion and Midjourney harm the market for the works in their training data is difficult, in part because the way that courts think of this question can be a bit inconsistent. In one way, the answer must be yes, this use at least has the potential to harm the market for the original. That is, after all, one likely reason the plaintiffs filed this lawsuit in the first place: they are afraid that AI-generated content will cut into their ability to profit off of their art. Indeed, any art has the potential of competing with other art, not necessarily because it fills the same niche, but because attention is limited, and AI-generated content has the advantage of being able to be made in a quick, automated fashion. However, this may not be the best way to think about market harm in the context of using images as training data. As mentioned above, we need to think about this question in the context of the transformative purpose. In Campbell v. Acuff-Rose, the Supreme Court wrote that the more transformative the purpose, the less likely it is that the secondary use will be a market substitute for the original. Given this, perhaps it is better to ask whether this use as training data, not as pieces of art, harms the market for the original. This use by Stability AI and Midjourney exists in an entirely different market from the original works. It does not usurp the market of the original and it does not operate as a market substitute because the original works were not in the data market. Moreover, this use as training data does not “supersede the objects” of the originals and does not compete in the aesthetic market with the originals.

Training AI as fair use

Finally, as discussed above, since the purpose of copyright law is to encourage new creative works, to promote learning, and to benefit the public interest, fair use should permit using copyrighted works as training data for generative AI models like Stable Diffusion and Midjourney. The law should support and foster the development of new technologies that can provide benefits to the public, and fair use provides a safeguard against the cudgel of copyright being used to impede these technologies. As Mark Lemley and Bryan Casey write in a recent paper arguing that this type of use should constitute fair use: “A central problem with allowing copyright suits against ML [machine learning] is that the value and benefit of the system’s use is generally unrelated to the purpose of copyright.” In fact, the Supreme Court has recognized fair use’s importance in the development of new technologies, first in 1984, in Sony Corp. of America v. Universal City Studios, and most recently in 2021, in Google v. Oracle. In Sony, the Court held that the Betamax videocassette recorder should not be sued out of existence even if it could potentially help people violate copyright law. Instead, because it was capable of “substantial noninfringing uses,” the Court believed copyright law should not be used to stop it. Then, in Google v. Oracle, the Court held that Google’s use of roughly 11,500 lines of code from Oracle’s Java API was fair use, writing that the courts must consider fair use in the context of technological development.

Altogether, I believe that this type of use for learning purposes, even at scale by AI, constitutes fair use, and that there are approaches outside of litigation that can give authors other ways to control the use of their works in datasets. We can already see an example of this, to a degree, in Stability AI’s announcement that it would permit artists to opt out of having their works used for training Stable Diffusion. While this certainly isn’t a perfect solution, and opt-out is just one possible way to approach these issues, it is at least a start, and it highlights that there are ways to address these problems other than through copyright-based solutions. Perhaps by looking at norms and best practices and by engaging people in collaboration and dialogue we can better address the concerns raised by AI training data, instead of falling back on lawsuits that force the different sides of this issue into opposition and that can create unpredictable and potentially dangerous new precedent for future technologies.

Like the rest of the world, CC has been watching generative AI and trying to understand the many complex issues raised by these amazing new tools. We are especially focused on the intersection of copyright law and generative AI. How can CC’s strategy for better sharing support the development of this technology while also respecting the work of human creators? How can we ensure AI operates in a better internet for everyone? We are exploring these issues in a series of blog posts by the CC team and invited guests that look at concerns related to AI inputs (training data), AI outputs (works created by AI tools), and the ways that people use AI. Read our overview on generative AI or see all our posts on AI.

Note: We use “artificial intelligence” and “AI” as shorthand terms for what we know is a complex field of technologies and practices, currently involving machine learning and large language models (LLMs). Using the abbreviation “AI” is handy, but not ideal, because we recognize that AI is not really “artificial” (in that AI is created and used by humans), nor “intelligent” (at least in the way we think of human intelligence).

Posted 17 February 2023