If an AI program scans a book, is that copyright infringement or fair use?

As artificial intelligence programs have become ubiquitous over the past year, so have lawsuits from authors and other creative professionals who argue that their work has been essential to that ubiquity—the “large language models” (or LLMs) that power text-generating AI tools are trained on content that has been scraped from the web, without its authors’ consent—and that they deserve to be paid for it. Last week, my colleague Yona Roberts Golding wrote about how media outlets, specifically, are weighing legal action against companies that offer AI products, including OpenAI, Meta, and Google. They may have a case: a 2021 analysis of a dataset used by many AI programs showed that half of its top ten sources were news outlets. As Roberts Golding noted, Karla Ortiz, a conceptual artist and one of the plaintiffs in a lawsuit against three AI services, recently told a roundtable hosted by the Federal Trade Commission that the creative economy only works “when the basic tenets of consent, credit, compensation, and transparency are followed.”

As Roberts Golding pointed out, however, AI companies maintain that their datasets are protected by the “fair use” doctrine in copyright law, which allows for copyrighted work to be repurposed under certain limited conditions. Matthew Butterick, Ortiz’s lawyer, told Roberts Golding that he is not convinced by this argument; LLMs are “being held out commercially as replacing authors,” he said, noting that AI-generated books have already been sold on Amazon, under real or fake names. Most copyright experts would probably agree that duplicating a book word for word isn’t fair use. But some observers believe that the scraping of books and other content to train LLMs likely is protected by the fair use exception—or, at least, that it should be. In any case, new debates around news content, copyright, and AI are building on similar debates around other types of creative content—debates that have been live throughout AI’s recent period of rapid development, and that build on much older legal concepts and arguments.   

Determining whether LLMs training themselves on copyrighted text qualifies as fair use can be difficult even for experts, not just because AI is complicated, but because the concept of fair use is, too. According to a 1990 Supreme Court ruling, the doctrine was initially intended to counterbalance decisions under the Copyright Act of 1976 that might inadvertently “stifle the very creativity which that law is designed to foster.” The US Copyright Office notes that the Act lists certain types of activity, such as criticism, comment, and news reporting, as examples of uses that may qualify under the exemption. But judges deciding such cases have to take into account four separate and in some cases competing factors: the purpose of the use and whether it is “transformative,” the nature of the copyrighted work, the amount of the work used, and what effect the use has on the market for the original.

Note: This was originally published as the daily newsletter for the Columbia Journalism Review, where I am the chief digital writer.

The courts are more likely to find that nonprofit uses are fair, but the Copyright Office notes that this doesn’t mean that all nonprofit uses are fair and that all commercial uses are not. And while the use of short excerpts of original works is more likely to be fair, some courts have found the use of an entire work to be fair if that use is seen as transformative, that is to say, if it has added something meaningfully new or used the work in a different way from what was originally intended. When it comes to AI, the heart of the issue is the debate over what exactly an LLM does. Does it copy entire books in order to reproduce them? Or does it simply add the words in those books to its database, in order to answer questions and generate new content?

Earlier this year, Matthew Sag, a law professor at Emory University, told a US Senate subcommittee that technically, AI engines do not copy original works but rather “digest” them, in order to learn how human language functions. Rather than thinking of an AI engine as copying a book “like a scribe in a monastery,” Sag said, it makes more sense to think of it as learning from the data, like a student would. Joseph Paul Cohen, director of a file-sharing service called Academic Torrents, told Wired recently that great authors typically read the books that came before theirs. “It seems weird that we would expect an AI author to only have read openly licensed works,” Cohen said.

If we see LLMs as merely adding content to a database in order to generate better results for users, this would seem very similar to how search engines like Google work. And Google has won two important copyright cases that seem relevant to the AI debate. In 2006, the company was sued by Perfect 10, an adult entertainment site that claimed that Google had infringed its copyright by generating thumbnail photos of its content; the court ruled that providing images in a search index was “fundamentally different” from simply creating a copy, and that in doing so, Google had provided “a significant benefit to the public.” In the other case, the Authors Guild, a professional organization that represents the interests of writers, sued Google for scanning more than twenty million books and showing short snippets of text when people searched for them. In 2013, a judge in that case ruled that Google’s conduct constituted fair use because it was transformative.

In 2019, the US Patent and Trademark Office requested input on questions around intellectual property protection and AI. OpenAI responded that it believes that the training of AI systems counts as a “highly transformative” use of copyrighted works, because the latter are meant for human consumption, whereas the training of AI engines is a “non-expressive” activity aimed at helping software learn the patterns in language. Although the company conceded that its software scans entire works, it argued that the more important question is how much of a work is shown to a user. This argument was also a factor in the Google Books case. 

Perhaps unsurprisingly, the Copyright Alliance, a nonprofit group that represents authors and other creative professionals, has taken issue with comparisons between Google’s scanning of books and AI engines’ training of LLMs: unlike the former, the group says, the latter does not make provisions to acknowledge “factual information about the copyrighted works” and link out to where users can find them. Instead, the Alliance argues that most generative AI programs “reproduce” expressive elements from copyrighted works, thereby creating new texts that often act as “market substitutes” for the originals they were trained on. Several of the recent copyright suits against AI services have referred to Books3, a large open-source database that Shawn Presser, an independent AI researcher, created from a variety of online sources, including so-called “shadow libraries” that host links to pirated versions of books and periodicals. Presser has argued, in response, that deleting databases like his risks creating a world in which only billion-dollar tech companies with big legal budgets can afford to create AI models.

According to a recent analysis by Alex Reisner in The Atlantic, the fair-use argument for AI generally rests on two claims: that generative-AI tools do not replicate the books they’ve been trained on but instead produce new works, and that those new works “do not hurt the commercial market for the originals.” Jason Schultz, the director of the Technology Law and Policy Clinic at New York University, told Reisner that there is a strong argument that OpenAI’s work meets both of these criteria. Elsewhere, Sy Damle, a former general counsel at the US Copyright Office, told a House subcommittee earlier this year that he believes the use of copyrighted work for AI training is categorically fair (though another former counsel from the same agency disagreed). And Mike Masnick of Techdirt has argued that the legality of the original material is irrelevant. If a musician were inspired to create new music after hearing pirated songs, Masnick asks, would that mean that the new songs infringe copyright?

As Reisner notes, some observers are concerned that AI indexing will change the incentives of the existing copyright system. If an AI program can scrape copyrighted works and turn out something in a similar style, artists could be less likely to create new works. But some authors seem sanguine about the prospect that their works will be scraped by AI. “Would I forbid the teaching (if that is the word) of my stories to computers?” Stephen King asked recently. “Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces.”

Going forward, will news organizations try to hammer the loom? Some companies see licensing their content to train AI engines as a potential way forward; the Associated Press, for example, recently announced a licensing deal with OpenAI. But as Roberts Golding notes, others have started protecting their websites from scraping tools, and rumors are circulating that several top media companies could soon bring big AI firms to court. Mehtab Khan, a lawyer and legal scholar at Harvard’s Berkman Klein Center, told Roberts Golding that suing would be a gamble: defeat could set the precedent that AI training constitutes fair use. When it comes to AI and copyright more generally, Khan said, the key is to find “a balance that would allow the public to have access, but would also address some of the anxieties that artists and creative industries have about their words being used without consent, and without compensation.”
