Law profs defend Meta's use of pirated books for Llama AI. But is it really "fine"?


Meta is arguing that sourcing copyrighted content, including pirated books, as training material without permission is fair use. Now, a group of law professors has backed the tech giant’s stance.

Writers Richard Kadrey, Sarah Silverman, and Christopher Golden initially filed the class action complaint on July 7th, 2023. The authors alleged that Meta’s large language model Llama was trained by copying and ingesting massive amounts of copyrighted text.

In February, they also unveiled new evidence that Meta torrented at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive – and said that the magnitude of the scheme was “astonishing.”


Not so, a group of intellectual property law professors now says. Scholars from Harvard, Emory, Boston University, and Santa Clara University have submitted an amicus brief supporting Meta’s stance.

As a reminder, an amicus brief is a legal document submitted by individuals or organizations because they have an interest in the pending case, even though they aren’t directly a party to the case itself.

Fair use: is this just Meta’s internal copying?

After uncovering the fact that Meta used BitTorrent to download pirated books to use as training material, the plaintiffs in the class action lawsuit said last month that this was the clearest proof yet of copyright infringement.

“The uncontroversial implication is that for fair use to apply, the work that was copied must have been lawfully acquired in the first place,” the authors wrote.

Meta has so far relied heavily on a fair use defense, arguing that its use of “publicly available datasets” to train large language models (LLMs) is vital to the future of generative AI development in the US and entirely permissible under US copyright law.

Now, four prominent intellectual property professors said that Meta’s use of pirated books was indeed fair. According to the amicus brief (PDF), the source of the training data is not determinative as long as it’s used to create a new and transformative product.


Essentially, the argument is that using books outside their original “reading” purpose to create an AI model transforms the purpose of the use. In other words, this is internal copying and fair use.

The professors also note that previous cases where fair use was denied typically revolved around copyright infringement related to personal consumption – not the use of content to create something new.

Some countries, such as Japan and Singapore, have crafted exceptions in their laws to allow tech companies to train LLMs on copyrighted material without permission.

For example, Keiko Nagaoka, Japan’s minister of education, culture, sports, science, and technology, indicated in January that AI companies in Japan can use “whatever they want” for AI training “regardless of whether it is for non-profit or commercial purposes, whether it is an act other than reproduction, or whether it is content obtained from illegal sites or otherwise.”


The US has no such exceptions, but in their brief, the law professors urge the court to apply fair use. As the VCR and other innovations showed, they argue, copyright shouldn’t stand in the way of new tools and developing technologies.

“Copyright owners have often predicted that new technologies, from photocopying to home VCRs to the internet, would create disasters for copyright owners and that fair use needed to be shrunk to protect them. Instead, new technologies have routinely created new markets,” they said.

“Whatever the risks of AI – and there may be many – condemning the act of creating large-scale training datasets as copyright infringement is not the answer.”


Do the arguments hold water?


The opposing side – the one behind the class action lawsuit – disagrees, of course, and says substantial flaws can be found in the professors’ reasoning.

Writing on his blog last week, Pascal Hetzscholdt, the Senior Director of Content Protection at Wiley, a global publisher of trade books, textbooks, and scientific research, pointed to the US Supreme Court’s recent clarification that transformative use must serve a fundamentally different purpose than the original.

“Meta’s AI models are fundamentally commercial products that utilize copyrighted works to generate content that often serves the same purpose as the original works,” said Hetzscholdt.

According to him, the amicus brief somehow fails to mention that Meta’s models aren’t really academic or research endeavors. On the contrary, they’re commercial products designed to generate billions in revenue.

Unsealed emails from 2025 show that Meta’s legal team was directly involved in discussions to halt licensing efforts in favor of using pirated sources, demonstrating willful commercial exploitation rather than good-faith transformative use, Hetzscholdt said.

Moreover, while the amicus brief dismisses market harm concerns, empirical research directly contradicts this position.

The study “Cloze Encounters: The Impact of Pirated Data Access on LLM Performance” (PDF) provides concrete evidence that AI models perform measurably better when trained on copyrighted works, with performance improvements of up to 23% when using pirated books.

This empirical evidence establishes a direct link between unauthorized use of copyrighted material – on a massive scale – and commercial benefits.

Finally, Hetzscholdt notes that Meta’s documented conduct goes beyond mere unauthorized access because the unsealed emails reveal Meta employees discussing the risks of being caught and proactively suggesting using VPNs to hide their activities.

“Courts have consistently held that deliberate circumvention of access controls and concealment efforts undermine fair use claims,” he wrote.
