Meta leeched 82 terabytes of pirated books to train its Llama AI, documents reveal


Authors suing Meta for pirating their books have unveiled new evidence that Meta torrented at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive.

“I feel that using pirated material should be beyond our ethical threshold,” reads one of the unsealed messages, sent by a Meta employee in 2022.

Internal Meta messages show an employee giving a status update on downloading 10TB from Libgen, 54TB from Z-Library, and 126TB of data from the Internet Archive. The employee complained that there were too few seeds and that download speeds were low.

Famous authors suing Meta argue that the magnitude of Meta’s unlawful torrenting scheme is astonishing.

“Just last spring, Meta torrented at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive, including at least 35.7 terabytes of data from Z-Library and LibGen. Meta also previously torrented 80.6 terabytes of data from LibGen,” the court document reads.

Writers Richard Kadrey, Sarah Silverman, and Christopher Golden initially filed the class action complaint on July 7th, 2023. The authors alleged that Meta’s large language model Llama was trained by copying and ingesting massive amounts of copyrighted text.

The initial complaint cited Meta’s own statement that 85 gigabytes of the training data came from a category called ‘Books,’ and that one of the data sources was “ThePile,” a publicly available dataset. This dataset allegedly included 197,000 copyrighted books from notorious shadow libraries, such as Library Genesis, Z-Library, and Sci-Hub. The case became known as “Kadrey et al. v. Meta Platforms.”

The new evidence suggests that Llama models may have ingested millions of books.

Previously, US District Court Judge Vincent Chhabria dismissed nearly all of the claims against Meta, calling them nonsensical, including the claim that Llama is itself an infringing derivative work. The authors were left with a single claim: that Meta directly infringed their copyrights by training AI models on their books. The judge sharply criticized the authors’ lawyers for “dragging out litigation.”

Now the authors allege that Meta engaged in a large-scale piracy scheme. Not only did Meta torrent the data itself, they claim, but it also acted as a seeder, sharing massive amounts of copyrighted books with others.

One unsealed document shows Meta employees deciding not to use Facebook infrastructure for data downloading from pirated databases in order to “avoid risk of tracing back the seeder/downloader from FB servers.”

The plaintiffs believe they have far exceeded the “minimal showing that the crime-fraud exception could apply” and are asking the court to order Meta to provide additional information and documents related to the alleged piracy, which they say took place with the knowledge and involvement of Meta’s legal team.

“These documents show those very witnesses were intimately involved in that unlawful conduct. Mark Zuckerberg, for example, claimed to have no knowledge of LibGen or any involvement in its use,” the plaintiffs say.

Meta thinks it’s fair use

Meta previously filed a statement rejecting the notion that it has distributed pirated LibGen books. The tech giant argues that its use of public materials falls under the ‘fair use’ legal doctrine.

Last week, Meta filed a motion to dismiss two of the authors’ three claims in the case.

“At the crux of this case is an issue of extraordinary importance to the future of generative AI development in the United States: whether Meta’s use of publicly available datasets to train its open source large language models (LLMs) – transformational technology powering innovation, productivity, and creativity – constitutes fair use under US copyright law,” a motion filed by Meta reads.

Meta argues that the plaintiffs failed to allege that they were injured by the removal of copyright management information from their books before the books were used as training data. Meta also wants the court to dismiss the claim that it accessed the data illegally.

“Finally, Plaintiffs’ new theory of copyright infringement based on Meta’s alleged “distribution” of datasets is also facially defective,” Meta said in an earlier motion.

“Plaintiffs do not plead a single instance in which any part of any book was, in fact, downloaded by a third party from Meta via torrent, much less that Plaintiffs’ books were somehow distributed by Meta.”

The tech giant hopes to debunk this “meritless allegation” on summary judgment.