Today’s Fahrenheit 451? Anthropic destroys millions of books to train its AI


The company paid for the physical copies, which technically allows it to do whatever it wants with them, butchering included.

Anthropic, an artificial intelligence company, has taken rather drastic measures to train its AI assistant, Claude.

It’s been reported that the company has been using print books to train Claude, ruining them in the process.


Court documents reveal that the company cut millions of books from their bindings, scanned them into digital files, and discarded the originals afterward.

In February 2024, Anthropic hired Tom Turvey, who had been head of partnerships for Google's book-scanning project. With this move, the company expected to repeat what Google had done and digitize books legally, Ars Technica reports.

However, Google and Anthropic took different approaches to digitizing physical books.

Google's Books project relied on the company's special camera process, which can scan large numbers of books that are borrowed from libraries and later returned.


Meanwhile, Anthropic seemed to focus on the speed and the cost of digitizing, forgetting the need to preserve the used publications.

The company bought the books in bulk, removed their bindings, cut the pages to the needed dimensions, scanned them into PDFs, and got rid of the physical copies.

The judge reportedly ruled that this operation was fair use because Anthropic first legally bought the books it was scanning and later destroyed them, leaving only digital files that weren't distributed publicly.


This expensive operation stems from the fact that AI companies need high-quality text to train their large language models (LLMs). An AI assistant trained on such text can give users better, more accurate, and more coherent answers than one trained on information found online.

To obtain such content, AI companies need a license from publishers. However, another way was found: buying physical copies of books, which gives the buyer the right to do whatever they want with that copy.

This million-dollar plan gave AI companies such as Anthropic an easier and faster way to feed their LLMs high-quality content.

What raises eyebrows about this case is the fact that there are ways to digitize books without destroying them. For example, Google and OpenAI collaborated with Harvard libraries to train AI models on digitized books dating back to the 15th century that are still kept in their physical format.