Researchers find LLMs printing copyrighted materials from books like Harry Potter


Researchers have demonstrated that large language models (LLMs), such as ChatGPT, memorize large chunks of copyrighted materials. The chatbots accurately repeated more than 50 words from books like Harry Potter.

ChatGPT can recite the first 50 lines of the Bible for you, no problem. However, even new books are not safe from the redistribution of large chunks of text, raising copyright concerns.

Researchers managed to make the GPT-3.5 model spit out precise quotations of more than a hundred words from books such as Harry Potter and the Sorcerer’s Stone, Gone with the Wind, and Lolita.

“Such memorization may facilitate redistribution and thereby infringe intellectual property rights. Is that really fair?” reads the new paper from researchers at the Department of Computer Science of the University of Copenhagen and the University of Electronic Science and Technology of China.

The researchers managed to achieve this with a simple tactic – direct probing. They asked various LLMs direct questions such as “what is the first page of [TITLE]?” The list of books included 19 best-sellers released after 1930.

The record was achieved by GPT-3.5, which produced a 161-word quotation from Harry Potter and the Sorcerer’s Stone after five runs. The more capable GPT-4 model was not tested.

The results were described as “a conservative characterization of the extent to which language models can redistribute these materials,” as more could be achieved with carefully optimized prompts.

“Books such as Lolita, Harry Potter and the Sorcerer’s Stone, and Gone with the Wind, appear to be highly memorized, even with our simple probing strategies, leading the models to output very long chunks of text raising copyright concerns,” researchers wrote.

The larger the model, the more it memorized

Researchers revealed a linear correlation between the size of the LLMs and the amount of copyrighted material they can print out.

“Larger language models may increasingly infringe upon existing copyrights in the future,” researchers noted.

Models with fewer than 60 billion parameters, such as OPT, Pythia, Falcon, and LLaMA, failed to reproduce more than 50 words on average.

However, Claude and GPT-3.5 Turbo scored above 50 words in over half of the books tested.

The most famous works seem to be at the highest risk of copyright infringement, as a book’s popularity significantly correlated with the memorization amount demonstrated.

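Moreover, LLMs sometimes hallucinate or rephrase parts of the copyrighted content. Such passages were not counted by the Longest Common Subsequence measure the researchers used, so only words matching the original text contributed to the scores.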
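For readers curious how a Longest Common Subsequence score behaves, here is a minimal Python sketch. It assumes simple whitespace tokenization and a word-level comparison; the function names `lcs_length` and `memorized_words` are illustrative and not taken from the paper, whose exact preprocessing may differ. The sample sentences are invented for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Classic dynamic programming, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def memorized_words(model_output, reference):
    """Count words of model_output that appear, in order, in reference.

    Assumption: whitespace tokenization, case-sensitive matching.
    """
    return lcs_length(model_output.split(), reference.split())


# Invented sample text, not a quotation from any book.
reference = "Mr. and Mrs. Smith lived in a small house"
output = "Mr. and Mrs. Smith resided in a tiny house"
print(memorized_words(output, reference))  # 7
```

The two rephrased words ("resided", "tiny") do not match the reference, so they add nothing to the score. This is why rephrased or hallucinated passages are not counted by such a measure.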

How many words are too many?

While the researchers do not want to draw any legal conclusions from the paper, they do discuss copyright law protections for creators. Laws in the US and Europe allow some fair use of copyrighted material. However, these exceptions are limited, for example in the length of quotations or the number of book copies held in libraries.

“For book-length material, some say a quotation limit of 300 words is common practice, but others have argued for anything from 25 words to 1000 words. A limit of 50 words is common for chapters, magazines, journals, and teaching material,” the paper reads.

The researchers argue that quoting more than 300 words can lead a court to weigh against fair use.

Their work, “Copyright Violations and Large Language Models” (arXiv:2310.13771 [cs.CL]), was supported by the Novo Nordisk Foundation.
