AI training controversy: NYT may join authors in scraping fight


The environment for raising newborn AI models is deteriorating. Companies and authors increasingly oppose the training of AI on their copyrighted material, which has become the target of huge scraping operations. The courts may have the final say.

While thousands of authors are urging AI companies to stop using their hard work without permission, ChatGPT may be silenced by the looming New York Times (NYT) lawsuit against OpenAI over intellectual property rights, as reported by NPR.

News of the potential lawsuit came after weeks of unsuccessful negotiations over a licensing deal that would have let OpenAI incorporate NYT stories in its AI tools in exchange for payment. Two NPR sources said legal action is being considered due to “the discussions becoming so contentious.”

Microsoft-backed startup OpenAI previously reached a similar deal with the Associated Press, one of the largest news agencies.

Tech companies have introduced AI models as services with various monetization models, yet they are all based on someone else's work.

To fight content absorption, the NYT updated its terms of service in August. Changes were introduced to prohibit using any of its content for AI training without written permission.

If the case goes to court, OpenAI could face damages of up to $150,000 for each piece of infringing content if the court finds that the infringement was willful. The minimum sum for a proven violation may be as low as $200. Alternatively, the copyright owner is entitled to recover actual damages.

Training an AI model usually involves data sets containing millions of works.

All large language models, such as OpenAI’s GPT-3.5 and GPT-4 (ChatGPT), Google’s PaLM 2 (Bard), Meta’s Llama 2, and others, are being trained on data from the “whole internet.” Google even revised its privacy policy to allow the use of “publicly available information to train Google’s AI models,” as first spotted by Gizmodo.

The lawsuit could put OpenAI at risk of having to rebuild its large language models from scratch without using copyrighted data. A successful high-profile legal battle could also encourage similar claims against big tech.

Google recently announced that its AI would provide summaries of articles in search results, a practice that could greatly diminish the need to visit the actual news source.

That is precisely one of the NYT’s main fears in the dispute with OpenAI, according to one of NPR’s sources: “The need to visit the publisher's website is greatly diminished.”

The NYT is not alone in contesting AI training practices that rely on scraping data from the internet. Comedian Sarah Silverman and other popular authors have sued OpenAI over the remixing of copyrighted works. In April, Getty Images sued Stability AI, the creator of the AI image generator Stable Diffusion, for training its model on photos without authorization. Image generator Midjourney was also named in a separate lawsuit over the use of billions of copyrighted images.

The main question that needs to be answered is whether scraping is legal. While growing opposition says it isn’t, legal precedent may point to the fair use doctrine, which was applied to the Google Books library and its millions of scanned books.

In that case, the court ruled out copyright infringement. But AI creators will have to prove that their use case is not a substitute for media coverage or authors' works.

The US Federal Trade Commission has opened an investigation into OpenAI, after claims that it broke consumer protection laws by putting personal reputations and data at risk.

A group of media outlets has formed a coalition to pressure OpenAI into paying for the use of their work. Two European institutions, the European Parliament and the Council of Europe, have taken decisive steps toward regulating this transformative technology.

The AI Act is supposed to benefit society and journalism with a robust and responsible framework that emphasizes, among other things, upholding traditional journalistic values.