The AI IP wars: when tech giants flip their stance on copyright


Over the last few years, big tech has scraped freely available data from across the internet to train its machine-learning models. That calculus changed quickly once the tables turned and competitors began using those companies' outputs. Suddenly, scraping became a problem. The pattern is surfacing in so many corners of AI that it's becoming more comical than ironic.

When hoovering up online content, many in big tech defended their actions by saying, "If it's on the internet, it's fair game." Yet when DeepSeek and other rivals came along, they suddenly became passionate about protecting intellectual property. So how did we get here?

Early scraping vs newfound outrage

OpenAI originally collected thousands of websites, documents, videos, and images to train its GPT models. Meta followed suit with its large language model efforts, using everything from open encyclopedias to random blog posts and user data. Both justified it by arguing that these new AI systems need massive datasets to produce better results.

In a desperate bid to justify its actions, big tech had no problem reminding critics that many earlier search engines and archiving tools used similar indexing strategies. However, things took a dramatic turn when the next generation of AI startups began building their models by drawing on the outputs produced by these same tech giants.

Systematically querying ChatGPT (or one of Meta's openly released models), analyzing the replies, and training a new system on them set off a few alarm bells. Yet it's the same logic behind how GPT or Llama was built in the first place: gather data, feed it in, and watch the results.
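
To make the mechanics concrete, here's a minimal sketch of that distillation-style loop, assuming the official openai Python client; the prompt file, output path, and model name are illustrative placeholders, not a description of any company's actual pipeline:

```python
# Hypothetical distillation-style data collection (illustrative sketch only).
# Assumes the `openai` package and an OPENAI_API_KEY environment variable;
# prompts.txt, distilled_pairs.jsonl, and the model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One prompt per line in a hypothetical input file.
with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

with open("distilled_pairs.jsonl", "w") as out:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        reply = response.choices[0].message.content
        # Each prompt/reply pair becomes one training example for a new model.
        out.write(json.dumps({"prompt": prompt, "completion": reply}) + "\n")
```

Worth noting: OpenAI's terms of service expressly restrict using outputs like these to develop competing models, which is exactly the lever the company now reaches for.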

Yet once their own data is on the line, the once-flexible corporations make noise about possible terms-of-service violations or blatant infringement. It feels inconsistent at best: either widespread scraping is acceptable, or it isn't.

The real challenges in IP contracts

Amid these overlapping complaints, Alex Watt, an IP lawyer at Howard Kennedy, noted that many agreements haven't evolved to address AI. In a recent interview, Watt told me, "I don't think that the agreements that I'm seeing have caught up completely with the impact that AI is going to have unless it's a very AI-focused agreement."

In other words, older contracts never imagined a scenario in which massive neural networks would sample entire catalogs of text or images.

Watt also gave the example of arranging for an ad agency to produce work: "When you're engaging a production team to create an advert for you, whereby you want to own all the intellectual property in that advert, you must insist that either no AI content is incorporated, or if it is incorporated that you notify the procurement services and provide indemnities to say that you own the IP."

His perspective highlights the confusion over ensuring AI-driven outputs consistently protect the original client's interests. After all, how do you prove where AI content came from if the datasets are enormous and often untraceable?

The human authorship question

Ed Klaris of Klaris Law believes AI can complement human creativity if a real person guides the process. "I believe that anybody who gets involved with creating content as a partner with AI should be the author," he told me at Web Summit in Lisbon last year.

"If we disincentivize people from using the technology, we will lose a great opportunity."

From his angle, you don't want to punish people trying to explore new tools. Yet there's still the sticky matter of training data. Who owns that material? And should companies that scrape the entire web be able to claim infringement if someone else scrapes them?

Klaris thinks some baseline for fairness would help. Disputes would be less thorny if there were clear guidelines around authorship – human plus AI as a co-creation. Yet it's tough to see an easy path forward when the most prominent names were built on massive datasets that included everything from news articles to personal blog posts.

There is no avoiding the fact that we are dealing with double standards on multiple fronts. It's easy to see why OpenAI would call training a rival model purely on ChatGPT responses underhanded: it bypasses the cost of building a system from scratch, an imitation that relies on OpenAI's tech to do the heavy lifting.

Outside of the big tech bubble, many will quickly point out that this is what OpenAI did on a broader scale with public internet data. The difference is that the target is no longer a scattered group of websites but a single enterprise that doesn't like having its specialized knowledge siphoned off.

Meta is another excellent example: it championed open releases of some AI frameworks, yet the same organization complains that users or other companies might repurpose its proprietary data. The unspoken rule, it seems, is that agreements should be open when they help a big tech brand and closed when they benefit a smaller competitor.

Artists, authors, and the fallout

The writers and artists who create the original content are caught in the middle. They rarely see compensation if an AI system trains on their work, and most probably don't even know that it's happening.

But when these creators speak up about potential infringement, big tech points to fair use or the broad public domain. This can feel incredibly unfair to the original authors since the scale and resources of these major AI developers often overshadow them.

Without any hint of self-awareness, when a new AI project uses Meta's or OpenAI's outputs, big tech has no problem saying, "Hey, that's not right. You're stealing from us."

As an eternal optimist, I'm hopeful these companies will eventually reach out to everyday creators, having learned the error of their ways after being on the other side of an IP conflict.

Maybe one day, they’ll say, "Let's share revenue," or "We'll license your material properly." Unfortunately, big tech only leans on legal theories about the fairness of IP until it becomes the party that loses out.

Why the AI IP clashes are growing

The pattern of scraping and repackaging data is unlikely to vanish anytime soon. As more AI startups arrive, they will likely rely on the same notion that large bodies of text, images, or code are fair game. The incumbents will insist that these new players need formal licenses or must pay for the privilege. At a deeper level, these arguments reflect how messy the entire framework has become.

The law hasn't kept pace with the scale of machine learning. Courts still debate whether training data should be covered by standard copyright law – especially when the AI's final output isn't a direct copy but a derivative "understanding."

In the meantime, the hypocrisy is hard to ignore. When OpenAI once needed text from every corner of the web, they brushed aside complaints from journalists, artists, and content creators. Now, they act shocked that another outfit might extract knowledge from ChatGPT. The same is true for Meta, which eagerly gathered user info yet tries to shield its AI models from easy duplication.

Looking to the future, two pathways are emerging. One path features updated legislation or broad treaties that define how AI training can or can't use copyrighted data and whether there must be compensation for the owners of that material.

If that occurs, big tech might have to pay licensing fees to authors or face real penalties for ignoring them. But if judges agree that training data is automatically fair use, then complaints from big incumbents about smaller AI companies scraping them might fall flat.

Predictably, big tech has no plans to drop its "collect first, worry later" approach, yet it's quick to chastise competitors who do the same thing with its outputs. The losers might be those whose words, images, videos, or music were scraped in the first place. The fight to shape new copyright laws looks set to intensify, with lawsuits coming from both sides.

Pressure is mounting around AI and IP, which could eventually force a more coherent standard. Anything would be better than the Wild West of data collection that only benefits the usual suspects in big tech at the expense of everyone else.