AI and data scraping: websites scramble to defend their content


AI startups leveraging data scraping practices are in hot water, with multiple lawsuits in the pipeline already. Large social media sites are looking for ways to defend their data. However, there’s a hitch – scraping isn’t illegal.

"Several entities tried to scrape every tweet ever made in a short period of time. That is why we had to put rate limits in place.”

This is how Elon Musk, Twitter’s owner, explained the decision to limit how many tweets different tiers of accounts could read each day, at the beginning of July.


Users weren’t happy. But Twitter, now rebranded as X, then showed it was serious by filing a lawsuit in Texas against four entities accused of data scraping.

WFAA, an ABC-affiliated TV station, reported that the volume of automated sign-up requests from the four defendants' IP addresses far exceeded what any single person could send, severely taxing Twitter's servers.

Data scraping, a technique in which a computer program extracts data from the output generated by another program, is indeed becoming a big problem for large social media sites such as Twitter and Reddit.
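
To make the technique concrete, here is a minimal sketch in TypeScript (assuming Node 18+, which ships a global fetch): it pulls a public page and extracts one field from HTML that another program generated. The URL and the regex-based extraction are illustrative only; a real scraper would use a proper HTML parser and run against thousands of pages.

```typescript
// Minimal data-scraping sketch: fetch a public page and extract structured
// data from HTML that another program generated. Illustrative only.

async function scrapeTitle(url: string): Promise<string | null> {
  const res = await fetch(url, {
    headers: { "User-Agent": "example-scraper/0.1" }, // identify the bot
  });
  const html = await res.text();
  // Pull the page title out of the markup; real scrapers use an HTML parser.
  const match = html.match(/<title>(.*?)<\/title>/i);
  return match ? match[1] : null;
}

scrapeTitle("https://example.com").then((title) => console.log(title));
```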

How to train your dragon

For instance, Reddit’s boss and co-founder Steve Huffman told The New York Times in April that he found it unacceptable that AI companies such as OpenAI, the firm that created the viral ChatGPT bot, have been scraping huge amounts of Reddit data to train their systems – all for free.

“The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free,” said Huffman, who then angered thousands of popular online communities by deciding to monetize access to the site’s data.

Google also faces a class-action lawsuit, filed soon after the tech giant updated its privacy policy to allow data scraping for AI training purposes. In yet another lawsuit, OpenAI is accused of using copyrighted books without permission to train its AI systems.

More skirmishes are looming. The New York Times, News Corp, Axel Springer, Dotdash Meredith owner IAC, and other publishers are in the process of forming a coalition to take on AI giants such as Google and ChatGPT creator OpenAI, Semafor reports.


The issue here is slightly different, as online content allegedly used for training AI models is often copyrighted. But publishers are determined not to repeat what many see as the mistakes of the social media era when they gave away their content for free.

How can companies avoid ending up in Twitter's situation and protect their websites from bad actors or competitors scraping and stealing their data?

Twitter tries to defend against data scraping. Image by Shutterstock.

To find the (possible) answer, Cybernews had a chat with Dan Pinto, co-founder and chief executive of Fingerprint, a device intelligence platform that provides firms with real-time data about visitor intent.

Fingerprint says that it’s not only generative AI models that can scrape companies’ data for training. Bad actors or competitors could also steal the material and use it for nefarious purposes.

How to fight the dragon

According to Pinto, companies can implement, for instance, web application firewalls (WAFs) and block IP ranges, countries, and data centers known to host scrapers. CAPTCHA challenges may also be applied.
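
As a rough illustration of what IP-range blocking and rate limiting look like in practice, here is a hedged sketch written as Express middleware in TypeScript. The blocked prefixes are placeholders (IETF documentation ranges, not real scraper hosts), prefix matching stands in for proper CIDR math, and the fixed-window counter is a deliberately naive substitute for a commercial WAF rule set.

```typescript
import express from "express";

// Hypothetical prefixes standing in for datacenter ranges known to host
// scrapers. These are IETF documentation addresses, not real scraper hosts.
const BLOCKED_PREFIXES = ["203.0.113.", "198.51.100."];

// Naive fixed-window rate limiter state: requests per IP per minute.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 120;
const hits = new Map<string, { count: number; windowStart: number }>();

const app = express();

app.use((req, res, next) => {
  const ip = req.ip ?? "";

  // 1. Outright block traffic from ranges we never expect real users in.
  if (BLOCKED_PREFIXES.some((prefix) => ip.startsWith(prefix))) {
    return res.status(403).send("Forbidden");
  }

  // 2. Throttle anything faster than a human plausibly browses.
  const now = Date.now();
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
  } else if (++entry.count > MAX_REQUESTS) {
    return res.status(429).send("Too Many Requests");
  }

  next();
});

app.get("/", (_req, res) => {
  res.send("content worth scraping");
});

app.listen(3000);
```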

Unsurprisingly, Pinto considers device intelligence platforms the best solution, even though he cautiously adds: “With data scraping, you can never prevent 100% of the attempts. Your goal is to increase the difficulty level for scrapers to the correct level for your business.”

CAPTCHA challenges offer more serious protection than WAFs, Pinto explains. But there’s also a risk of annoying legitimate customers with CAPTCHAs if the detection method that triggers them isn’t accurate enough.

“Device intelligence solutions collect browser data leaked by bots, such as errors, network overrides, browser attribute inconsistencies, and API (application programming interface) changes, to reliably distinguish real users from headless browsers, automation tools, and plugins commonly used for scraping,” said Pinto.

“This means that your legitimate users won’t be annoyed by CAPTCHAs while you catch a significant number of data scrapers.”
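
The “leaked” browser data Pinto describes can start with a few simple attribute checks. The browser-side TypeScript sketch below collects a handful of well-known headless-browser signals; the endpoint name is hypothetical, and a production device intelligence product combines hundreds of such signals with server-side scoring rather than trusting any one of them.

```typescript
// Browser-side sketch: a few well-known signals that "leak" from bots.
// "/api/bot-signals" is a hypothetical endpoint; a real product combines
// hundreds of signals and scores them server-side.

interface BotSignals {
  webdriver: boolean; // true in Selenium/Puppeteer-controlled browsers
  headlessUA: boolean; // user agent admits to being HeadlessChrome
  pluginsEmpty: boolean; // headless browsers often expose zero plugins
  inconsistentChrome: boolean; // UA claims Chrome, but window.chrome is missing
}

function collectBotSignals(): BotSignals {
  const ua = navigator.userAgent;
  return {
    webdriver: navigator.webdriver === true,
    headlessUA: /HeadlessChrome/.test(ua),
    pluginsEmpty: navigator.plugins.length === 0,
    inconsistentChrome: /Chrome/.test(ua) && !("chrome" in window),
  };
}

// Ship the signals to the backend for scoring alongside other visitor data.
fetch("/api/bot-signals", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(collectBotSignals()),
});
```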

Pinto said it was a game of cat and mouse: bad actors change how their bots behave, and services like Fingerprint update the techniques they use to detect them.

The cases of Twitter and Reddit don’t seem too complicated to Pinto, by the way. Like many third-party app developers, he argues that the companies should keep their APIs open and charge appropriate prices, while at the same time making unauthorized scraping very challenging.

“This will reduce the number of data scrapers which went up for Twitter when they shut off their APIs and it appears to have gone up for Reddit based on recent API changes,” he said.

Besides, larger companies usually have robust, highly skilled teams capable of developing complex and customized security measures – although the success of data scraping lawsuits is unclear.

Maybe the dragon isn’t your enemy?

That’s because web or data scraping isn’t actually illegal. First, ordinary data scraping can help businesses, including, of course, AI startups, grow much faster.

Pinto himself used to work on a search engine for used machinery and used crawling – data mining from different web sources – to collect information on the machinery available for sale online. He viewed crawling as ethical “because it helped both equipment buyers and sellers to complete many more transactions than before.”
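
What separates that kind of crawling from abusive scraping is often whether the bot honors a site's consent signals, chiefly robots.txt. As a sketch, assuming a Node 18+ runtime, a "polite" crawler might check the file before fetching anything. The parsing here is deliberately naive: it applies every Disallow line to every user agent and ignores Allow rules, wildcards, and Crawl-delay, all of which real crawlers honor.

```typescript
// "Polite" crawler sketch: consult robots.txt before fetching a page.
// Deliberately simplified robots.txt handling; see the caveats above.

async function isAllowed(origin: string, path: string): Promise<boolean> {
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt is conventionally read as crawlable
  const lines = (await res.text()).split("\n");
  const disallowed = lines
    .filter((line) => line.trim().toLowerCase().startsWith("disallow:"))
    .map((line) => line.trim().slice("disallow:".length).trim());
  return !disallowed.some((rule) => rule !== "" && path.startsWith(rule));
}

async function crawl(origin: string, path: string): Promise<string | null> {
  if (!(await isAllowed(origin, path))) return null; // respect the site's wishes
  const res = await fetch(origin + path, {
    headers: { "User-Agent": "example-crawler/0.1" }, // identify the bot honestly
  });
  return res.text();
}

// example.com and /listings are placeholders.
crawl("https://example.com", "/listings").then((html) =>
  console.log(html === null ? "disallowed by robots.txt" : "fetched page"),
);
```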

Second, scraping only becomes a problem when non-public data gets extracted. At that point, the purpose of the action doesn’t even matter: it’s not scraping anymore, it’s theft.


“Regulations, policies, and even best practices are still being figured out, but recent rulings have pointed towards if information is available in the open it should be accessible to bots,” Pinto told Cybernews. “This points again to focusing on making scraping difficult instead of depending on lawsuits.”

However, as the cases of Twitter and Reddit clearly show, quite a few large websites aren’t really crazy about getting scraped. And neither are individual Google users, for that matter.

Even if there’s not much that can be done in a court of law, it’s quite clear that crawled data is helping someone else get rich: AI tools are, and will continue to be, very valuable. At a basic level, search engines offer website owners an exchange: let us scrape your pages, and we’ll send traffic your way. With AI training, no such bargain is on the table.

“AI scraping is more morally gray than search engines because the value does not flow back to the original creator of the content. The more disconnected the flow of value from the original author, the more unethical the data scraping is,” Pinto explained.

“Regardless of the ethics, businesses should protect themselves from data scrapers that they don’t want to share data with. It’s up to every business how open they want their data to be, and we can be their partner in that endeavor.”