Reddit blocks Internet Archive to stop unauthorized AI data scraping


The Internet Archive (IA) is currently in talks with Reddit after the latter restricted the archive’s ability to save content posted on the platform.

The move came after Reddit discovered that some AI companies had been collecting its data indirectly, by scraping Reddit content archived by the IA.

Data scraping has long been a contentious issue, with companies on both sides of the debate. On the one hand, companies often use scraping to collect large amounts of data that is later used for purposes such as marketing or training AI models.

On the other hand, companies whose data is frequently scraped, such as Reddit, have set up rules against unauthorized scraping to protect user privacy and control how their data is used.

The IA, best known for its Wayback Machine tool, stores snapshots of web pages as part of its goal of archiving the history of the internet. From Reddit, it has collected popular posts, comments, community discussions, and user profiles.

Now that the restrictions are being enforced, the archive will only save screenshots of Reddit’s homepage – no more full pages and comments. This limits its ability to serve as a comprehensive backup of Reddit content, especially for deleted posts or detailed user activity.


Reddit has not publicly disclosed the AI companies involved in the scraping via the Internet Archive, but the company's spokesperson, Tim Rathschmidt, acknowledged to Ars Technica that Reddit is aware of violations where AI firms scrape data indirectly from archived content.

Reddit suggests that the Internet Archive take additional measures to prevent this unauthorized scraping, which could presumably lead to the lifting of some restrictions in the future.

Reddit also cited user privacy concerns as a reason for the restrictions. The Wayback Machine archives content that users have deleted, which goes against Reddit’s rules on user privacy.

Historically, even though there are plenty of other tools to browse “the old internet,” Reddit users have relied on the IA to find deleted comments or threads. The archive played an important role in preserving Reddit content during major platform changes. For example, in 2023, Reddit introduced API restrictions that resulted in some of the content being deleted.

The Internet Archive has not commented on potential solutions or how the latest restrictions might impact its role as a public web resource. Mark Graham, director of the Wayback Machine, described the relationship with Reddit as “longstanding” and confirmed ongoing discussions.

There is also speculation that Reddit’s move is financially motivated. By limiting free data scraping, Reddit may be encouraging AI companies to enter paid licensing agreements for its data.

This wouldn’t be unprecedented – Reddit has inked similar deals with OpenAI and Google. The Google deal alone was reportedly worth $60 million, with Reddit expecting to earn over $200 million from such partnerships in the coming years.