NYT, USA Today among major news sites blocking the Internet Archive’s Wayback Machine: Why?


The Internet Archive’s Wayback Machine is a great resource for anyone looking for public web content from the very near and nearly ancient digital past. Now, 23 major news sites are blocking the web crawler commonly used for the project. But why?

The Wayback Machine contains more than one trillion archived web pages and is rightly called a treasure trove for journalists, researchers, or simply curious netizens.

But that crawler, ia-archivebot, is blocked by 23 major news sites, including The New York Times, USA Today, and The Guardian.


According to an analysis by the AI-detection startup Originality AI, in total, 241 news sites from nine countries explicitly disallow at least one of the four Internet Archive crawling bots.
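Sites usually declare this kind of block in their robots.txt file, though the article does not say exactly how the analysis measured it. As a rough sketch under that assumption, the Python below checks whether a given crawler is disallowed, using the standard library's `urllib.robotparser`. The sample robots.txt content, the `is_blocked` helper, and the `example.com` URL are made up for illustration; only the crawler name `ia-archivebot` comes from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
# It is not copied from any real news site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia-archivebot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url.

    Hypothetical helper: it parses the rules from a string instead of
    downloading them, so the example runs without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia-archivebot", "some-other-bot"):
        status = "blocked" if is_blocked(SAMPLE_ROBOTS_TXT, agent) else "allowed"
        print(f"{agent}: {status}")
```

To check a live site, `RobotFileParser` can fetch the file itself via `set_url()` and `read()`. Note that robots.txt is only a request to crawlers, so a disallow rule shows a site's stated policy rather than a technical barrier.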

Most are owned by USA Today Co., the largest newspaper conglomerate in the US, which operates more than 200 media outlets. The social platform Reddit announced last year that it would block the Internet Archive, too.

[Image: altered Reddit logo made to look sad, with the Internet Archive logo below in white. By Cybernews.]

USA Today Co. spokesperson Lark-Marie Anton told Wired that “this effort is not about specifically blocking the Internet Archive” but instead part of the company’s broader efforts to block all scraping bots.

Similarly, Robert Hahn, The Guardian’s director of business affairs and licensing, pointed the finger at AI companies, which are eager to vacuum up content for free and use it for training large language models.

The Guardian has been talking with the Archive over “concerns over potential misuse by AI companies of content sets crawled for preservation purposes.”

Graham Jones, The New York Times spokesperson, was even more direct, telling Wired: “The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.”


It makes sense: media and social companies don't want the Wayback Machine to become a backdoor for AI firms to access content they're creating or licensing (Reddit signed a deal with OpenAI in 2024).


Indeed, an analysis of Google’s C4 dataset by The Washington Post in 2023 showed that the Internet Archive was among the websites in the training data used to build Google’s T5 model and Meta’s Llama models.

Some individual reporters are pushing back and organizing. About a month ago, Fight for the Future, the Electronic Frontier Foundation, and Public Knowledge posted an open letter thanking the Internet Archive for its preservation of news and history at a moment when major outlets are reconsidering their relationship with the project.


Signatories range from television anchor Rachel Maddow to independent reporters like Kat Tenbarge and Taylor Lorenz.

“In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history,” the letter reads.

“With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism’s record increasingly falls to the Internet Archive.”

