Common Crawl removes over 2M news articles from its AI training dataset

At the request of BREIN, Common Crawl has removed over two million news articles belonging to popular Dutch news outlets from its AI training dataset.
According to BREIN, a Dutch non-profit foundation uniting authors, performing artists, publishers, producers, and distributors of music, films, series, books, images, games, and interactive software to combat piracy in the Netherlands, these articles were copied without permission and used to train generative AI models.
Common Crawl is an American non-profit organization that crawls the internet and makes the resulting archive available to the public as a dataset. Tech companies use this dataset as training data for their generative AI models, including Apple’s OpenELM, Microsoft’s Phi, OpenAI’s ChatGPT, NVIDIA’s NeMo Megatron, DeepSeek’s DeepSeek V3, and Anthropic’s Claude.
BREIN claims that Common Crawl’s archive consists of petabytes of mostly copyrighted works, including news articles that the organization has been collecting since 2008. Additionally, Common Crawl adds newly published content from the internet to its archive each month.
According to BREIN, Common Crawl’s database includes articles that have been published on well-known Dutch news websites and digital papers. However, no permission has ever been given to authorize the scraping of this content.
On behalf of several Dutch news publishers, BREIN has requested that Common Crawl remove these articles from its database so that tech companies can no longer train their generative AI models with this illegally obtained content.
Common Crawl has complied with BREIN’s request and removed over two million articles from its archive.
Bastiaan van Ramshorst, CEO of BREIN, is happy with the outcome.
“The large-scale unauthorized use of protected works to train generative artificial intelligence models is a massive copyright infringement,” he says in a statement.
NDP Nieuwsmedia, the Dutch umbrella organization for news companies, welcomes BREIN’s action against illegal scraping.
BREIN frequently takes action against AI companies that rely heavily on copyrighted material. Similar disputes are playing out in the United States, where The New York Times previously sued OpenAI and Microsoft for using the newspaper’s articles without permission to train chatbots.