
The AI revolution’s aftershocks are being felt in more ways than one.
The Wikimedia Foundation has spotted a worrying trend in recent months. It says multimedia downloads from Wikimedia Commons have swallowed up 50% more bandwidth since January 2024, a surge the organisation blamed squarely on artificial intelligence (AI) bots that ignore or sidestep robots.txt rules.
They’re not alone. Open‑source documentation service Read the Docs found one errant crawler pulled 73 terabytes of HTML from their website in May 2024, leaving a $5,000 bill for excess traffic before engineers could block it.
This isn’t just a handful of cases. In a mid‑2024 report, Akamai found that bots now account for 42% of all web traffic, with nearly two‑thirds of them classified as malicious scrapers rather than search‑engine indexers.
The bots scurrying across the web are so hungry because of AI’s insatiable appetite for training data. Modern large language models and image generators rely on gigantic, mostly unlicensed corpora scraped from public URLs. Crawlers such as GPTBot, ClaudeBot, and Bytespider methodically copy every reachable page, because even niche material can improve a model’s answers.
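A site’s first, politest line of defence against these crawlers is robots.txt – the very rules the Wikimedia Foundation says the bots ignore or sidestep. A minimal file asking the three crawlers named above to stay away entirely would look like this (the user‑agent tokens are the ones the crawler operators publish; compliance is voluntary):

```text
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /
```

Because nothing enforces these rules, a crawler that chooses to ignore them sees no obstacle at all – which is exactly why the heavier tools described below exist.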

Turning the screw on site operators
For site operators on usage‑priced clouds, this turns free exposure into a hidden tax: each terabyte a bot hauls away can cost up to $9 in egress fees, and the concurrent requests can drown caches or databases built for human traffic levels, not automated systems.
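The arithmetic of that hidden tax is simple but brutal. A back‑of‑the‑envelope sketch, using the $9‑per‑terabyte upper bound cited above (real cloud egress pricing is tiered and varies by provider):

```python
# Back-of-the-envelope egress cost for bot traffic.
# The rate is the upper-bound figure cited in the article; real cloud
# pricing is tiered, so treat this as illustrative only.
EGRESS_RATE_PER_TB = 9.0  # USD per terabyte

def egress_cost(terabytes: float, rate_per_tb: float = EGRESS_RATE_PER_TB) -> float:
    """Cost in USD of serving `terabytes` of traffic at a flat egress rate."""
    return terabytes * rate_per_tb

# Scaled to the Read the Docs incident: ~73 TB pulled by a single crawler.
print(f"73 TB at $9/TB = ${egress_cost(73):,.0f}")  # prints "73 TB at $9/TB = $657"
```

Even at that flat rate a single runaway crawler costs hundreds of dollars – and Read the Docs’ actual bill ran far higher, because excess‑traffic charges on many plans exceed raw egress rates.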
Hosting providers are fighting back – or at a minimum, offering their customers weapons they can deploy to bash the bots if they want. Anti‑AI controls meant to restore leverage to content owners are now commonplace. In July 2024, Cloudflare added a dashboard toggle that blocks more than two dozen known AI user‑agents across all plan tiers. Notably, it was included in the free tier, suggesting this is a must‑have in the new world of the web rather than a premium product. The company described it as an “easy button” for customers who do not want their prose or pictures “fed into someone else’s model.”
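Under the hood, a toggle like this amounts to matching the User‑Agent header against a blocklist and refusing service. A minimal sketch of the idea – the token list below is a short illustrative subset, not Cloudflare’s actual list:

```python
# Minimal user-agent blocking, the technique behind provider "block AI bots"
# toggles. The token list is an illustrative subset, not any vendor's real list.
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

def is_ai_bot(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI-crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)

def handle_request(user_agent: str) -> int:
    """Return an HTTP status code: 403 for known AI bots, 200 otherwise."""
    return 403 if is_ai_bot(user_agent) else 200

print(handle_request("Mozilla/5.0 (compatible; GPTBot/1.1)"))      # prints 403
print(handle_request("Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"))  # prints 200
```

The weakness is obvious: a scraper that lies about its User‑Agent sails straight through, which is why providers also fingerprint behaviour rather than trusting the header alone.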
Cloudflare has developed other tools designed to beat back the scourge of bots. With its AI Audit & Control feature, webmasters can do more than simply deny access.
It shows publishers which models have fetched which URLs, how often, and from where, then lets them throttle or selectively license those bots. The company pitches this level of analytics not just as monitoring, but as a bargaining chip for organisations negotiating paid data‑licensing deals with model providers.
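In essence, this kind of audit is access‑log aggregation: tally fetches per crawler per URL. A toy version of the idea – the log format and bot tokens here are assumptions for illustration, not Cloudflare’s implementation:

```python
# Toy per-crawler analytics: aggregate access-log lines into
# (bot, URL) fetch counts. Log format and bot tokens are assumptions
# for illustration only.
from collections import Counter

AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider")

def audit(log_lines):
    """Map (bot_name, path) -> number of fetches, counting AI bots only."""
    counts = Counter()
    for line in log_lines:
        # Assumed format: "<path> <user-agent string>"
        path, _, user_agent = line.partition(" ")
        for token in AI_BOT_TOKENS:
            if token in user_agent:
                counts[(token, path)] += 1
    return counts

logs = [
    "/wiki/Commons GPTBot/1.1",
    "/wiki/Commons GPTBot/1.1",
    "/docs/index ClaudeBot/1.0",
    "/docs/index Mozilla/5.0 Firefox/126.0",  # human traffic, ignored
]
counts = audit(logs)
print(counts[("GPTBot", "/wiki/Commons")])  # prints 2
```

Armed with per‑bot counts like these, a publisher can put a number on what each model provider is taking – which is precisely what makes the data useful in a licensing negotiation.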
Sabotage and more
In March 2025, Cloudflare went a step further with AI Labyrinth: an opt‑in honeypot that quietly inserts paths of AI‑generated decoy pages into a site. The tool is designed so misbehaving crawlers wander the maze, burning compute on nonsense text while revealing new fingerprints to Cloudflare’s detection systems.
Human visitors never see the trap, but Cloudflare gets extra intelligence on how to avoid future incursions.
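The labyrinth idea can be sketched in a few lines: every decoy page is generated on demand from its own URL and links only to more decoy pages, so a crawler that follows them can wander forever. This is a toy illustration of the concept, not Cloudflare’s implementation, which uses AI‑generated filler text and hides the entry links from human visitors:

```python
# Toy honeypot "labyrinth": each decoy page is generated deterministically
# from its path and links only to deeper decoy pages, so a misbehaving
# crawler can wander indefinitely. Illustrative only - not Cloudflare's
# implementation.
import hashlib

def decoy_page(path: str, fanout: int = 3) -> str:
    """Return HTML for a decoy page whose links lead deeper into the maze."""
    seed = hashlib.sha256(path.encode()).hexdigest()  # 64 hex chars
    filler = f"<p>Lorem ipsum {seed[:16]}...</p>"     # stand-in for decoy text
    links = "".join(
        f'<a href="{path}/{seed[i * 8:(i + 1) * 8]}">more</a>'
        for i in range(fanout)
    )
    return f"<html><body>{filler}{links}</body></html>"

page = decoy_page("/maze/entry")
print(page.count("<a href="))  # prints 3: each page offers three deeper links
```

Because the pages are derived deterministically from their paths, nothing needs to be stored: the maze exists only for whoever is foolish enough to crawl it.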

Other providers are offering similar services. Fastly’s AI Bot Management service extends its existing bot platform with four new signals, including verified and suspected “AI Crawler” and “AI Fetcher” categories, so customers can allow reputable bots while bogging down unknown scrapers.
And Akamai has developed its own Firewall for AI, which filters unwanted crawlers, blocks prompt‑injection attacks, and can shunt AI traffic into “pay‑as‑you‑scrape” arrangements.
Combined, these tools suggest the hosting industry is fighting back against the challenges posed by AI scraping. For years, the open web’s posture of openness by default meant anyone could read anything – and anyone could copy it, too.
But that was before the rise of large language models (LLMs) and the eye-watering valuations of the companies behind them. Hosting providers recognise the money involved, and want a slice of the pie, too – or to block the bots entirely.