OpenAI’s crawler is the most blocked AI bot


Websites block OpenAI’s web crawler via robots.txt more often than any other AI bot, even though it isn’t the most active.

All the major chatbot developers need to feed their chatbots with new data to train the large language models underpinning them. One way to get data is to scrape the internet using AI crawlers, which visit various websites and collect data.

To determine which AI crawlers are the most active, web security solutions provider Cloudflare looked at common AI crawler user agents and aggregated the number of requests on its platform from these AI user agents over the last year.
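The kind of aggregation described here can be illustrated with a minimal Python sketch. It is not Cloudflare's actual method: the matching tokens, the `count_ai_crawler_requests` helper, and the sample user-agent strings are all made up for illustration, and real crawler user agents are longer and vary by version.

```python
from collections import Counter

# Assumed substrings for recognising AI crawlers in a User-Agent header.
# The crawler names come from the article; using them as match tokens is an assumption.
AI_CRAWLER_TOKENS = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot"]


def count_ai_crawler_requests(user_agents):
    """Count requests per AI crawler, given one User-Agent string per request."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # attribute each request to at most one crawler
    return counts


# Hypothetical request log: three crawler hits and one ordinary browser.
sample_user_agents = [
    "Mozilla/5.0 (compatible; Bytespider; example)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; example)",
    "Mozilla/5.0 (compatible; Bytespider; example)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
]

# Rank crawlers by request count, most active first.
ranking = count_ai_crawler_requests(sample_user_agents).most_common()
print(ranking)  # [('Bytespider', 2), ('GPTBot', 1)]
```

Matching on a substring of the user agent is the simplest option, but it trusts whatever the client sends, so a bot that disguises its user agent would not be counted.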

According to the company, Bytespider, operated by Chinese company ByteDance, tops the list. The crawler gathers training data for ByteDance’s large language models, including those that support Doubao, its ChatGPT rival.

Amazonbot, which indexes content for Alexa’s question-answering, is the second most active AI crawler in terms of requests. It is followed by ClaudeBot, which is used to train the Claude chatbot. OpenAI’s GPTBot comes in fourth place.

“Among the top AI bots that we see, Bytespider not only leads in terms of the number of requests but also in both the extent of its internet property crawling and the frequency with which it is blocked. Following closely is GPTBot, which ranks second in both crawling and being blocked,” Cloudflare writes in a blog post.

The company says that many customers are likely not aware of the more popular AI crawlers actively crawling their sites.

This conclusion is based on an analysis of robots.txt entries across the top 10,000 internet domains, which Cloudflare used to identify the AI bots that site owners most commonly act on. The company then looked at how frequently it saw these bots on sites protected by Cloudflare.

The data shows that customers most often reference GPTBot, CCBot, and Google in robots.txt, but do not specifically disallow popular AI crawlers like Bytespider and ClaudeBot.

In June, AI bots accessed around 39% of the top one million internet properties using Cloudflare, but the company notes that only 2.98% of these properties took measures to block or challenge those requests.

Moreover, the more popular an internet property is, the more likely it is to be targeted by AI bots, and, correspondingly, the more likely it is to block such requests.

Cloudflare says it sees website operators completely blocking access to these AI crawlers using robots.txt, a file of instructions that tells bots which web pages they can and cannot access.

However, these blocks only work if the bot operator respects robots.txt and adheres to the specific rules it sets out.
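To make this concrete, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a hypothetical robots.txt that disallows GPTBot and CCBot but names no other AI crawler, which mirrors the pattern Cloudflare describes. The file contents, the `is_allowed` helper, and the example URL are assumptions for illustration. The check only models what a well-behaved bot would conclude, and nothing here stops a crawler that ignores the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks two named AI crawlers, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant bot with this user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/article"  # placeholder URL
results = {
    bot: is_allowed(ROBOTS_TXT, bot, url)
    for bot in ["GPTBot", "CCBot", "Bytespider", "ClaudeBot"]
}

for bot, allowed in results.items():
    print(f"{bot}: {'allowed' if allowed else 'disallowed'}")
```

With this file, GPTBot and CCBot are disallowed, while Bytespider and ClaudeBot fall through to the wildcard rule and remain allowed, because they are never named explicitly.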