How To Block AI and LLM agents, bots and scrapers
  • Published: February 5, 2025

AI tools like ChatGPT use Large Language Models (LLMs) as the technical foundation of their services. Training these LLMs requires huge amounts of data. This data is often scraped from the public internet, possibly including content from your website.

Additionally, new AI agents such as ChatGPT Operator can autonomously operate a web browser to automate certain tasks. Such AI agents, AI-powered scrapers and bots can mimic human browsing behavior, bypass traditional security measures such as CAPTCHAs, and extract large volumes of data with ease.

In this blog post, you will learn how to:

  • Block AI and LLM data scrapers: These bots pull huge amounts of content from the public internet, using your content to train their AI models.
  • Block AI agents and AI-powered scrapers and web crawlers: Unlike traditional bots and scraping tools, these AI-powered web crawlers can mimic human browsing patterns and bypass traditional bot traps such as CAPTCHAs.

Block known AI and LLM User-Agents

Some AI and LLM data crawlers identify themselves to the web server via the User-Agent HTTP header. Some of the known LLM and AI User-Agents are:

  • GPTBot
  • CCBot
  • ByteSpider
  • Google-Extended
  • ClaudeBot
  • anthropic-ai
  • PerplexityBot

The above list contains some of the most well-known AI and LLM bots, including OpenAI's GPTBot (used for ChatGPT), Anthropic's Claude crawlers, ByteDance's Bytespider and the Common Crawl bot. You can block these User-Agents from crawling your site by listing each of them with a Disallow rule in your robots.txt file, as shown below.
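For example, a robots.txt that blocks all of the crawlers listed above from your entire site would look like this (each crawler gets its own User-agent line, and Disallow: / covers all paths):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

Place this file at the root of your domain (e.g. https://example.com/robots.txt) so that compliant crawlers can find it.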

⚠️ This method is not 100% secure and only works with known AI bots. Other LLM data scrapers may not identify themselves and may instead pretend to be regular human traffic. Adding these User-Agents to your robots.txt file therefore only stops known crawlers that actually follow the robots.txt rules.

Block datacenter traffic using Focsec

AI and LLM crawlers are typically hosted in data centers or with cloud providers such as AWS or Azure. This means that when they access your website to crawl and extract data, the requests will come from datacenter IP addresses. Our Focsec API allows you to easily detect datacenter IPs.

Regular users, in contrast, usually connect from residential IP addresses, so a datacenter IP is a strong signal that a request is automated. Focsec's is_datacenter flag tells you whether an incoming HTTP request originates from a datacenter IP address, as sketched in the example below.
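Here is a minimal sketch of how such a check could look in practice. The endpoint URL and response fields below are assumptions for illustration; consult the Focsec API documentation for the exact request and response format:

```python
import json
import urllib.request

API_KEY = "YOUR_FOCSEC_API_KEY"  # placeholder, replace with your own key

def is_datacenter_ip(ip: str) -> bool:
    """Ask the Focsec API whether `ip` belongs to a datacenter.

    The endpoint shape used here is an assumption for illustration;
    check the Focsec docs for the exact URL and response fields.
    """
    url = f"https://api.focsec.com/v1/ip/{ip}?api_key={API_KEY}"
    with urllib.request.urlopen(url, timeout=5) as response:
        data = json.load(response)
    return bool(data.get("is_datacenter", False))

# Block or challenge the request if it comes from a datacenter IP.
if is_datacenter_ip("203.0.113.42"):
    print("Datacenter IP detected: block or serve a challenge")
```

You would typically run a check like this in your backend or edge middleware before serving the full page, and cache the result per IP to save API calls.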

Rate limiting crawlers

Rate limiting is an effective way to control the number of requests a client can make to your website within a specific timeframe. LLM and AI data crawlers usually request pages much faster than a human would browse your site, so enforcing HTTP rate limiting can help to keep bots out (see the nginx sketch after this list). Some best practices include:

  • Setting hourly and daily total request limits
  • Applying stricter request limits to anonymous traffic that is not logged in
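As a sketch, per-IP rate limiting in nginx could look like this (the zone name, rate and burst values are illustrative; tune them to your traffic):

```nginx
# In the http {} context: track clients by IP address and
# allow on average 60 requests per minute per IP.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=60r/m;

server {
    listen 80;
    server_name example.com;

    location / {
        # Permit short bursts, then reject excess requests with HTTP 429.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:8080;  # your application backend
    }
}
```

Note that nginx's limit_req enforces short-term rates; hourly or daily total limits are usually easier to implement in your application or at a CDN/WAF layer.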

Summary: Protect your website from AI agents and LLM data scrapers

Whether you want to prevent your content from being scraped for LLM training data, or you want to stop AI-powered bots from performing automated tasks on your site, countering AI-powered web scraping requires a comprehensive approach. You'll need to apply a variety of security precautions, some of which we have outlined in this post.

Start with basic measures like extending your robots.txt, adding datacenter IP detection with the Focsec API, and applying rate limiting. Traditional methods of blocking bots, such as CAPTCHAs, can still be used, but they are becoming less effective as the quality of AI scrapers constantly improves.

Want to detect VPNs, Proxies, TOR, Bots, Hackers and more?

The Focsec API detects VPNs, TOR, Proxies and Bots. It helps prevent fraud and protect against suspicious logins and attacks.
