How to Prevent LLM Bots from Scraping Your Website for Training Data

  • Published: January 31, 2025

Are you worried about your content ending up in AI services like ChatGPT? You should be!

To improve their AI models, the companies behind these services use web crawlers to scrape vast amounts of data from public websites. The scraped data then serves as training data for the large language models (LLMs) that power these AI services.

Almost every public website is being crawled to gather more training data, yet there is no official opt-out process to prevent this kind of data harvesting.

This large-scale web crawling by AI and LLM data crawlers raises concerns about unauthorized content harvesting, data privacy, and copyright issues. We will explain what website operators can do to prevent their content and data from being used for future AI models.

Block known LLM scraper bots

Some AI companies have defined official user agent identifiers for their data crawling bots. In this case, the bots can be instructed not to crawl the content by defining rules in your site's robots.txt file. Here is an example that will block the crawlers from OpenAI ChatGPT, Anthropic Claude, Perplexity and other known LLM scraper bots:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PanguBot
Disallow: /

User-agent: Bytespider
Disallow: /
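
Before deploying rules like these, it can help to confirm that they parse the way you expect. The following is a minimal sketch using Python's standard urllib.robotparser module. The shortened rule set, the helper name blocked_agents and the example.com URL are assumptions made for illustration. The check only shows how a compliant parser reads the file; it does not prove that any particular bot obeys it.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above, inlined so the check runs offline.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True} for every agent the rules forbid from fetching url.

    blocked_agents is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "CCBot", "Mozilla/5.0"])
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

The listed bots should be reported as blocked, while a browser-style user agent that matches none of the rules stays allowed.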

Block AI scrapers by checking their IP address

The reality is that not all AI companies are disclosing their scraping bots. Your content can still be scraped by scrapers that do not identify themselves as bots and pretend to be regular, human web traffic.

To stop these anonymous scrapers, which blend in with regular traffic, you will need to apply additional security measures such as IP blocking.

LLM crawlers generally operate from datacenters. These datacenters have specific IP address blocks that are different from residential IP address blocks used by regular users. Using Focsec’s is_datacenter detection feature, you can detect and block traffic from datacenters, hosting providers and public cloud providers that LLM data scrapers typically use. This prevents unauthorized scrapers from harvesting your content while allowing legitimate human visitors.

The Focsec API allows you to check if an IP address belongs to a known datacenter, cloud or hosting provider. We also offer a daily updated database containing a comprehensive collection of datacenter IP addresses worldwide, allowing you to easily block these potential bot IPs.
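
To illustrate the general idea of matching a visitor's IP address against a list of datacenter ranges, here is a minimal sketch using Python's standard ipaddress module. It does not use the Focsec API or database. The CIDR ranges are reserved documentation blocks used as placeholders, and the names DATACENTER_CIDRS and is_datacenter_ip are hypothetical. In practice the list would be loaded from a regularly updated source.

```python
import ipaddress

# Placeholder ranges for illustration only (reserved documentation blocks),
# NOT real datacenter data. A real list would come from an updated database.
DATACENTER_CIDRS = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]

# Parse the ranges once at startup instead of on every request.
NETWORKS = [ipaddress.ip_network(cidr) for cidr in DATACENTER_CIDRS]


def is_datacenter_ip(ip):
    """Return True if ip falls inside one of the listed ranges.

    Malformed input returns False; whether to allow or reject such
    requests is a policy choice left to the site operator.
    """
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    # Only compare IPv4 with IPv4 ranges and IPv6 with IPv6 ranges.
    return any(addr.version == net.version and addr in net for net in NETWORKS)


if __name__ == "__main__":
    for candidate in ["192.0.2.17", "203.0.113.5", "2001:db8::1"]:
        print(candidate, is_datacenter_ip(candidate))
```

A request handler could call a check like this on the client address and answer with an HTTP 403 response when it returns True. A linear scan is fine for a short list; a very large range list would likely need a faster lookup structure.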
