AI Crawlers and robots.txt: GPTBot, Google-Extended, ClaudeBot

GPTBot, ClaudeBot, Google-Extended, and a growing list of AI training crawlers now require active robots.txt management. Here's every major AI crawler and its user agent, how to block them selectively, the distinction between blocking Google-Extended vs Googlebot, and what "respect robots.txt" actually means in practice.

AI training crawlers are now a significant source of web traffic, and robots.txt is the first-line tool for managing them

GPTBot, Google-Extended, ClaudeBot, CCBot, and a growing list of AI crawler user agents represent a category of web traffic that has grown sharply since 2022. Most of these crawlers harvest web content to train large language models. They consume bandwidth, they may use your content in ways you haven't consented to, and they can put significant load on servers if not managed.

Robots.txt, which has been primarily an SEO tool for controlling Google, Bing, and other search crawlers, now serves a second function: managing AI training crawler access.

The major AI crawlers and their user agents

Organisation	Crawler	User Agent
OpenAI (ChatGPT)	GPTBot	`GPTBot`
OpenAI (user-initiated browsing)	ChatGPT-User	`ChatGPT-User`
Google (AI training)	Google-Extended	`Google-Extended`
Anthropic (Claude)	ClaudeBot	`ClaudeBot`
Common Crawl (dataset)	CCBot	`CCBot`
Meta (AI training)	FacebookBot	`FacebookBot`
Perplexity AI	PerplexityBot	`PerplexityBot`
Apple (AI)	Applebot-Extended	`Applebot-Extended`
Cohere	cohere-ai	`cohere-ai`
Amazon (Alexa)	Amazonbot	`Amazonbot`

This list changes as new AI companies emerge and existing companies add crawlers. Monitoring access logs for unfamiliar user agents is the most reliable way to keep up.
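
As a rough illustration of that kind of log monitoring, the sketch below tallies bot-like user agents that are not on a known list. It assumes the combined log format (user agent as the last quoted field). The `KNOWN_AGENTS` list, the `unfamiliar_bots` helper, the keyword heuristic, and the sample log lines are all hypothetical and would need adapting to a real site.

```python
import re
from collections import Counter

# Tokens from the table above plus two search crawlers (assumed list; extend as needed).
KNOWN_AGENTS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot", "CCBot",
    "FacebookBot", "PerplexityBot", "Applebot-Extended", "cohere-ai",
    "Amazonbot", "Googlebot", "bingbot",
]

# Assumption: combined log format, where the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Assumption: a crude keyword heuristic for "looks like an automated client".
BOT_HINT = re.compile(r"bot|crawl|spider", re.IGNORECASE)


def unfamiliar_bots(log_lines):
    """Count bot-like user agents that match none of KNOWN_AGENTS."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1)
        if not BOT_HINT.search(user_agent):
            continue  # probably a regular browser
        if any(known.lower() in user_agent.lower() for known in KNOWN_AGENTS):
            continue  # already on the known list
        counts[user_agent] += 1
    return counts


# Made-up log lines for illustration only.
sample = [
    '203.0.113.5 - - [10/Mar/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.9 - - [10/Mar/2024:10:00:05 +0000] "GET /a HTTP/1.1" 200 512 "-" "ExampleCrawler/0.1"',
    '203.0.113.7 - - [10/Mar/2024:10:00:09 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/124.0"',
]
for agent, hits in unfamiliar_bots(sample).most_common():
    print(hits, agent)
```

Anything the script reports is only a candidate: the user agent string still has to be looked up before deciding whether it deserves its own robots.txt group.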

Blocking AI crawlers in robots.txt

The robots.txt syntax is identical to blocking any crawler. To block all AI training crawlers individually:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: cohere-ai
Disallow: /
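
Because the list of agents keeps changing, it can be easier to generate these groups than to maintain them by hand. The following is a minimal sketch, not a definitive tool: `build_block_rules` and its `compact` option are hypothetical names. The compact form relies on the fact that the Robots Exclusion Protocol (RFC 9309) lets several `User-agent` lines share one group of rules.

```python
def build_block_rules(agents, path="/", compact=False):
    """Return robots.txt text that disallows `path` for each agent token."""
    if compact:
        # One group: several User-agent lines sharing the same rule.
        header = "\n".join(f"User-agent: {agent}" for agent in agents)
        return f"{header}\nDisallow: {path}\n"
    # One group per agent, separated by blank lines, as in the listing above.
    groups = [f"User-agent: {agent}\nDisallow: {path}" for agent in agents]
    return "\n\n".join(groups) + "\n"


AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot", "CCBot",
    "FacebookBot", "PerplexityBot", "Applebot-Extended", "Amazonbot", "cohere-ai",
]

print(build_block_rules(AI_AGENTS))                # same layout as the listing above
print(build_block_rules(AI_AGENTS, compact=True))  # shorter, equivalent rules
```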

Blocking all crawlers except specific ones:

# Allow Google search indexing
User-agent: Googlebot
Allow: /

# Allow Bing indexing
User-agent: Bingbot
Allow: /

# Block all others (including unknown AI crawlers)
User-agent: *
Disallow: /

This approach is the most restrictive: it blocks any crawler not explicitly allowed, including other search engines and link-preview bots you have not listed.
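
An allowlist like this is easy to get wrong, so it is worth testing before deploying. The sketch below feeds the rules above to Python's standard `urllib.robotparser` and reports which agents may fetch a URL. The `allowed_agents` helper and the `SomeNewBot` name are hypothetical, and Python's parser only approximates how any particular crawler interprets the file; it matches on the short agent token, so pass `GPTBot` rather than a full user agent string.

```python
from urllib.robotparser import RobotFileParser

ALLOWLIST_RULES = """\
# Allow Google search indexing
User-agent: Googlebot
Allow: /

# Allow Bing indexing
User-agent: Bingbot
Allow: /

# Block all others (including unknown AI crawlers)
User-agent: *
Disallow: /
"""


def allowed_agents(rules_text, agents, url="/"):
    """Map each agent token to True/False: may it fetch `url` under these rules?"""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ALLOWLIST_RULES,
    ["Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "SomeNewBot"],
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Only the two named search crawlers come back as allowed; every other token, including one the file has never heard of, falls through to the `*` group and is blocked.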

Which AI crawlers respect robots.txt?

Most reputable AI companies have stated they respect robots.txt. OpenAI, Anthropic, and Google have all published documentation indicating their crawlers follow robots.txt directives.

What "respect" means in practice:

Reputable large companies: generally compliant
Smaller, less established crawlers: compliance is less certain
Data scraping operations that misidentify themselves: robots.txt doesn't stop malicious actors

Robots.txt is a voluntary protocol — there's no technical enforcement. A crawler that ignores it can still access your site. Legal enforcement (ToS violations, copyright claims) is the recourse for non-compliant crawlers.

The business and ethical considerations

Arguments for blocking AI training crawlers:

Your content is used to train models that compete with your business (news publishers, creative professionals)
No compensation for your work being used to train commercial AI systems
Content that took significant effort to create is consumed without attribution or agreement
Bandwidth costs from heavy AI crawling

Arguments for allowing AI training crawlers:

AI-powered search (ChatGPT browsing, Perplexity) may drive traffic back to your site — blocking crawlers may reduce discoverability in these systems
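Blocking Google-Extended keeps your content out of Gemini model training and grounding, which may reduce how often Gemini draws on or cites it. Google states that the token does not affect inclusion or ranking in Google Search; AI Overviews are a Search feature and are not controlled by Google-Extended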
The content is publicly accessible; training on publicly accessible data is legally unresolved

A nuanced middle ground:

Distinguish between crawlers used for training (consume content, no traffic return) and crawlers used for real-time search/retrieval (which may send users to your site):

# Block training crawlers
User-agent: GPTBot
Disallow: /

# Allow Perplexity (sends referral traffic)
User-agent: PerplexityBot
Allow: /

Google-Extended: the specific complexity

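Google exposes two separate robots.txt tokens:

Googlebot: the crawler used for Google Search indexing. Blocking it removes you from Google Search entirely
Google-Extended: a control token rather than a separate crawler. It governs whether content Google has already crawled may be used to train Gemini (formerly Bard) models and for grounding in Gemini, and it can be blocked independently. Google states that it does not affect inclusion or ranking in Search, so it does not control AI Overviews, which are a Search feature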

This distinction is significant. Publishers can opt out of AI training use while remaining in search results:

# Still indexed in Google Search
User-agent: Googlebot
Allow: /

# Not used for Gemini training or grounding
User-agent: Google-Extended
Disallow: /
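
Because the two tokens share a prefix, it is reasonable to check that the groups really are evaluated independently. This is a small sketch using Python's standard `urllib.robotparser`, which only approximates Google's own matching; the `/any/page` path is an arbitrary example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for token in ("Googlebot", "Google-Extended", "GPTBot"):
    print(token, parser.can_fetch(token, "/any/page"))
# Googlebot True
# Google-Extended False
# GPTBot True
```

The last line is a reminder that a file with no `*` group leaves every unlisted agent allowed by default.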

Verifying compliance: checking server logs

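After updating robots.txt, check server access logs (or CDN logs) for AI crawler user agents to confirm whether they are respecting the rules. Google-Extended is the exception: it is a control token with no user agent string of its own, so it never appears in logs. For crawlers that do identify themselves: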

grep "GPTBot" /var/log/nginx/access.log | tail -20
grep "ClaudeBot" /var/log/nginx/access.log | wc -l
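
The grep commands show total hits, but compliance is about what happens after the rule change. The sketch below counts requests from one agent after a cutoff time and ignores fetches of `/robots.txt` itself, which a compliant crawler is expected to keep making. It assumes the nginx/Apache combined log format with English month abbreviations; `hits_after`, the cutoff, and the sample lines are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumption: combined log format: [timestamp] "request" ... "user agent" at end of line.
LINE_PATTERN = re.compile(r'\[([^\]]+)\] "([^"]*)".*"([^"]*)"\s*$')


def hits_after(log_lines, agent_token, cutoff):
    """Count requests by `agent_token` at or after `cutoff`, excluding /robots.txt."""
    hits = 0
    for line in log_lines:
        match = LINE_PATTERN.search(line)
        if not match:
            continue  # unexpected format, skip
        raw_time, request, user_agent = match.groups()
        parts = request.split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            continue  # re-reading robots.txt is normal behaviour
        when = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")
        if when >= cutoff and agent_token.lower() in user_agent.lower():
            hits += 1
    return hits


# Made-up log lines; the robots.txt change is assumed to go live at 12:00 UTC.
sample = [
    '203.0.113.5 - - [10/Mar/2024:09:00:00 +0000] "GET /page HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [10/Mar/2024:13:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 90 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [10/Mar/2024:13:05:00 +0000] "GET /page HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
]
cutoff = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)
print(hits_after(sample, "GPTBot", cutoff))
```

A non-zero count shortly after the change is not proof of non-compliance, since RFC 9309 permits crawlers to cache robots.txt (it recommends no longer than 24 hours); a count that stays non-zero for days is a stronger signal.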

If crawling continues after a robots.txt block, first allow a day or so for the crawler to re-fetch the file. If it still continues: contact the operator, review the specific user agent format (some crawlers have variant formats), or consider IP-level blocking as a more definitive measure.
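
IP-level blocking normally lives in a firewall, CDN, or web server configuration, but the matching logic is simple enough to sketch. The example below checks a client address against a list of CIDR ranges using Python's standard `ipaddress` module. The ranges are reserved documentation blocks (RFC 5737), not real crawler ranges, and `is_blocked` is a hypothetical helper; real ranges would have to come from the crawler operator or from your own logs.

```python
import ipaddress

# Hypothetical ranges for illustration only (RFC 5737 documentation blocks).
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_blocked(client_ip, blocked=BLOCKED_RANGES):
    """Return True if `client_ip` falls inside any blocked CIDR range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in blocked)


for ip in ("192.0.2.44", "198.51.100.200", "203.0.113.1"):
    print(ip, "blocked" if is_blocked(ip) else "allowed")
```

Unlike a robots.txt rule, a block like this does not depend on the crawler cooperating, but it has to be kept current as operators change their address ranges.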

How to use the Robots.txt Generator on sadiqbd.com

Configure allow/disallow rules for different user agents
Add AI crawlers — the generator includes common crawler names for easy selection
Set sitemap location — include your sitemap URL for discovery
Generate the file — ready to upload to your domain root

Frequently Asked Questions

Does blocking GPTBot prevent my site from appearing in ChatGPT responses? Not entirely. Blocking GPTBot prevents OpenAI from crawling your site for training data. For user-initiated browsing inside ChatGPT, ChatGPT-User is the relevant user agent. Blocking both stops the two OpenAI agents covered here, which may mean less representation in ChatGPT outputs, but it does not remove content that was collected before the block.

Is there a standard way to declare AI training opt-out beyond robots.txt? The TDM (Text and Data Mining) reservation in copyright law (EU DSM Directive, Article 4) allows rights holders to declare opt-out for TDM. Some organisations are exploring machine-readable opt-out standards similar to robots.txt but specific to AI training. As of 2024, robots.txt is the most universally understood mechanism.

Is the Robots.txt Generator free? Yes — completely free, no sign-up required.

AI training crawlers represent a new category of consideration for robots.txt beyond search engine indexing. The decision to allow or block is a business and values decision; robots.txt is simply the mechanism that communicates it to crawlers that choose to comply.

Try the Robots.txt Generator free at sadiqbd.com — create correctly formatted robots.txt rules for search engines, AI crawlers, and any other bot.