
AI bots in robots.txt: GPTBot, ClaudeBot, PerplexityBot — allow or block?

A complete list of every relevant AI bot in 2026, their role, and when to allow or block each. With concrete robots.txt examples for different strategies.

Written by
Richard van Leeuwen

Founder of Priso. 30+ years of web dev and e-commerce, full-time AI tools since 2022.

6 min read

In 2026 there are at least 14 AI bots you need to know. Some train models, others index for AI search, others fetch in real time during a user's chat. Mix them up and you'll accidentally block your own citations.

Here's the full list, their role, and when to allow or block each.

The 14 bots that matter

OpenAI (3 bots)

| Bot | User-agent | Role |
|---|---|---|
| GPTBot | GPTBot | Training data for future models |
| OAI-SearchBot | OAI-SearchBot | Builds the search index for ChatGPT Search |
| ChatGPT-User | ChatGPT-User | Real-time fetch when a user requests your site |

Anthropic (3 bots)

| Bot | User-agent | Role |
|---|---|---|
| ClaudeBot | ClaudeBot | Training data for Claude models |
| Claude-SearchBot | Claude-SearchBot | Search index for Claude.ai with web access |
| Claude-User | Claude-User | Real-time fetch during a Claude conversation |

Perplexity (2 bots)

| Bot | User-agent | Role |
|---|---|---|
| PerplexityBot | PerplexityBot | Primary crawler for Perplexity's index |
| Perplexity-User | Perplexity-User | Real-time fetch during query |

Google (1 bot, in addition to Googlebot)

| Bot | User-agent | Role |
|---|---|---|
| Google-Extended | Google-Extended | Training for Gemini and Vertex AI (a robots.txt control token honored by Google's existing crawlers, not a separate crawler) |

Other relevant ones

| Bot | User-agent | Role |
|---|---|---|
| CCBot | CCBot | Common Crawl, used by many AI training datasets |
| Bytespider | Bytespider | ByteDance/TikTok AI training |
| Applebot-Extended | Applebot-Extended | Apple Intelligence training |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta AI (Llama) training |
| Amazonbot | Amazonbot | Amazon's AI training |

Pulling apart the three functions

Important to understand: training, indexing, and real-time fetching are three different things.

  • Training = your content is used to train future models. No direct traffic, no citation, long-term impact on model knowledge.
  • Indexing = your content is added to a search index. AI answers cite from there. Direct effect on visibility.
  • Real-time fetch = when someone explicitly asks an AI to read your URL, the bot fetches the page live. Direct effect on how well your content shows up in that answer.

Many sites block all three functions for OpenAI or Anthropic, reasoning "at least they won't get our content as training data". The result: they're also not cited in ChatGPT or Claude answers, because the search and real-time bots are blocked too.
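
If you want to see which of these three functions actually hits your site, a quick pass over your access log tells you. Here's a minimal sketch in Python; the user-agent-to-function mapping follows the tables above, while the log path and format are assumptions for illustration:

```python
from collections import Counter

# User-agent substring -> function, per the tables above.
# (Google-Extended is omitted: it's a robots.txt token, not a
# crawler, so it never appears in access logs.)
BOT_FUNCTIONS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "indexing",
    "Claude-SearchBot": "indexing",
    "PerplexityBot": "indexing",
    "ChatGPT-User": "real-time fetch",
    "Claude-User": "real-time fetch",
    "Perplexity-User": "real-time fetch",
}

counts = Counter()
# Hypothetical log location; any log that records user-agents works.
with open("/var/log/nginx/access.log") as log:
    for line in log:
        for bot, function in BOT_FUNCTIONS.items():
            if bot in line:
                counts[(bot, function)] += 1
                break

for (bot, function), hits in counts.most_common():
    print(f"{bot:18} {function:16} {hits} hits")
```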

Strategy 1: maximum AI visibility

For SaaS, e-commerce, content businesses that want to be cited:

# Allow all AI search and retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Allow training (optional, your call)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

# Block non-compliant scrapers
User-agent: Bytespider
Disallow: /

# Default
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
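
Before shipping a file like this, it's worth a sanity check that it parses the way you intend. A small sketch using Python's standard-library robotparser, testing a trimmed-down version of the groups above (note the stdlib parser applies rules in file order rather than RFC 9309 longest-match order, which doesn't matter for all-or-nothing groups like these):

```python
from urllib.robotparser import RobotFileParser

# Trimmed-down version of the strategy above.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Search bot allowed, non-compliant scraper blocked,
# unknown bots fall through to the * group.
print(parser.can_fetch("OAI-SearchBot", "https://example.com/"))   # True
print(parser.can_fetch("Bytespider", "https://example.com/"))      # False
print(parser.can_fetch("SomeFutureBot", "https://example.com/"))   # True
```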

Strategy 2: visibility without training

Want to be cited but keep training data out? Block training bots, allow search and real-time bots:

# Allow search + retrieval, block training
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Strategy 3: lock everything down

For sites with paid content, sensitive data, or a deliberate "no AI" stance:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Note: this makes you invisible in all AI answers. For most commercial sites, that's expensive. Industry analyses regularly report AI referral traffic converting around 3x better than classic organic. Weigh carefully before locking everything down.

Strategy 4: mixed by path

Selective: leave marketing content open, block login/account pages and paid content:

User-agent: GPTBot
Allow: /
Disallow: /account/
Disallow: /api/
Disallow: /pro/
Disallow: /private/

User-agent: ClaudeBot
Allow: /
Disallow: /account/
Disallow: /api/
Disallow: /pro/

User-agent: PerplexityBot
Allow: /
Disallow: /account/
Disallow: /api/

User-agent: *
Allow: /
Disallow: /api/

This is what we run on flashcards.nl: content pages open, transactional paths closed.
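
A detail that makes this pattern work: per RFC 9309, when multiple rules in a group match a URL, the most specific (longest) matching path wins, with ties going to Allow. That's why Allow: / and Disallow: /account/ can coexist in one group. A minimal checker for that rule, as a sketch that ignores wildcards and percent-encoding:

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules: (directive, path) pairs from a single user-agent group,
    e.g. [("allow", "/"), ("disallow", "/account/")].
    Longest matching path wins (RFC 9309); ties go to allow;
    no matching rule means allowed."""
    best_len = -1
    allowed = True
    for directive, rule_path in rules:
        if path.startswith(rule_path):
            length = len(rule_path)
            if length > best_len or (length == best_len and directive == "allow"):
                best_len = length
                allowed = directive == "allow"
    return allowed

# The GPTBot group from the example above.
group = [("allow", "/"), ("disallow", "/account/"), ("disallow", "/api/"),
         ("disallow", "/pro/"), ("disallow", "/private/")]
print(is_allowed(group, "/pricing"))        # True: "/" is the longest match
print(is_allowed(group, "/account/login"))  # False: "/account/" beats "/"
```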

A note on compliance and stealth crawlers

Not all bots respect robots.txt:

  • Bytespider (ByteDance) — caught multiple times ignoring robots.txt
  • Perplexity — Wired published research in August 2024 on a stealth crawler that bypassed robots.txt; the situation has improved but isn't fully solved in 2026
  • Various scrapers without a clear org behind them

For real-world protection against non-compliant bots you need a server layer:

  • Cloudflare WAF with the "AI Scrapers and Crawlers" category enabled
  • Vercel firewall rules on user-agent
  • Your own rate limiting at IP + user-agent level

Robots.txt is a request, not a wall. For real blocks: server side.
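
What that server layer can look like, as a minimal sketch: a user-agent blocklist plus a naive in-memory rate limiter keyed on IP + user-agent. The blocklist contents and thresholds here are illustrative; a real setup would live in your WAF or share state via something like Redis:

```python
import time
from collections import defaultdict

BLOCKED_AGENTS = ("Bytespider",)   # illustrative blocklist
WINDOW_SECONDS = 60
MAX_REQUESTS = 120                 # illustrative per-window ceiling

_hits: dict[tuple[str, str], list[float]] = defaultdict(list)

def should_block(ip: str, user_agent: str) -> bool:
    """Return True if this request should get a 403/429.
    Known non-compliant agents are refused outright; everything
    else is rate-limited per IP + user-agent in a sliding window."""
    ua = user_agent.lower()
    if any(agent.lower() in ua for agent in BLOCKED_AGENTS):
        return True
    now = time.time()
    key = (ip, user_agent)
    recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
    recent.append(now)
    _hits[key] = recent
    return len(recent) > MAX_REQUESTS
```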

Implications for future legal claims

A nuance often overlooked: if you never blocked an AI bot, it's harder to later claim they used your content without permission. The New York Times v. OpenAI case partly hinges on whether NYT had refused consent. Site owners who explicitly blocked GPTBot have a stronger legal position.

At the same time: blocked = not cited. It's a trade-off. For most commercial sites, visibility outweighs future claim leverage. For publishers with substantial exclusive content, it can go the other way.

How to check your current state

Three checks:

1. Open your robots.txt directly in the browser. Visit https://yourdomain.com/robots.txt and read what's actually there. Often surprising.

2. Check your Cloudflare/CDN settings. Cloudflare's "AI Scrapers" preset blocks many AI bots by default, and many site owners don't know this. Cloudflare Dashboard → Security → Bots → AI Scrapers.

3. Run a Priso scan. We check both your robots.txt and the actual fetch response per AI bot. The gap between what you think is configured and what bots actually see is often bigger than you'd expect.
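
Check 3 is something you can approximate yourself: request the same URL with different AI-bot user-agents and compare status codes. A standard-library sketch; yourdomain.com is a placeholder, and keep in mind that some WAFs verify bot IP ranges, so a spoofed user-agent from your own machine won't always behave exactly like the real bot:

```python
import urllib.error
import urllib.request

URL = "https://yourdomain.com/"  # placeholder
AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot",
          "PerplexityBot", "Mozilla/5.0"]

for agent in AGENTS:
    req = urllib.request.Request(URL, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{agent:15} -> {resp.status}")
    except urllib.error.HTTPError as err:
        # A 403 here while robots.txt says Allow is exactly the gap
        # between configured intent and actual server response.
        print(f"{agent:15} -> {err.code}")
```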

Common mistakes we see

Two recurring patterns:

Pattern 1: an old robots.txt with only User-agent: *. This misses every AI-specific bot. Result: everything is open, including training. Fine for some, not for others.

Pattern 2: accidental blocking via Cloudflare. The site owner thinks "robots.txt allows everything", but Cloudflare's bot preset blocks GPTBot, ClaudeBot, and PerplexityBot via WAF rules. Robots.txt isn't ground truth; the actual server response is.

What we recommend

For 90% of SaaS, e-commerce, and content businesses: strategy 1 or 2. Maximum AI visibility, optionally without training. Traffic and visibility outweigh future claim leverage.

For publishers, paid content, or heavily regulated sectors (legal, medical with paywalled content): strategy 3 or 4.

Decide based on business goals, not on a blog post from a tool vendor. As a default direction though: AI traffic is a growing channel that reportedly converts around 3x better than classic organic. The default should be "open", with explicit blocks where needed.

Check which bots actually reach your site

FAQ

How often should I update robots.txt? Whenever a new AI engine launches, and review it at least quarterly. The AI bot landscape moves fast.

Does User-agent: * work for AI bots? Compliant bots fall back to the * group when no group names them specifically. But a specific group overrides * entirely (rules don't merge), and you usually want different rules per bot anyway, so write each bot out explicitly.

What happens if I forget a new bot? It gets whatever your User-agent: * group says. If that's Allow: /, the new bot gets in; if Disallow: /, it doesn't. Choose that default deliberately.

Can I differ per page? Robots.txt works on URL paths, so individual pages can be targeted with a Disallow rule. For a signal inside the page itself, there's <meta name="robots" content="noai, noimageai"> in your HTML head, but note these directives are non-standard and only honored by some crawlers.


Written by Richard van Leeuwen, founder of Priso. Manages robots.txt strategies for flashcards.nl and priso.nl, reviews robots.txt files in every Priso audit.