AI Crawler Access Control — Bot Taxonomy, robots.txt vs WAF

TL;DR: Not all AI bots are the same. Training bots turn your content into model weights and give nothing back; retrieval/search bots turn it into cited answers with referral links; user-fetch bots fire when a real person asks an AI to open your page (live visitor intent — never block these). The standard play: Disallow training agents in robots.txt, Allow retrieval + user-fetch. But the load-bearing caveat is enforcement: robots.txt only asks compliant bots to stay away; a WAF actually blocks the request at the edge. Real enforcement is the firewall, not the txt file — non-compliant scrapers (spoofed UAs, rotating IPs) ignore robots.txt entirely.

Why this matters

AI bots are now a meaningful share of web traffic, and the access decision has real consequences in both directions. Block too much and you lose AI visibility (your content can’t be cited if the citation bot can’t read it). Block too little and your content trains competitors’ models with no credit or traffic in return. Getting this right requires knowing which kind of bot is knocking — and knowing that the tool most people reach for (robots.txt) is the weakest form of control.

This page sits adjacent to the AI Visibility Audit skill and is reusable for lead-gen audits and GEO projects.

The bot taxonomy (3 core categories + 2 emerging)

The major vendors now expose separate user-agent tokens for each functional role — the taxonomy isn’t just an outside framing; the bots were built around it.

| Category | What it does | Traffic/credit back? | Default stance |
| --- | --- | --- | --- |
| Training | Fetches content to build/refine model weights | None | Block (if you don’t want to train models for free) |
| Retrieval / search | Fetches to build a citation index for AI-search answers | Yes: citations + referral links | Allow |
| User-fetch | Fires when a real person asks the AI to open a specific page | Yes: represents a live visitor | Never block |

Two emerging categories are worth flagging as of 2026 (their tokens have not fully stabilized):

  • Agent / action bots — autonomous agents that do tasks (browse, shop, fill forms) on a user’s behalf. Distinct from user-fetch because they take multi-step actions. As of mid-2026 these mostly run via existing browser-agent fetchers rather than a clean dedicated robots.txt token.
  • Ads / safety-validation bots — e.g. OpenAI’s OAI-AdsBot, which validates ad landing pages. Neither training, search, nor user-fetch.

Cloudflare’s own classification aligns with this split (separating an “AI Crawler” / training category from “AI Search” and AI-agent categories), though it doesn’t use identical labels.

Enforcement: a WAF acts, robots.txt only asks

This is the least understood and most consequential part of access control.

  • robots.txt is a voluntary honor system. Per IETF RFC 9309, crawlers are requested to honor the rules, and the rules are not a form of access authorization. Cloudflare’s analogy: it’s “a Code of Conduct sign posted at a community pool.” A polite bot reads it and self-restricts; nothing forces it to.
  • A WAF (web application firewall) enforces. A request matching a blocked rule gets a 403 (or is dropped) at the edge, before it reaches your origin — regardless of what robots.txt says. The firewall never consults robots.txt.
  • They live at different layers. It’s not that “the WAF rule is parsed before the robots.txt file” — they’re separate mechanisms. If both exist, the WAF wins by construction: it intercepts the request before the bot’s own robots.txt logic is even relevant.
  • robots.txt only governs compliant bots. Non-compliant scrapers — Bytespider is the canonical example, plus stealth crawlers that spoof user-agents, rotate ASNs/IPs, and publish no IP ranges — require WAF rules + IP/ASN blocks + behavioral heuristics.
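
The difference between asking and enforcing can be sketched as a tiny edge filter. This is a minimal illustration, not any WAF vendor's rule syntax: the function name `edge_decision`, the `Decision` type, and the block list are hypothetical, and only the UA tokens come from this page.

```python
from dataclasses import dataclass

# Hypothetical block list for illustration; the tokens are taken from this page.
BLOCKED_UA_TOKENS = ("gptbot", "ccbot", "bytespider")


@dataclass(frozen=True)
class Decision:
    status: int   # HTTP status the edge would return
    reason: str


def edge_decision(user_agent: str) -> Decision:
    """Decide at the edge. robots.txt is never read or consulted here."""
    ua = user_agent.lower()
    for token in BLOCKED_UA_TOKENS:
        if token in ua:
            return Decision(403, f"blocked UA token: {token}")
    return Decision(200, "forwarded to origin")


# A declared crawler is stopped before it reaches the origin.
declared = edge_decision("Mozilla/5.0 (compatible; GPTBot/1.3)")

# The same scraper sending a generic browser UA passes a UA-only rule.
spoofed = edge_decision("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0")
```

The second call is the reason UA rules are only a first layer: a scraper that lies about its user-agent is not caught by token matching, which is where IP/ASN checks and behavioral heuristics come in.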

Bottom line: if you actually need to keep a bot out, the firewall is the enforcement layer; robots.txt is a politeness request that well-behaved bots honor. And UA-string matching alone is weak (UAs are trivially spoofed) — pair it with published IP-range verification (the vendor *.json files below) for anything you want genuinely enforced.
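
A minimal sketch of IP-range verification follows, using only Python's standard library. The JSON shape (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys) is an assumption about how a vendor range file might look, and the sample ranges are reserved documentation addresses, not real crawler ranges. Check the actual vendor file before relying on this.

```python
import ipaddress
import json

# ASSUMED file shape and placeholder ranges (RFC 5737 / RFC 3849 documentation
# addresses). Real vendor files may use different keys.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""


def load_networks(raw: str):
    """Parse a range file into a list of ip_network objects."""
    data = json.loads(raw)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(source_ip: str, networks) -> bool:
    """True only if the request's source IP falls inside a published range."""
    addr = ipaddress.ip_address(source_ip)
    # Membership is False when the address and network IP versions differ.
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
```

A request that claims a vendor's UA but fails `is_verified` is a candidate for blocking or challenge, since the UA string alone proves nothing.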

The Perplexity stealth-crawling case (Aug 2025)

A concrete illustration: in August 2025, Cloudflare reported that Perplexity was observed rotating user-agents and source ASNs to disguise crawling — using a generic “Chrome on macOS” UA outside its declared IP ranges when its declared crawler was blocked, and ignoring robots.txt. Cloudflare de-listed Perplexity from its verified-bots list and added managed heuristics. Perplexity denied it, calling the report a “sales pitch.” The episode is the textbook demonstration of “robots.txt only governs bots that choose to comply” — the response to a stealth crawler is WAF heuristics + IP/ASN blocking, not a txt directive.

Current user-agent strings (as of mid-2026)

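Dated layer: expect drift. UA tokens, versions, and IP-range publication change. Spelling is load-bearing for robots.txt rules. RFC 9309 specifies case-insensitive matching of the user-agent token, but not every crawler follows the RFC, so copy tokens exactly, casing included. Re-verify against vendor docs before deploying.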

Training crawlers (typically Disallow)

| Token | Vendor | Notes |
| --- | --- | --- |
| GPTBot | OpenAI | GPTBot/1.3. IP ranges published (openai.com/gptbot.json). |
| ClaudeBot | Anthropic | ClaudeBot/1.0. IP ranges now published (claude.com/crawling/bots.json, rev. Apr 2026); older “Anthropic publishes no IP ranges” claims are outdated. |
| CCBot | Common Crawl | Nonprofit crawler whose dumps feed many models. Respects robots.txt. No vendor IP-range JSON. |
| Google-Extended | Google | Opt-out token, not a crawler: makes no requests, has no UA of its own. Governs whether Googlebot-fetched content trains Gemini/Vertex. |
| Applebot-Extended | Apple | Opt-out token, not a crawler: a signal read by the existing Applebot UA controlling Apple Intelligence training use. |
| Amazonbot | Amazon | Respects robots.txt. Note a sibling Amzn-SearchBot is described as search-only (Alexa/Rufus); verify before treating them identically. |
| Meta-ExternalAgent | Meta | meta-externalagent/1.1. Meta AI / Llama training crawler (launched Jul 2024). FacebookBot is a separate, older crawler. |

Retrieval / search crawlers (typically Allow)

| Token | Vendor | Notes |
| --- | --- | --- |
| OAI-SearchBot | OpenAI | OAI-SearchBot/1.3. IP ranges published (openai.com/searchbot.json). |
| Claude-SearchBot | Anthropic | Claude-SearchBot/1.0. IP ranges in shared bots.json. |
| PerplexityBot | Perplexity | Search/citation indexer (“not used for training”). IP ranges published, but see the Aug 2025 verified-bot de-listing above. |

User-fetch agents (never block)

| Token | Vendor | Notes |
| --- | --- | --- |
| ChatGPT-User | OpenAI | ChatGPT-User/1.0. IP ranges published (openai.com/chatgpt-user.json). |
| Claude-User | Anthropic | Claude-User/1.0. IP ranges in shared bots.json. |
| Perplexity-User | Perplexity | Ignores robots.txt by design (user-triggered). IP ranges published. |
| Meta-ExternalFetcher | Meta | Real-time retrieval when users ask Meta AI. |
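
For log analysis or audits, the three tables above can be folded into one lookup. The sketch below is illustrative only: the tokens and categories are copied from this page, while the names `BOT_TAXONOMY` and `classify` and the stance labels are hypothetical. The opt-out tokens (Google-Extended, Applebot-Extended) are left out on purpose because they never appear in a request's user-agent.

```python
# Token -> (category, default stance), copied from the tables on this page.
# Opt-out tokens are omitted: they are robots.txt signals, not request UAs.
BOT_TAXONOMY = {
    "GPTBot": ("training", "block"),
    "ClaudeBot": ("training", "block"),
    "CCBot": ("training", "block"),
    "Amazonbot": ("training", "block"),
    "Meta-ExternalAgent": ("training", "block"),
    "OAI-SearchBot": ("retrieval", "allow"),
    "Claude-SearchBot": ("retrieval", "allow"),
    "PerplexityBot": ("retrieval", "allow"),
    "ChatGPT-User": ("user-fetch", "never-block"),
    "Claude-User": ("user-fetch", "never-block"),
    "Perplexity-User": ("user-fetch", "never-block"),
    "Meta-ExternalFetcher": ("user-fetch", "never-block"),
}


def classify(user_agent: str):
    """Return (token, category, stance) for a declared AI bot, else None."""
    ua = user_agent.lower()
    for token, (category, stance) in BOT_TAXONOMY.items():
        if token.lower() in ua:
            return token, category, stance
    return None  # unknown or undeclared client


hit = classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)")
```

This only classifies what a client claims to be. It says nothing about whether the claim is true, so treat the result as a label for reporting rather than as an enforcement decision.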

Others worth knowing

  • OAI-AdsBot (OpenAI) — ad landing-page validation. New in 2026.
  • Google-CloudVertexBot (Google) — a real crawler (unlike Google-Extended) collecting for Cloud Vertex AI Search; controllable via robots.txt.
  • Agentic-commerce bots — no widely-documented dedicated “shopping bot” UA had stabilized as of June 2026 (OpenAI’s Instant Checkout was wound down Mar 2026; Google’s Universal Commerce Protocol launched Jan 2026). These run as protocol/API integrations or via browser-agent fetchers — fastest-moving gap; don’t list a confirmed shopping UA yet.

robots.txt setup logic — and its one real tradeoff

The recommended pattern: Disallow training agents; Allow retrieval/search + user-fetch agents. Rationale: training takes content and returns nothing; retrieval and user-fetch can drive citations and referral traffic.
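
The pattern can be written out and checked locally. The sketch below embeds one possible robots.txt in a Python string and uses the standard library's `urllib.robotparser` as a stand-in for a compliant crawler. The token selection is a sample from the tables above, and `example.com` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# One possible robots.txt for the recommended pattern (sample tokens only).
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /

User-agent: *
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
url = "https://example.com/blog/post"  # placeholder URL

training_allowed = rules.can_fetch("GPTBot", url)
search_allowed = rules.can_fetch("OAI-SearchBot", url)
user_fetch_allowed = rules.can_fetch("ChatGPT-User", url)
```

This only shows what a compliant parser concludes from the file. A crawler that never runs such a check is unaffected by it.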

The tradeoff is genuine and load-bearing: this clean separation only holds where the vendor exposes separate training vs. search tokens. OpenAI and Anthropic do this well — you can block GPTBot while allowing OAI-SearchBot. Where a vendor uses one token for both, or an opt-out token (Google-Extended, Applebot-Extended) that only governs training use rather than blocking a crawler, the calculus differs. A Disallow aimed at an opt-out token doesn’t “block a crawler” — it changes downstream data use. Users routinely conflate the two mechanisms.
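
The opt-out distinction shows up when the rules are parsed. In the sketch below, `urllib.robotparser` again stands in for a compliant parser, and `example.com` is a placeholder. How Google applies the Google-Extended signal internally is not modeled here; the point is only that the directive leaves the crawler's own group untouched.

```python
from urllib.robotparser import RobotFileParser

# Disallow aimed at an opt-out token, with the real crawler still allowed.
OPT_OUT_ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

opt_out_rules = RobotFileParser()
opt_out_rules.parse(OPT_OUT_ROBOTS_TXT.splitlines())

page = "https://example.com/article"  # placeholder URL

# The crawler that actually makes requests is still permitted to fetch.
googlebot_can_crawl = opt_out_rules.can_fetch("Googlebot", page)

# The opt-out token matches its own group, which expresses the training opt-out.
extended_permitted = opt_out_rules.can_fetch("Google-Extended", page)
```

Fetching continues under the Googlebot group while the Google-Extended group carries the training opt-out, so the page can still be crawled and indexed.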

What robots.txt and llms.txt can’t do

  • llms.txt is advisory, not access control. It’s an emerging proposed standard — a curated markdown map of your site for LLMs — but it carries the same honor-system limitation as robots.txt and is not consistently honored. Useful for helping compliant AI understand your site; useless as an enforcement mechanism. (See seo/agentic-search-optimization for llms.txt’s visibility role.)
  • The infra-layer shift (Cloudflare, Jul 2025). Cloudflare became the first major infrastructure provider to block AI crawlers by default, and introduced a pay-per-crawl model using HTTP 402 Payment Required plus crawler-price / crawler-max-price headers. As of 2026 it fronts a large share of the public web with one-click AI blocking. This reframes “access control” from a per-site config problem into an infra-layer / marketplace problem — increasingly the real control point sits at the CDN, not in your repo.
  • The direction of travel is cryptographic/verified-bot signing and IP-range verification, because UA-string matching is spoofable.
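
The pay-per-crawl exchange can be sketched as a small decision function. This is a loose illustration, not Cloudflare's implementation: the header names `crawler-price` and `crawler-max-price` come from the text above, while the `USD 0.01` value format, the function names, and the response shape are assumptions made for the example.

```python
from decimal import Decimal


def parse_price(value: str) -> Decimal:
    """Parse a price header value. The 'USD <amount>' format is an assumption."""
    currency, amount = value.split()
    if currency != "USD":
        raise ValueError(f"unsupported currency: {currency}")
    return Decimal(amount)


def pay_per_crawl_response(request_headers: dict, page_price: str):
    """Return (status, response_headers) for a crawler request.

    Hypothetical logic: serve the page if the crawler's declared maximum
    covers the price, otherwise answer 402 and quote the price.
    """
    offered = request_headers.get("crawler-max-price")
    if offered is not None and parse_price(offered) >= parse_price(page_price):
        return 200, {}
    return 402, {"crawler-price": page_price}


no_offer = pay_per_crawl_response({}, "USD 0.01")
enough = pay_per_crawl_response({"crawler-max-price": "USD 0.05"}, "USD 0.01")
too_low = pay_per_crawl_response({"crawler-max-price": "USD 0.001"}, "USD 0.01")
```

The design point is that the negotiation happens in HTTP headers at the edge, which is why the control sits with the infrastructure provider rather than in a file in the site's repository.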

Key Takeaways

  • Three core bot categories: training (block — no credit back), retrieval/search (allow — drives citations), user-fetch (never block — live visitor). Plus emerging agent and ads-validation bots.
  • robots.txt asks; a WAF enforces. robots.txt only governs compliant bots; the firewall blocks at the edge regardless. Real enforcement = WAF + IP/ASN blocks, not the txt file.
  • UA-string blocking alone is weak — pair with vendor-published IP-range JSON verification.
  • Standard pattern: Disallow training UAs, Allow retrieval + user-fetch UAs — but this clean split only holds where the vendor separates training vs. search tokens.
  • Google-Extended and Applebot-Extended are opt-out signals, not crawlers (no UA, no requests).
  • Anthropic now publishes IP ranges (claude.com/crawling/bots.json); the Perplexity stealth-crawling incident was Aug 2025; Perplexity-User ignores robots.txt by design.
  • The control point is moving to the infra layer: Cloudflare default-blocks AI crawlers and offers HTTP-402 pay-per-crawl.
