AI Crawler — Definition

AI Crawler

TL;DR: An AI crawler is an automated bot that fetches web pages on behalf of an AI system — either to train a model, to build a citation index for AI search answers, or to fetch a specific page a real user asked about.

What it means

An AI crawler is software that requests and reads your web pages for an AI company, the way Googlebot reads pages for Google Search. The label covers three functionally different jobs, and the difference is what determines whether you should welcome or block any given crawler. A training crawler turns your content into model weights. A retrieval (or search) crawler builds an index used to answer questions and cite sources. A user-fetch agent fires when a person asks an AI to open your specific page. Same mechanism — a bot making an HTTP request — but three very different deals for the site owner.

Why it matters

Treating all AI crawlers as one thing is the most common access-control mistake. Block everything and you disappear from AI search answers and lose live visitors who arrive via AI assistants; block nothing and your content trains competitors’ models with no credit or traffic back. The right policy is per-type, not all-or-nothing (see seo/ai-crawler-access for the full decision). For example, blocking OpenAI’s GPTBot (training; see glossary/gptbot) while allowing OAI-SearchBot (retrieval) keeps your content out of OpenAI’s training crawl while leaving you visible in ChatGPT search.
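As a minimal sketch of that per-type policy, the Python below builds a robots.txt that blocks GPTBot and allows OAI-SearchBot, then checks the result with the standard-library parser. The `POLICY` mapping, the helper names, and the example.com URL are illustrative assumptions, not anything an operator publishes. Note that robots.txt is a request that compliant crawlers honor, not an enforcement mechanism.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy (assumption): True = allow, False = block.
POLICY = {
    "GPTBot": False,         # training crawler: blocked
    "OAI-SearchBot": True,   # retrieval crawler: allowed
}


def build_robots_txt(policy):
    """Render one User-agent group per crawler in the policy."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}")
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, agent, url="https://example.com/page"):
    """Ask Python's robots.txt parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(POLICY)
    print(robots)
    for agent in POLICY:
        print(agent, "allowed:", is_allowed(robots, agent))
```

Crawlers that the file does not name, such as a user-fetch agent, match no group here and are therefore allowed by default.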

How it works / examples

Training: GPTBot (OpenAI; see glossary/gptbot), ClaudeBot (Anthropic), CCBot (Common Crawl; see glossary/ccbot). These fetch broadly to build datasets.
Retrieval / search: OAI-SearchBot, PerplexityBot. These fetch to cite you in AI answers with a link back.
User-fetch: ChatGPT-User, Perplexity-User. These fire when a real person asks an AI assistant to open your URL. They are visitors; never block them.
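The three categories above can be sketched as a simple lookup. This is a hedged illustration: the token-to-type table only restates the list above, the function name is hypothetical, and the sample user-agent strings in the usage lines are made up for the demo rather than copied from any operator's documentation.

```python
# Tokens and categories restate the list above; the set of active crawlers
# and their exact user-agent strings change over time.
CRAWLER_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "ChatGPT-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the crawler type for a user-agent string, or 'unknown'."""
    ua = user_agent.lower()
    for token, kind in CRAWLER_TYPES.items():
        # Case-insensitive substring match on the crawler token.
        if token.lower() in ua:
            return kind
    return "unknown"


if __name__ == "__main__":
    # Made-up sample strings, for illustration only.
    for sample in (
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ):
        print(classify_user_agent(sample), "<-", sample)
```

Because the match relies on a self-reported string, the result is only a claimed type until the request is verified some other way.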

A crawler announces itself with a user-agent string, but that string is easily faked (see glossary/bytespider), so reliable identification checks the request’s source IP against the IP ranges each operator publishes.
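A rough sketch of that IP check, using Python's standard `ipaddress` module. The CIDR blocks below are placeholders from the reserved documentation ranges, not real crawler ranges; in practice you would load the ranges each operator publishes. The `PUBLISHED_RANGES` table and the function name are assumptions made for illustration.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation blocks), NOT real crawler IPs.
# Replace with the ranges each operator actually publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "OAI-SearchBot": ["198.51.100.0/24"],
}


def is_verified_crawler(claimed_agent, remote_ip, published=PUBLISHED_RANGES):
    """True only if remote_ip falls inside a range published for claimed_agent."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # malformed address: treat as unverified
    return any(
        ip in ipaddress.ip_network(cidr)
        for cidr in published.get(claimed_agent, [])
    )


if __name__ == "__main__":
    print(is_verified_crawler("GPTBot", "192.0.2.10"))   # inside the placeholder range
    print(is_verified_crawler("GPTBot", "203.0.113.5"))  # outside it: likely spoofed
```

A request that claims a known crawler name but fails this check is best treated as an unidentified bot rather than as the crawler it claims to be.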

seo/ai-crawler-access — the full taxonomy, enforcement, and current user-agent tables
glossary/gptbot — the most-discussed training crawler
glossary/llms-txt — the advisory file that helps AI understand (not control) your site
seo/ai-visibility — getting found by the crawlers you allow

Sources

seo/ai-crawler-access — internal synthesis
OpenAI — Bots / Crawlers documentation