CCBot (Common Crawl) — Definition
TL;DR: CCBot is the crawler run by Common Crawl, a nonprofit that publishes a free, open archive of the web. That archive is a major source of AI training data, so blocking CCBot reduces your content’s reach into many models at once. It respects robots.txt.
What it means
CCBot is the user-agent of Common Crawl, a nonprofit organization that has crawled the web since 2008 and releases the results as a free, openly downloadable dataset. Common Crawl is not itself an AI company — but its corpus is one of the most widely used training-data sources in the industry, feeding many large language models indirectly. So CCBot sits in the training category even though Common Crawl’s own mission is open data, not AI.
Why it matters
CCBot is a high-leverage block for anyone trying to limit training use, because a single Disallow rule reduces your content’s presence in many downstream models rather than in one company’s. The flip side: Common Crawl is also used for legitimate research and archiving, so blocking it is a broader decision than blocking one vendor’s bot. CCBot respects robots.txt, so the block is honored on future crawls, but content already captured in past crawls remains in existing dataset releases.
How it works / examples
To block CCBot:
User-agent: CCBot
Disallow: /

Common Crawl is a nonprofit crawler and, unlike the big AI vendors, it doesn’t publish a per-bot IP-range file for verification. That is another reason why, for hard enforcement against impersonators, a firewall layer matters more than the user-agent name (see glossary/waf).
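Before deploying the rule, it can be worth checking locally that it blocks CCBot without affecting other crawlers. The following is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt body, the example.com URL, the "CCBot/2.0" version string, and the is_allowed helper are illustrative assumptions, not part of Common Crawl's tooling, and the parser's matching may differ in edge cases from how a given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: blocks CCBot, leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-page"  # placeholder URL
    # "CCBot/2.0" is an illustrative version string; the parser matches
    # on the product token before the slash.
    for agent in ("CCBot", "CCBot/2.0", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

This only confirms what the file asks for. It says nothing about whether a given request really comes from Common Crawl, which is why user-agent rules and firewall-level enforcement are separate concerns.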
Related
- glossary/ai-crawler — the three crawler types
- glossary/gptbot — the parallel decision for OpenAI’s training bot
- seo/ai-crawler-access — the full allow/block policy