Lesson 0010 · AI crawler controls
Who Gets to Train on You
Classic SEO had one bot to please. The AI era added a zoo of them — and a new decision per vendor: train on me, cite me, or neither?
Recap: Lesson 0002 used robots.txt to control one crawler — Googlebot. This lesson uses the same file to answer a question Googlebot never raised: now that a dozen AI crawlers fetch your pages, which of them do you let in, and for what?
There is no PASS/FAIL here. Allowing or blocking an AI crawler is a policy choice, not a bug — so the goal is to see your current posture clearly and set it deliberately, instead of inheriting whatever default your CMS shipped.
A fetch can do three different jobs
The mistake is treating “AI bots” as one thing. A fetch falls into one of three buckets, and you may feel differently about each:
| Job | What the fetch feeds | Example bots |
|---|---|---|
| train | a model’s training corpus | GPTBot, Google-Extended, CCBot, ClaudeBot, Bytespider |
| retrieve | a live answer it will cite | OAI-SearchBot, PerplexityBot |
| user-fetch | one page a user asked for | ChatGPT-User, Claude-Web |
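To make the three buckets concrete, here is a minimal sketch that holds the roster as data and groups it by job. The `(ua, vendor, kind)` tuple shape mirrors the loop in tools/ai_bots.py; the vendor names and the `bots_by_job` helper are illustrative assumptions, not the tool's actual code.

```python
from collections import defaultdict

# Roster from the table, as (user-agent token, vendor, job) tuples.
# Vendor names are filled in for illustration; treat the list as
# current-as-of, not permanent.
AI_BOTS = [
    ("GPTBot", "OpenAI", "train"),
    ("Google-Extended", "Google", "train"),
    ("CCBot", "Common Crawl", "train"),
    ("ClaudeBot", "Anthropic", "train"),
    ("Bytespider", "ByteDance", "train"),
    ("OAI-SearchBot", "OpenAI", "retrieve"),
    ("PerplexityBot", "Perplexity", "retrieve"),
    ("ChatGPT-User", "OpenAI", "user-fetch"),
    ("Claude-Web", "Anthropic", "user-fetch"),
]


def bots_by_job(roster):
    """Group user-agent tokens by the job their fetch performs."""
    groups = defaultdict(list)
    for ua, _vendor, kind in roster:
        groups[kind].append(ua)
    return dict(groups)


by_job = bots_by_job(AI_BOTS)
for kind, uas in by_job.items():
    print(f"{kind:<10} {', '.join(uas)}")
```

Keeping the job next to each token lets you write policy per bucket ("block train, allow retrieve") instead of per bot.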
One common posture is to opt out of training while staying citable: disallow GPTBot, Google-Extended, CCBot, and ClaudeBot, and keep OAI-SearchBot, PerplexityBot, and Googlebot open.
Google-Extended controls only whether your content trains Gemini / Vertex AI. It does not affect Google Search, and it does not control AI Overviews; those follow normal indexing.[1] Blocking Google-Extended to “keep AI out” does not remove you from Search, and does not stop AI Overviews from quoting your indexed pages. It is a different lever than people think.
robots.txt is a request, not a wall
Every directive here is voluntary. A well-behaved crawler reads robots.txt and obeys; a hostile scraper ignores it entirely.[2] If you must enforce a block, that’s a server-side job — a WAF, rate limits, or UA/IP rules — not a line in a text file.
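As a rough illustration of what a server-side UA rule looks like, here is a minimal WSGI middleware sketch that answers 403 when the User-Agent header contains a blocked token. The token list, the `BlockAIBots` class, and the `site` app are hypothetical names for this example. User-Agent strings are trivially spoofed, so a real deployment would pair a rule like this with IP rules or a WAF.

```python
# Hypothetical block list: lowercase substrings of User-Agent headers to refuse.
BLOCKED_UA_TOKENS = ("gptbot", "ccbot", "bytespider")


def is_blocked(user_agent):
    """True if the User-Agent header contains any blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)


class BlockAIBots:
    """WSGI middleware: answer 403 to blocked user-agents, pass others through."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def site(environ, start_response):
    """Stand-in for the real application."""
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


app = BlockAIBots(site)
```

Unlike a robots.txt line, this refuses the request whether or not the crawler chose to ask first, though only for crawlers that announce themselves honestly.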
# robots.txt — opt out of training, stay citable
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: OAI-SearchBot
Allow: / # keep retrieval open → still citable
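If you would rather keep the policy as data and generate the file, a small sketch like the following would do. The `render_robots` helper and the `policy` dict are assumptions made for this example; it emits only whole-site Allow or Disallow groups, which is all the sample above needs.

```python
def render_robots(policy):
    """Render a robots.txt body from {user-agent token: allowed?}.

    Each bot gets its own group with a single whole-site rule.
    """
    groups = []
    for ua, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {ua}\n{rule}")
    # Blank lines between groups keep the file easy to read.
    return "\n\n".join(groups) + "\n"


# Opt out of training, stay citable.
policy = {
    "GPTBot": False,
    "Google-Extended": False,
    "OAI-SearchBot": True,
}

robots_text = render_robots(policy)
print(robots_text)
```

Generating the file from a dict keeps the decision (who is allowed) separate from the syntax, which makes the posture easier to review.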
llms.txt is pitched as “robots.txt for AI.” It’s a proposal, not a standard: no major engine enforces it, and Google has publicly called it unnecessary. Ship one if you like (it’s harmless), but don’t mistake it for a control that does anything today.
The skill: read your posture
tools/ai_bots.py reads your robots.txt the way each AI crawler would — urllib.robotparser answers can_fetch(bot, "/") per user-agent — and probes for llms.txt.
# the heart of it: ask robots.txt, once per AI bot
# (robots_text and AI_BOTS come from the rest of the tool)
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_text.splitlines())
for ua, vendor, kind in AI_BOTS:
    posture = "ALLOW" if rp.can_fetch(ua, "/") else "BLOCK"
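For a version you can run on its own, here is a self-contained sketch of the same check. The sample `ROBOTS_TEXT`, the shortened roster, and the `read_posture` helper are assumptions for illustration, not the tool's source. It also shows two behaviors worth knowing: a bot with no group of its own falls back to the `User-agent: *` group, and checking only "/" says nothing about rules on deeper paths.

```python
import urllib.robotparser

# Illustrative robots.txt: one train bot blocked, one retrieve bot allowed,
# and a wildcard group that only fences off /private/.
ROBOTS_TEXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Shortened roster in the (ua, vendor, kind) shape.
AI_BOTS = [
    ("GPTBot", "OpenAI", "train"),
    ("OAI-SearchBot", "OpenAI", "retrieve"),
    ("ClaudeBot", "Anthropic", "train"),
]


def read_posture(robots_text, roster, path="/"):
    """Return {ua: "ALLOW" | "BLOCK"} for one path, per robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return {
        ua: "ALLOW" if rp.can_fetch(ua, path) else "BLOCK"
        for ua, _vendor, _kind in roster
    }


posture = read_posture(ROBOTS_TEXT, AI_BOTS)
for ua, verdict in posture.items():
    print(f"{verdict:<5} {ua}")

# ClaudeBot has no group of its own, so the wildcard group decides.
deep = read_posture(ROBOTS_TEXT, AI_BOTS, path="/private/report")
print("ClaudeBot on /private/report:", deep["ClaudeBot"])
```

ClaudeBot reads as ALLOW at "/" but BLOCK under /private/, so a root-only check is a summary of posture, not a full map of it.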
- Self-check offline:
  python3 tools/ai_bots.py --demo
- Read your own posture:
  python3 tools/ai_bots.py https://your-site.example/
- Decide deliberately: are you citable (retrieve bots allowed) while controlling training (train bots set to taste)? Set it; don’t inherit it.
$ python3 tools/ai_bots.py https://acme.test/
AI-crawler posture: https://acme.test
──────────────────────────────────────────────
[FAIL] BLOCK · OpenAI · GPTBot train
[PASS] ALLOW · OpenAI · OAI-SearchBot retrieve
[PASS] ALLOW · Anthropic · ClaudeBot train
[FAIL] BLOCK · Google · Google-Extended train
[PASS] ALLOW · Perplexity · PerplexityBot retrieve
[FAIL] BLOCK · Common Crawl · CCBot train
[WARN] none · (extra) · llms.txt non-standard
──────────────────────────────────────────────
VERDICT: 8/12 AI crawlers allowed. robots.txt is voluntary — only honored by bots that choose to.
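To turn a report like this into the "citable while controlling training" question, you can tally results by job. The sketch below hard-codes the six bot rows shown in the sample output; the `summarize` helper and the `citable` rule (at least one retrieve bot allowed, none blocked) are assumptions for this example, not part of the tool.

```python
def summarize(results):
    """Group (ua, kind, posture) rows into {kind: {"ALLOW": [...], "BLOCK": [...]}}."""
    summary = {}
    for ua, kind, posture in results:
        bucket = summary.setdefault(kind, {"ALLOW": [], "BLOCK": []})
        bucket[posture].append(ua)
    return summary


# The six bot rows from the sample run.
RESULTS = [
    ("GPTBot", "train", "BLOCK"),
    ("OAI-SearchBot", "retrieve", "ALLOW"),
    ("ClaudeBot", "train", "ALLOW"),
    ("Google-Extended", "train", "BLOCK"),
    ("PerplexityBot", "retrieve", "ALLOW"),
    ("CCBot", "train", "BLOCK"),
]

summary = summarize(RESULTS)

# Assumed definition of "citable": some retrieve bot allowed, none blocked.
retrieve = summary.get("retrieve", {"ALLOW": [], "BLOCK": []})
citable = bool(retrieve["ALLOW"]) and not retrieve["BLOCK"]

print("citable:", citable)
print("train bots still allowed:", summary["train"]["ALLOW"])
print("train bots blocked:", summary["train"]["BLOCK"])
```

Read this way, the sample site is citable, but ClaudeBot is still allowed to train, which is exactly the kind of inherited default worth a deliberate decision.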
Ceiling to know: this reports declared posture, not enforced reality — a bot that ignores robots.txt sails right past it. And the bot list is a moving target; vendors add and rename crawlers, so treat the roster as current-as-of, not permanent. The question the tool can’t answer for you is the only one that matters: what do you want trained on, and what do you want cited?
Retrieval practice · no peeking
Posture, not pass/fail
Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.
- Google-Extended in your robots.txt controls…
- …robots.txt and scrapes anyway. So robots.txt is…
- llms.txt is…
Google on the record about how AI features pick content and how to control inclusion: the primary source that grounds the Google-Extended distinction.