Lesson 0010 · AI crawler controls

Who Gets to Train on You

Classic SEO had one bot to please. The AI era added a zoo of them — and a new decision per vendor: train on me, cite me, or neither?

Recap: Lesson 0002 used robots.txt to control one crawler — Googlebot. This lesson uses the same file to answer a question Googlebot never raised: now that a dozen AI crawlers fetch your pages, which of them do you let in, and for what?

There is no PASS/FAIL here. Allowing or blocking an AI crawler is a policy choice, not a bug — so the goal is to see your current posture clearly and set it deliberately, instead of inheriting whatever default your CMS shipped.

Your win: read your site’s exact ALLOW/BLOCK posture toward every major AI crawler — and understand the one distinction (train vs retrieve) that decides whether blocking helps you or quietly costs you citations.

A fetch can do three different jobs

The mistake is treating “AI bots” as one thing. A fetch falls into one of three buckets, and you may feel differently about each:

Job	What the fetch feeds	Example bots
train	a model’s training corpus	`GPTBot`, `Google-Extended`, `CCBot`, `ClaudeBot`, `Bytespider`
retrieve	a live answer it will cite	`OAI-SearchBot`, `PerplexityBot`
user-fetch	one page a user asked for	`ChatGPT-User`, `Claude-User`

The decision that matters: citations — the whole AEO win from Lesson 0001 — come from the retrieve bots. Training comes from the train bots. They are separate crawlers, so you can refuse training and still be citable. Block the train bots, keep the retrieve bots open: opt out of the corpus without going invisible in answers.

Block to opt out of training

Disallow GPTBot, Google-Extended, CCBot, ClaudeBot.

Your content stops feeding model training.

Allow to stay citable

Keep OAI-SearchBot, PerplexityBot, and Googlebot open.

You remain eligible to be retrieved and cited in answers.
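The split above can be sketched as a small generator that turns a job-level policy into robots.txt groups. This is a minimal illustration, not part of tools/ai_bots.py: the `BOT_JOBS` table and the `build_robots_txt` helper are hypothetical names, and the roster simply copies the bots named in this lesson.

```python
# Hypothetical policy table: bot token -> job, copied from the lesson's table.
BOT_JOBS = {
    "GPTBot": "train",
    "Google-Extended": "train",
    "CCBot": "train",
    "ClaudeBot": "train",
    "OAI-SearchBot": "retrieve",
    "PerplexityBot": "retrieve",
}


def build_robots_txt(bot_jobs, blocked_jobs):
    """Return robots.txt text that disallows every bot whose job is blocked."""
    groups = []
    for bot, job in bot_jobs.items():
        rule = "Disallow: /" if job in blocked_jobs else "Allow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Opt out of training, stay citable: block only the "train" job.
robots_txt = build_robots_txt(BOT_JOBS, blocked_jobs={"train"})
print(robots_txt)
```

Googlebot is deliberately absent from the table: a crawler with no matching group and no wildcard rule is allowed by default, so leaving it out keeps it open.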

The Google-Extended trap: Google-Extended is a robots.txt control token rather than a separate crawler. It governs whether content Google has already crawled may be used to train Gemini models and to ground answers in Gemini Apps and Vertex AI. It does not affect Google Search, and it does not control AI Overviews, which follow normal indexing.^[1] Blocking Google-Extended to “keep AI out” does not remove you from Search, and does not stop AI Overviews from quoting your indexed pages. It is a different lever than people think.

robots.txt is a request, not a wall

Every directive here is voluntary. A well-behaved crawler reads robots.txt and obeys; a hostile scraper ignores it entirely.^[2] If you must enforce a block, that’s a server-side job — a WAF, rate limits, or UA/IP rules — not a line in a text file.

# robots.txt — opt out of training, stay citable
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /          # keep retrieval open → still citable
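Because robots.txt is only a request, an enforced block has to live in the server. The sketch below shows one possible shape for a UA rule, written as WSGI middleware using only the standard library. It is an illustration under assumptions: `BLOCKED_UA_TOKENS`, `block_ai_bots`, `demo_app`, and `status_for` are hypothetical names, and the User-Agent strings in the demo are made up for the example.

```python
from wsgiref.util import setup_testing_defaults

# Assumption: substrings of User-Agent headers you have decided to refuse.
BLOCKED_UA_TOKENS = ("GPTBot", "CCBot")


def block_ai_bots(app, blocked=BLOCKED_UA_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""
    lowered = tuple(token.lower() for token in blocked)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(user_agent):
    """Run one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    wrapped = block_ai_bots(demo_app)
    wrapped(environ, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

A User-Agent header is self-reported, so this only stops bots that identify themselves honestly. A scraper that spoofs a browser UA needs the IP rules or rate limits mentioned above.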

On llms.txt: you’ll see llms.txt pitched as “robots.txt for AI.” It’s a proposal, not a standard — no major engine enforces it, and Google has publicly called it unnecessary. Ship one if you like (it’s harmless), but don’t mistake it for a control that does anything today.

The skill: read your posture

tools/ai_bots.py reads your robots.txt the way each AI crawler would — urllib.robotparser answers can_fetch(bot, "/") per user-agent — and probes for llms.txt.

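# the heart of it: ask robots.txt, once per AI bot
import urllib.robotparser

# robots_text: the fetched robots.txt body; AI_BOTS: (user-agent, vendor, kind) tuples
rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_text.splitlines())
for ua, vendor, kind in AI_BOTS:
    # can_fetch answers for the site root as this user-agent would see it
    posture = "ALLOW" if rp.can_fetch(ua, "/") else "BLOCK"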
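For readers who want to run the idea without the tool, here is a self-contained sketch of the same check. The `AI_BOTS` roster, the `DEMO_ROBOTS` text, and the `read_posture` helper are assumptions made for illustration; the real list and report format live in tools/ai_bots.py and may differ.

```python
import urllib.robotparser

# Assumed roster for illustration; the real AI_BOTS list lives in tools/ai_bots.py.
AI_BOTS = [
    ("GPTBot", "OpenAI", "train"),
    ("OAI-SearchBot", "OpenAI", "retrieve"),
    ("Google-Extended", "Google", "train"),
    ("PerplexityBot", "Perplexity", "retrieve"),
]

# The robots.txt example from this lesson, used as offline demo input.
DEMO_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def read_posture(robots_text, bots=AI_BOTS):
    """Map each bot's user-agent token to "ALLOW" or "BLOCK" for the site root."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return {
        ua: ("ALLOW" if rp.can_fetch(ua, "/") else "BLOCK")
        for ua, _vendor, _kind in bots
    }


posture = read_posture(DEMO_ROBOTS)
for ua, vendor, kind in AI_BOTS:
    print(f"{posture[ua]:5} · {vendor:10} · {ua:15} {kind}")
```

PerplexityBot has no group in the demo file and there is no wildcard rule, so the parser reports it as allowed. Silence in robots.txt means ALLOW, which is exactly how an inherited default becomes a posture.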

Do this now —

Self-check offline: python3 tools/ai_bots.py --demo
Read your own posture: python3 tools/ai_bots.py https://your-site.example/
Decide deliberately: are you citable (retrieve bots allowed) while controlling training (train bots set to taste)? Set it; don’t inherit it.

$ python3 tools/ai_bots.py https://acme.test/

AI-crawler posture: https://acme.test
──────────────────────────────────────────────
[FAIL] BLOCK · OpenAI       · GPTBot          train
[PASS] ALLOW · OpenAI       · OAI-SearchBot   retrieve
[PASS] ALLOW · Anthropic    · ClaudeBot       train
[FAIL] BLOCK · Google       · Google-Extended train
[PASS] ALLOW · Perplexity   · PerplexityBot   retrieve
[FAIL] BLOCK · Common Crawl · CCBot           train
[WARN] none  · (extra)      · llms.txt        non-standard
──────────────────────────────────────────────
VERDICT: 8/12 AI crawlers allowed. robots.txt is voluntary — only honored by bots that choose to.

Ceiling to know: this reports declared posture, not enforced reality — a bot that ignores robots.txt sails right past it. And the bot list is a moving target; vendors add and rename crawlers, so treat the roster as current-as-of, not permanent. The question the tool can’t answer for you is the only one that matters: what do you want trained on, and what do you want cited?
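One way to compare declared posture with what actually happens is to look for blocked bots in your access log. The sketch below assumes the common "combined" log format, where the User-Agent is the last quoted field. The sample log lines, the `DECLARED_BLOCKED` tuple, and the `hits_from_blocked_bots` helper are all made up for illustration.

```python
import re
from collections import Counter

# Assumption: bots your robots.txt declares as blocked.
DECLARED_BLOCKED = ("GPTBot", "CCBot")

# Assumption: combined log format, where the User-Agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
REQUEST_PATH = re.compile(r'"[A-Z]+ (\S+) ')


def hits_from_blocked_bots(log_lines, blocked=DECLARED_BLOCKED):
    """Count page requests whose User-Agent names a bot that robots.txt blocks."""
    hits = Counter()
    for line in log_lines:
        ua_match = UA_FIELD.search(line)
        path_match = REQUEST_PATH.search(line)
        if not ua_match or not path_match:
            continue
        # Fetching robots.txt itself is expected, even from a blocked bot.
        if path_match.group(1) == "/robots.txt":
            continue
        user_agent = ua_match.group(1).lower()
        for bot in blocked:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits


# Fabricated sample lines, using documentation-range IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]

hits = hits_from_blocked_bots(SAMPLE_LOG)
print(dict(hits))
```

A nonzero count means something carrying that bot's name fetched a page you asked it not to. It cannot reveal scrapers that hide behind a browser User-Agent, so treat it as a lower bound.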

Retrieval practice · no peeking

Posture, not pass/fail

Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4

Google-Extended in your robots.txt controls…

Question 2 / 4

You want ChatGPT and Perplexity to cite you, but not train on you. You should…

Question 3 / 4

A crawler ignores your robots.txt and scrapes anyway. So robots.txt is…

Question 4 / 4

llms.txt is…

Primary source — read this next (≈12 min)

“AI features and your website” — Google Search Central

Google on the record about how AI features pick content and how to control inclusion — the primary source that grounds the Google-Extended distinction.
