SEO·AEO for builders

Lesson 0002 · Crawl + Index stages

Crawlable ≠ Indexable

Two separate gates, two separate failure modes. Most "why isn't my page ranking?" bugs die here — before ranking is even in play.

Recap from Lesson 0001: classic search runs crawl → index → rank → serve. Today we zoom into the first two — because a page that fails either can never rank, and as a builder these are the two gates your tooling can check automatically.

Developers conflate these constantly. “Crawlable” and “indexable” are independent. A page can be perfectly crawlable and still refuse to be indexed. A page can be marked index-me and never get crawled. They are two different gates with two different controls.[1]

Your win: run a script that audits any URL through both gates and tells you exactly which one is failing — the tightest possible feedback loop on “can this page even show up?”

The two gates

CRAWL

Can the bot fetch it?

Controlled before the page is read.

  • robots.txt must allow the path
  • URL must return 200 (not 404 / 5xx / endless redirect)
  • must be discoverable (sitemap or a link)

INDEX

Will the engine keep it?

Controlled inside the fetched page.

  • no noindex (meta tag or X-Robots-Tag header)
  • has real content + a <title>
  • a sane canonical (which URL is the "real" one)

The trap that breaks everyone: blocking a page in robots.txt does not remove it from the index. A blocked URL can still be indexed from links that point to it, and because the bot is blocked from crawling, it never reads your noindex tag. To deindex a page you must allow the crawl and serve noindex. Block + noindex together = noindex never seen.[2] Crawl control and index control point in opposite directions.
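
A minimal sketch of that trap, using only the stdlib. The robots.txt text, the example.com URLs, and the helper names `can_crawl` and `noindex_will_be_seen` are made up for illustration; they are not taken from tools/crawl_audit.py.

```python
import urllib.robotparser

# Hypothetical robots.txt, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def can_crawl(robots_txt, url, agent="Googlebot"):
    """Parse robots.txt text offline and ask whether `agent` may fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

def noindex_will_be_seen(robots_txt, url, page_has_noindex):
    """A noindex directive only takes effect if the bot may fetch the page."""
    return page_has_noindex and can_crawl(robots_txt, url)

blocked = "https://example.com/private/old-page"
open_url = "https://example.com/blog/old-post"

print(can_crawl(ROBOTS_TXT, blocked))                   # False: crawl gate closed
print(noindex_will_be_seen(ROBOTS_TXT, blocked, True))  # False: the tag is never read
print(noindex_will_be_seen(ROBOTS_TXT, open_url, True)) # True: crawl allowed, noindex read
```

The second call is the trap in one line: the page carries noindex, but the crawl block means no bot ever fetches it to find out.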

The checklist, and what controls each

Gate   Check          Control lives in
CRAWL  Path allowed   /robots.txt
CRAWL  Returns 200    HTTP status
INDEX  No noindex     <meta name="robots"> or X-Robots-Tag header
INDEX  Has title      <title> in HTML
INDEX  Has canonical  <link rel="canonical">
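
The noindex row has two homes, and the header one is easy to miss because it never appears in the HTML. Below is a rough sketch of checking X-Robots-Tag values for noindex. The function name `header_has_noindex` and the `VALUE_DIRECTIVES` set are assumptions made for this example, and the handling of the optional `botname:` prefix is a simplified reading of the header format, not a full parser.

```python
# Directives that carry their own "name: value" form, so a colon after them
# does not mean "this is a user-agent prefix". (Assumed list for this sketch.)
VALUE_DIRECTIVES = {
    "unavailable_after", "max-snippet", "max-image-preview", "max-video-preview",
}

def header_has_noindex(values, agent="googlebot"):
    """Return True if any X-Robots-Tag value applies noindex to `agent`.

    `values` is a list of raw header values, one per X-Robots-Tag header.
    """
    for raw in values:
        value = raw.strip().lower()
        head, sep, rest = value.partition(":")
        head = head.strip()
        # "googlebot: noindex, nofollow" scopes the directives to one bot.
        if sep and "," not in head and head not in VALUE_DIRECTIVES:
            if head != agent.lower():
                continue          # addressed to a different bot
            value = rest
        directives = {d.strip() for d in value.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if "noindex" in directives or "none" in directives:
            return True
    return False

print(header_has_noindex(["noindex, nofollow"]))        # True
print(header_has_noindex(["bingbot: noindex"]))         # False for googlebot
print(header_has_noindex(["max-snippet: 10"]))          # False
```

With urllib, the raw values could come from `response.headers.get_all("X-Robots-Tag") or []`, since a response may send the header more than once.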

The skill: audit a URL with a script

You already have it: tools/crawl_audit.py in this workspace. Stdlib only, no install. The core: urllib.robotparser answers the robots.txt half of the crawl gate (the HTTP status answers the other half), and a tiny HTML parser answers the index gate.

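# the heart of it: robots gate, then index signals
import urllib.robotparser
from urllib.parse import urlsplit

def origin(url):
    # scheme + host, e.g. "https://example.com"
    # (shown so the excerpt runs on its own; the script's own helper may differ)
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def robots_allows(url):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(origin(url) + "/robots.txt")
    rp.read()                               # fetches robots.txt over the network
    return rp.can_fetch("Googlebot", url)   # stdlib does the parsing

# index gate: noindex? title? canonical?  (from the page's <head>)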
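
The index-gate half can be sketched with the stdlib's html.parser. This is an illustrative version, not necessarily how tools/crawl_audit.py does it; the class name `IndexSignals` and the function `index_gate` are assumptions made for this example, and the sample HTML is invented.

```python
from html.parser import HTMLParser

class IndexSignals(HTMLParser):
    """Collect the three index-gate signals from an HTML document."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.title = ""
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            directives = {d.strip() for d in a.get("content", "").lower().split(",")}
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href") or None

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def index_gate(html):
    """Return the index-gate signals found in `html` as a dict."""
    parser = IndexSignals()
    parser.feed(html)
    parser.close()
    return {
        "noindex": parser.noindex,
        "has_title": bool(parser.title.strip()),
        "canonical": parser.canonical,
    }

sample = (
    '<html><head><title>Hello</title>'
    '<meta name="robots" content="noindex, follow">'
    '<link rel="canonical" href="https://example.com/a">'
    '</head><body>x</body></html>'
)
print(index_gate(sample))
# {'noindex': True, 'has_title': True, 'canonical': 'https://example.com/a'}
```

This only inspects the HTML; a noindex sent in the X-Robots-Tag response header would need a separate check on the headers.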

Do this now. The feedback loop is the lesson:

  1. Self-check the parser (offline): python3 tools/crawl_audit.py --demo
  2. Audit a page that passes — Google’s own docs: python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works
  3. Now audit a page you own. Then deliberately break it: add <meta name="robots" content="noindex"> to a test page and watch the INDEX gate flip to FAIL while CRAWL still passes. That contrast is the lesson.

$ python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works

Audit: https://developers.google.com/.../how-search-works
──────────────────────────────────────────────
[PASS] CRAWL · robots.txt allows the bot
[PASS] CRAWL · page returns success (200)
[PASS] INDEX · no noindex directive
[PASS] INDEX · has a <title>
[PASS] INDEX · declares a canonical URL
──────────────────────────────────────────────
VERDICT: crawlable + indexable. (Ranking is a separate fight.)

Ceiling to know: this is a static fetch. If a page renders its content with JavaScript, urllib won’t see it — Google does a second render pass that this script skips. That gap (JS rendering) is a whole lesson later.

Retrieval practice · no peeking

Which gate, which control?

Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4
A page is blocked in robots.txt but has no noindex. What happens?
Question 2 / 4
You add noindex to a page that is also listed in your sitemap. Result?
Question 3 / 4
A page is fully crawlable but never appears in the index. Which stage is blocking it?
Question 4 / 4
Your script reads the X-Robots-Tag response header to detect…

Primary source: read this next (≈10 min)
“Block search indexing with noindex” — Google Search Central

States the block-vs-noindex trap directly: a blocked page can't have its noindex read. Also see the robots.txt introduction.

Stuck or curious? This agent is your teacher. Ask it anything — “show me a real robots.txt”, “do Claude and Perplexity retrieve differently?” — followups are the fastest way to learn.