Lesson 0002 · Crawl + Index stages

Crawlable ≠ Indexable

Two separate gates, two separate failure modes. Most "why isn't my page ranking?" bugs die here — before ranking is even in play.

Recap from Lesson 0001: classic search runs crawl → index → rank → serve. Today we zoom into the first two — because a page that fails either can never rank, and as a builder these are the two gates your tooling can check automatically.

Developers conflate these constantly. “Crawlable” and “indexable” are independent. A page can be perfectly crawlable and still refuse to be indexed. A page can be marked index-me and never get crawled. They are two different gates with two different controls.^[1]

Your win: run a script that audits any URL through both gates and tells you exactly which one is failing — the tightest possible feedback loop on “can this page even show up?”

The two gates

CRAWL

Can the bot fetch it?

Controlled before the page is read.

robots.txt must allow the path
URL must return 200 (not 404 / 5xx / endless redirect)
must be discoverable (sitemap or a link)
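
The first item in the CRAWL list can be checked without touching the network. As a minimal sketch, the snippet below feeds a made-up robots.txt to the standard library's urllib.robotparser and asks whether a bot may fetch two paths. The rules, the example.com URLs, and the helper name `allowed` are assumptions for illustration, not part of tools/crawl_audit.py.

```python
import urllib.robotparser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse the text directly, no HTTP request
    return rp.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/post"))     # True
print(allowed(ROBOTS_TXT, "Googlebot", "https://example.com/private/page"))  # False
```

Note that the stdlib parser applies the first matching rule, which may differ from how a given search engine resolves overlapping Allow/Disallow lines.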

INDEX

Will the engine keep it?

Controlled inside the fetched page.

no noindex (meta tag or X-Robots-Tag header)
has real content + a <title>
a sane canonical (which URL is the "real" one)

The trap that breaks everyone: blocking a page in robots.txt does not remove it from the index. If the bot is blocked from crawling, it never reads your noindex tag, so to deindex a page you must allow the crawl and serve noindex. Block + noindex together = noindex never seen.^[2] The fix is counterintuitive: to get a page out of the index, the crawl gate has to stay open.
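
The four combinations of "blocked in robots.txt" and "serves noindex" can be laid out as a small truth table. This is a sketch of the reasoning above, not of any engine's actual pipeline; the function name `deindex_outcome` and the wording of the outcomes are assumptions.

```python
def deindex_outcome(robots_blocked: bool, serves_noindex: bool) -> str:
    """Describe what happens to a page for each block/noindex combination."""
    if robots_blocked:
        # The bot never fetches the page, so any noindex on it goes unread.
        return "not crawled; noindex (if any) is never seen, URL may stay indexed"
    if serves_noindex:
        return "crawled; noindex is read, page is dropped from the index"
    return "crawled; eligible for indexing"

# Print every combination so the asymmetry is visible at a glance.
for blocked in (False, True):
    for noindex in (False, True):
        print(f"blocked={blocked!s:5} noindex={noindex!s:5} -> "
              f"{deindex_outcome(blocked, noindex)}")
```

Only the row with blocked=False and noindex=True actually deindexes the page.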

The checklist, and what controls each

Gate	Check	Control lives in
CRAWL	Path allowed	`/robots.txt`
CRAWL	Returns 200	HTTP status
INDEX	No `noindex`	`<meta name="robots">` or `X-Robots-Tag` header
INDEX	Has title	`<title>` in HTML
INDEX	Has canonical	`<link rel="canonical">`
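
Both controls in the `noindex` row carry the same kind of value: a comma-separated list of directives. A minimal sketch of reading that value is shown below. The helper name `has_noindex` is hypothetical, and the handling is deliberately simplified: it treats `none` as implying noindex (Google documents `none` as equivalent to `noindex, nofollow`), and it counts a directive scoped to any bot (for example `googlebot: noindex`) as a hit rather than matching the prefix against a specific crawler.

```python
def has_noindex(value: str) -> bool:
    """True if a robots directive string contains noindex.

    Works on a <meta name="robots"> content attribute or an
    X-Robots-Tag header value.
    """
    for token in value.lower().split(","):
        # Drop an optional "botname:" prefix, e.g. "googlebot: noindex".
        directive = token.rsplit(":", 1)[-1].strip()
        if directive in ("noindex", "none"):
            return True
    return False

print(has_noindex("index, follow"))       # False
print(has_noindex("NOINDEX, nofollow"))   # True
print(has_noindex("googlebot: noindex"))  # True
```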

The skill: audit a URL with a script

You already have it — tools/crawl_audit.py in this workspace. Stdlib only, no install. The core: urllib.robotparser answers the crawl gate, a tiny HTML parser answers the index gate.

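# the heart of it: robots gate, then index signals
import urllib.robotparser
from urllib.parse import urlsplit

def origin(url):
    # not shown in this excerpt; a minimal stand-in that returns scheme://host
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

def robots_allows(url):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(origin(url) + "/robots.txt")
    rp.read()                               # fetches robots.txt over the network
    return rp.can_fetch("Googlebot", url)   # stdlib does the parsing

# index gate: noindex? title? canonical?  (from the page's <head>)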
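
The index-gate half is only hinted at in the comment above. One way such a "tiny HTML parser" might look, using the stdlib html.parser, is sketched here. The class name `IndexSignals`, its attributes, and the sample HTML are assumptions, not the actual contents of tools/crawl_audit.py. The sketch only looks at `<meta name="robots">` and ignores bot-specific meta tags and the `X-Robots-Tag` header.

```python
from html.parser import HTMLParser

class IndexSignals(HTMLParser):
    """Collect the three index-gate signals from static HTML."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.title = ""
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # tag and attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noindex" in (a.get("content") or "").lower():
                self.noindex = True
        elif tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # title text can arrive in several chunks

# Hypothetical page, for illustration only.
SAMPLE = """<html><head>
<title>Test page</title>
<meta name="robots" content="noindex">
<link rel="canonical" href="https://example.com/test">
</head><body>Hello</body></html>"""

signals = IndexSignals()
signals.feed(SAMPLE)
print(signals.noindex, signals.title.strip(), signals.canonical)
# True Test page https://example.com/test
```

Removing the meta line from SAMPLE flips `noindex` back to False while the title and canonical stay the same, which mirrors the INDEX-gate checks in the report.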

Do this now: the feedback loop is the lesson.

Self-check the parser (offline): python3 tools/crawl_audit.py --demo
Audit a page that passes — Google’s own docs: python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works
Now audit a page you own. Then deliberately break it: add <meta name="robots" content="noindex"> to a test page and watch the INDEX gate flip to FAIL while CRAWL still passes. That contrast is the lesson.

$ python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works

Audit: https://developers.google.com/.../how-search-works
──────────────────────────────────────────────
[PASS] CRAWL · robots.txt allows the bot
[PASS] CRAWL · page returns success (200)
[PASS] INDEX · no noindex directive
[PASS] INDEX · has a <title>
[PASS] INDEX · declares a canonical URL
──────────────────────────────────────────────
VERDICT: crawlable + indexable. (Ranking is a separate fight.)

Ceiling to know: this is a static fetch. If a page renders its content with JavaScript, urllib won’t see it — Google does a second render pass that this script skips. That gap (JS rendering) is a whole lesson later.

Retrieval practice · no peeking

Which gate, which control?

Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4

A page is blocked in robots.txt but has no noindex. What happens?

Question 2 / 4

You add noindex to a page that is also listed in your sitemap. Result?

Question 3 / 4

A page is fully crawlable but never appears in the index. Which stage is blocking it?

Question 4 / 4

Your script reads the X-Robots-Tag response header to detect…

Primary source — read this next (≈10 min)

“Block search indexing with noindex” — Google Search Central

States the block-vs-noindex trap directly: a blocked page can't have its noindex read. Also see the robots.txt introduction.

Stuck or curious? This agent is your teacher. Ask it anything — “show me a real robots.txt”, “do Claude and Perplexity retrieve differently?” — followups are the fastest way to learn.