Lesson 0002 · Crawl + Index stages
Crawlable ≠ Indexable
Two separate gates, two separate failure modes. Most "why isn't my page ranking?" bugs die here — before ranking is even in play.
Recap from Lesson 0001: classic search runs crawl → index → rank → serve. Today we zoom into the first two — because a page that fails either can never rank, and as a builder these are the two gates your tooling can check automatically.
Developers conflate these constantly. “Crawlable” and “indexable” are independent. A page can be perfectly crawlable and still refuse to be indexed. A page can be marked index-me and never get crawled. They are two different gates with two different controls.[1]
The two gates
Can the bot fetch it?
Controlled before the page is read.
- robots.txt must allow the path
- URL must return 200 (not 404 / 5xx / endless redirect)
- must be discoverable (sitemap or a link)
Will the engine keep it?
Controlled inside the fetched page.
- no noindex (meta tag or X-Robots-Tag header)
- has real content + a <title>
- a sane canonical (which URL is the "real" one)
Blocking a page in robots.txt does not remove it from the index. If the bot is blocked from crawling, it never reads your noindex tag, so to deindex a page you must allow the crawl and serve noindex. Block + noindex together = noindex never seen.[2] Crawl control and index control point in opposite directions.
The checklist, and what controls each
| Gate | Check | Control lives in |
|---|---|---|
| CRAWL | Path allowed | /robots.txt |
| CRAWL | Returns 200 | HTTP status |
| INDEX | No noindex | <meta name="robots"> or X-Robots-Tag header |
| INDEX | Has title | <title> in HTML |
| INDEX | Has canonical | <link rel="canonical"> |
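The block-vs-noindex trap can be reproduced offline. This is a minimal sketch, not the code in tools/crawl_audit.py: the helper names `crawl_allowed` and `noindex_will_be_seen`, the two robots.txt bodies, and the example.com URL are all assumptions made for illustration. It feeds robots.txt text straight into the stdlib parser, so no network access is needed.

```python
import urllib.robotparser


def crawl_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """CRAWL gate: does this robots.txt text let `agent` fetch `url`?"""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse text directly, no HTTP fetch
    return rp.can_fetch(agent, url)


def noindex_will_be_seen(robots_txt: str, url: str, page_has_noindex: bool) -> bool:
    """A noindex directive only counts if the bot is allowed to read the page."""
    return crawl_allowed(robots_txt, url) and page_has_noindex


# Hypothetical robots.txt bodies and URL, for illustration only.
BLOCKING = "User-agent: *\nDisallow: /private/\n"
OPEN = "User-agent: *\nDisallow:\n"
URL = "https://example.com/private/page.html"

if __name__ == "__main__":
    # Blocked + noindex: the directive is never read.
    print(noindex_will_be_seen(BLOCKING, URL, page_has_noindex=True))
    # Crawl allowed + noindex: the directive is read.
    print(noindex_will_be_seen(OPEN, URL, page_has_noindex=True))
```

The second function is deliberately just an `and`: the index control is only reachable through the crawl gate, which is why blocking a page is the wrong way to deindex it.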
The skill: audit a URL with a script
You already have it — tools/crawl_audit.py in this workspace. Stdlib only, no install. The core: urllib.robotparser answers the crawl gate, a tiny HTML parser answers the index gate.
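    # the heart of it: robots gate, then index signals
    import urllib.robotparser
    from urllib.parse import urlsplit

    def origin(url):
        # assumed helper (the script's own version is not shown): scheme + host
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}"

    def robots_allows(url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(origin(url) + "/robots.txt")
        rp.read()  # fetches and parses robots.txt over the network
        return rp.can_fetch("Googlebot", url)  # stdlib does the parsing

    # index gate: noindex? title? canonical? (from the page's <head>)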
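The index gate named in that last comment could look roughly like the following. This is a sketch built on the stdlib `html.parser`, not the parser that ships in tools/crawl_audit.py; the class name `IndexSignals`, the function `index_signals`, and the shape of the returned dict are assumptions. It treats `none` as equivalent to `noindex`, and it only sees the static HTML it is given.

```python
from html.parser import HTMLParser


class IndexSignals(HTMLParser):
    """Collects the three INDEX-gate signals: noindex, <title>, canonical."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.title = ""
        self.canonical = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta" and a.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in a.get("content", "").split(",")]
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href") or None
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def index_signals(html: str) -> dict:
    """Parse an HTML string and report the index-gate signals found in it."""
    parser = IndexSignals()
    parser.feed(html)
    parser.close()
    return {
        "noindex": parser.noindex,
        "has_title": bool(parser.title.strip()),
        "canonical": parser.canonical,
    }


if __name__ == "__main__":
    page = (
        '<html><head><title>Hello</title>'
        '<link rel="canonical" href="https://example.com/a">'
        '<meta name="robots" content="noindex, follow"></head></html>'
    )
    print(index_signals(page))
```

Feeding it the same page with and without the robots meta tag reproduces the INDEX flip while leaving the crawl gate untouched.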
The feedback loop is the lesson:
- Self-check the parser (offline):
  python3 tools/crawl_audit.py --demo
- Audit a page that passes, such as Google's own docs:
  python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works
- Now audit a page you own. Then deliberately break it: add <meta name="robots" content="noindex"> to a test page and watch the INDEX gate flip to FAIL while CRAWL still passes. That contrast is the lesson.
    $ python3 tools/crawl_audit.py https://developers.google.com/search/docs/fundamentals/how-search-works
    Audit: https://developers.google.com/.../how-search-works
    ──────────────────────────────────────────────
    [PASS] CRAWL · robots.txt allows the bot
    [PASS] CRAWL · page returns success (200)
    [PASS] INDEX · no noindex directive
    [PASS] INDEX · has a <title>
    [PASS] INDEX · declares a canonical URL
    ──────────────────────────────────────────────
    VERDICT: crawlable + indexable. (Ranking is a separate fight.)
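The "no noindex directive" check has two inputs, because noindex can arrive in the X-Robots-Tag response header as well as in the meta tag. The header half might be sketched as below. The function name `header_noindex` is hypothetical, and the parsing is deliberately simplified: it treats a bot-scoped value such as `googlebot: noindex` as noindex for every bot, and it counts `none` as noindex.

```python
from typing import Optional


def header_noindex(x_robots_tag: Optional[str]) -> bool:
    """Return True if an X-Robots-Tag header value carries noindex (or none).

    Simplification: an optional "botname:" prefix is ignored, so a directive
    scoped to one bot is reported as noindex for all bots.
    """
    if not x_robots_tag:
        return False
    for token in x_robots_tag.split(","):
        # Drop any "botname:" prefix and keep the directive itself.
        directive = token.rsplit(":", 1)[-1].strip().lower()
        if directive in ("noindex", "none"):
            return True
    return False


if __name__ == "__main__":
    for value in ("noindex", "googlebot: noindex, nofollow", "nofollow", None):
        print(repr(value), header_noindex(value))
```

With a live response, the value would typically come from something like `response.headers.get("X-Robots-Tag")` on a `urllib.request.urlopen` result. A page counts as noindexed if either the header or the meta tag says so.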
Ceiling to know: this is a static fetch. If a page renders its content with JavaScript, urllib won’t see it — Google does a second render pass that this script skips. That gap (JS rendering) is a whole lesson later.
Retrieval practice · no peeking
Which gate, which control?
Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.
- … robots.txt but has no noindex. What happens?
- … noindex to a page that is also listed in your sitemap. Result?
- … X-Robots-Tag response header to detect…
States the block-vs-noindex trap directly: a blocked page can't have its noindex read. Also see the robots.txt introduction.