Lesson 0009 · Capstone

The Site Audit

The earlier lessons each checked one signal on one URL. A real site has hundreds of pages. Crawl it like a bot and run every gate at once.

Recap: Lessons 0002–0008 each took one signal (robots/noindex, title, canonical, structured data, a single <h1>, body depth) and checked it on a single URL. That was the unit test. This is the integration test.

A real site is hundreds of pages, and the signal that’s broken is rarely broken on the page you happened to look at — it’s broken in a template, so it’s broken on every page that uses it. To find that, you can’t audit one URL. You have to crawl the site the way a bot does and run every gate on every page at once.

Your win: point one script at a domain and get back a per-page list and a site-wide rollup — “canonical missing on 2/3 pages” — that turns a vague worry into a ranked fix list.

Crawl like a bot: BFS from a seed

A crawler doesn’t have your sitemap memorized. It starts at one URL, reads the links, queues the ones it hasn’t seen, and repeats — breadth-first — staying on the site. Two rules keep it honest:

One domain. Follow links whose host matches the seed (via seolib.domain); skip off-site links — other people’s sites aren’t yours to audit.
A cap. Stop after N pages. A polite crawl is bounded; an unbounded one hammers the server you’re trying to help.

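# the heart of it: BFS over internal links, capped
from collections import deque
from urllib.parse import urldefrag, urljoin
# domain() and parse() are the seolib helpers; fetch() is the course's fetch helper

host = domain(seed)
seen, out, queue = set(), [], deque([seed])
while queue and len(out) < max_pages:
    url = queue.popleft()                      # FIFO order is what makes this breadth-first
    if url in seen:                            # a URL can be queued twice before its first visit
        continue
    seen.add(url)
    resp = fetch(url, ua="Googlebot")
    pg = parse(resp.body)                      # one seolib parse → every signal
    out.append((url, resp, pg))
    for href in pg.links:
        nxt = urldefrag(urljoin(url, href)).url    # relative → absolute, drop any #fragment
        if domain(nxt) == host and nxt not in seen:
            queue.append(nxt)                  # internal + unseen → enqueue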
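
To try the loop without seolib or a network, here is a minimal stdlib-only sketch. The canned SITE dict and the crawl_order helper are hypothetical stand-ins for fetch and parse; only the queue discipline, the same-host filter, and the cap mirror the snippet above.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

# Hypothetical canned site: URL -> hrefs found on that page.
SITE = {
    "https://acme.test/": ["/pricing", "/blog/x", "https://other-site.com/"],
    "https://acme.test/pricing": ["/", "/blog/x"],
    "https://acme.test/blog/x": ["/", "/blog/y"],
    "https://acme.test/blog/y": ["/"],
}

def crawl_order(seed, site, max_pages):
    """Return URLs in the order a capped, same-host BFS visits them."""
    host = urlsplit(seed).hostname
    seen, order, queue = set(), [], deque([seed])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in seen or url not in site:
            continue                      # already visited, or nothing to fetch
        seen.add(url)
        order.append(url)
        for href in site[url]:
            nxt = urljoin(url, href)      # relative -> absolute
            if urlsplit(nxt).hostname == host and nxt not in seen:
                queue.append(nxt)         # internal + unseen -> enqueue
    return order

print(crawl_order("https://acme.test/", SITE, max_pages=3))
```

With max_pages=3 the crawl stops after the home page, /pricing, and /blog/x. Raising the cap to 10 reaches /blog/y as a fourth page, and other-site.com is never queued.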

Every gate, every page

For each crawled page, run the whole checklist — the same predicates the per-lesson tools owned, now in one table:

Gate	Check	First seen in
CRAWL	page returns success (2xx/3xx)	`crawl_audit.py` · 0002
INDEX	no `noindex` (meta or `X-Robots-Tag`)	0002
INDEX	has a `<title>`	0002
INDEX	declares a `canonical`	0002
INDEX	exactly one `<h1>`	structure · 0004
INDEX	has JSON-LD structured data	`schema_tool.py` · 0003
INDEX	enough body text to retrieve a chunk	`geo_lint.py` · 0004
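
One way to hold that table in code is a list of (gate, label, predicate) rows. This is a sketch under assumptions, not the layout of tools/site_audit.py: the Page record, its field names, and the MIN_WORDS threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Hypothetical per-page record; field names are assumptions, not seolib's."""
    status: int
    noindex: bool
    title: str
    canonical: str
    h1_count: int
    jsonld_count: int
    word_count: int

MIN_WORDS = 150  # assumed threshold for "enough body text"

# One row per line of the checklist: (gate, label, predicate that returns True on pass).
GATES = [
    ("CRAWL", "page returns success", lambda p: 200 <= p.status < 400),
    ("INDEX", "no noindex directive", lambda p: not p.noindex),
    ("INDEX", "has a <title>", lambda p: bool(p.title.strip())),
    ("INDEX", "declares a canonical", lambda p: bool(p.canonical)),
    ("INDEX", "exactly one <h1>", lambda p: p.h1_count == 1),
    ("INDEX", "has JSON-LD structured data", lambda p: p.jsonld_count > 0),
    ("INDEX", "enough body text", lambda p: p.word_count >= MIN_WORDS),
]

def failed_checks(page):
    """Return the labels of every check this page fails, in table order."""
    return [label for _gate, label, ok in GATES if not ok(page)]

pricing = Page(status=200, noindex=False, title="Pricing", canonical="",
               h1_count=1, jsonld_count=0, word_count=400)
print(failed_checks(pricing))
```

Keeping the predicates in one list means a new gate is one new row, and the per-page report and the site rollup both read from the same source.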

The rollup is the point: one page failing a check is a typo. The same check failing across many pages is a template bug — and that’s the highest-leverage fix you have, because correcting one layout file heals every page at once. The per-page list finds the typo; the site rollup finds the template bug.

Run it

Do this now: the capstone pulls every prior lesson into one command.

Self-check offline (crawls a canned 3-page site): python3 tools/site_audit.py --demo
Audit a small real site you own: python3 tools/site_audit.py https://your-site.example/
Read the rollup, not just the per-page list. The check with the highest “N/total failing” is your first fix — almost always a template, not a page.

$ python3 tools/site_audit.py https://acme.test/

Site audit: https://acme.test/ (3 pages)
──────────────────────────────────────────────
[PASS] clean      · https://acme.test/
[WARN] 2 issue(s) · https://acme.test/pricing
[WARN] 6 issue(s) · https://acme.test/blog/x
──────────────────────────────────────────────
VERDICT: 2/3 pages have at least one gate failing — fix worst first.

$ # …findings rolled up across the site
[PASS] CRAWL    · page returns success        all pages OK
[WARN] INDEX    · no noindex directive        1/3 pages failing
[WARN] INDEX    · has a <title>               1/3 pages failing
[WARN] INDEX    · declares a canonical        2/3 pages failing
[WARN] INDEX    · exactly one <h1>            1/3 pages failing
[WARN] INDEX    · has JSON-LD structured data 2/3 pages failing
[WARN] INDEX    · enough body text            1/3 pages failing
──────────────────────────────────────────────
VERDICT: canonical + JSON-LD fail on 2/3 — one template fix clears both.

Ceilings to know: this is a static fetch (no browser), so JS-rendered content is invisible — the render gap from Lesson 0006. It walks only links it can find (orphan pages need your sitemap). And it audits machine-checkable signals — not whether the writing is any good. The map is not the territory; it just tells you which gates are shut.
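
The orphan-page ceiling can be narrowed by comparing the crawl against your sitemap. The sketch below is an assumption about how you might do that, not a feature of site_audit.py: it parses a standard sitemap document and lists URLs the crawl never reached. The SITEMAP text and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text}

def orphans(xml_text, crawled):
    """URLs listed in the sitemap that the link crawl never reached."""
    return sorted(sitemap_urls(xml_text) - set(crawled))

# Hypothetical sitemap for the demo site.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://acme.test/</loc></url>
  <url><loc>https://acme.test/pricing</loc></url>
  <url><loc>https://acme.test/legacy-landing</loc></url>
</urlset>"""

crawled = ["https://acme.test/", "https://acme.test/pricing",
           "https://acme.test/blog/x"]
print(orphans(SITEMAP, crawled))
```

Any URL this returns is either unlinked or sits beyond the crawl cap; adding those URLs to the queue as extra seeds would bring them into the audit.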

Retrieval practice · no peeking

Audit the whole site, not one page

Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4

How does the auditor decide which pages to check?

Question 2 / 4

Why does the crawl skip links to other-site.com?

Question 3 / 4

The rollup says canonical — 2/3 pages failing. The builder fix is usually to…

Question 4 / 4

The audit fetches each page with urllib (no browser). What can it miss?

Primary source — read this next (≈15 min)

“SEO Starter Guide: The Basics” — Google Search Central

Google's own end-to-end list of what a site should get right — the same gates this capstone audits, from the source that defines them.
