SEO·AEO for builders

Lesson 0009 · Capstone

The Site Audit

Eight lessons checked one signal on one URL. A real site has hundreds of pages. Crawl it like a bot and run every gate at once.

Recap: Lessons 0002–0008 each took one signal (robots/noindex, title, canonical, structured data, an <h1>, body depth) and checked it on a single URL. That’s the unit test. This is the integration test.

A real site is hundreds of pages, and the signal that’s broken is rarely broken on the page you happened to look at; it’s broken in a template, so it’s broken on every page that uses it. To find that, you can’t audit one URL. You have to crawl the site the way a bot does and run every gate on every page at once.

Your win: point one script at a domain and get back a per-page list and a site-wide rollup (“canonical missing on 2/3 pages”) that turns a vague worry into a ranked fix list.

Crawl like a bot: BFS from a seed

A crawler doesn’t have your sitemap memorized. It starts at one URL, reads the links, queues the ones it hasn’t seen, and repeats, breadth-first, staying on the site. Two rules keep it honest: it only enqueues links on the seed’s host, and it stops once it has collected max_pages pages.

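# the heart of it: BFS over internal links, capped
from collections import deque
from urllib.parse import urljoin, urldefrag

# fetch, parse and domain are the lesson's own helpers (not shown here)
host = domain(seed)
seen, out, queue = set(), [], deque([seed])
while queue and len(out) < max_pages:
    url = queue.popleft()                      # FIFO pop keeps the walk breadth-first
    if url in seen: continue                   # a URL can be queued twice; audit it once
    seen.add(url)
    resp = fetch(url, ua="Googlebot")
    pg = parse(resp.body)                      # one seolib parse → every signal
    out.append((url, resp, pg))
    for href in pg.links:
        nxt = urldefrag(urljoin(url, href)).url  # relative → absolute, drop #fragment
        if domain(nxt) == host and nxt not in seen:
            queue.append(nxt)                  # internal + unseen → enqueue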
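
The loop above leans on helpers that are not shown. As a minimal sketch of the same idea that runs on its own, the version below swaps the network for a canned in-memory site and extracts links with the standard library. The SITE dictionary, the LinkCollector class and the crawl function are illustrative names, not the lesson's tooling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit

# Hypothetical canned site: URL -> HTML body (stands in for real fetches).
SITE = {
    "https://acme.test/": '<a href="/pricing">Pricing</a> '
                          '<a href="https://other-site.com/">Partner</a>',
    "https://acme.test/pricing": '<a href="/">Home</a> <a href="blog/x#top">Post</a>',
    "https://acme.test/blog/x": '<a href="/pricing">Pricing</a>',
}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, max_pages=50):
    """Breadth-first walk of same-host links; returns URLs in visit order."""
    host = urlsplit(seed).netloc
    seen, order, queue = set(), [], deque([seed])
    while queue and len(order) < max_pages:      # rule 2: page cap
        url = queue.popleft()
        if url in seen or url not in SITE:       # skip repeats and unknown pages
            continue
        seen.add(url)
        order.append(url)
        collector = LinkCollector()
        collector.feed(SITE[url])
        for href in collector.links:
            nxt = urldefrag(urljoin(url, href)).url   # absolute URL, no #fragment
            if urlsplit(nxt).netloc == host and nxt not in seen:  # rule 1: same host
                queue.append(nxt)
    return order


pages = crawl("https://acme.test/")
for page in pages:
    print(page)
```

The link to other-site.com is parsed but never enqueued, because its host differs from the seed's. Lowering max_pages stops the walk early without changing the visit order.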

Every gate, every page

For each crawled page, run the whole checklist, the same predicates the per-lesson tools owned, now in one table:

Gate    Check                                  First seen in
CRAWL   page returns success (2xx/3xx)         crawl_audit.py · 0002
INDEX   no noindex (meta or X-Robots-Tag)      0002
INDEX   has a <title>                          0002
INDEX   declares a canonical                   0002
INDEX   exactly one <h1>                       structure · 0004
INDEX   has JSON-LD structured data            schema_tool.py · 0003
INDEX   enough body text to retrieve a chunk   geo_lint.py · 0004

The rollup is the point: one page failing a check is a typo. The same check failing across many pages is a template bug, and that’s the highest-leverage fix you have, because correcting one layout file heals every page at once. The per-page list finds the typo; the site rollup finds the template bug.
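
To make the checklist and the rollup concrete, here is a rough sketch of how the seven gates could be expressed as predicates over one parsed page and then counted across pages. It is not the lesson's site_audit.py: the Signals class, failed_gates, rollup and the three canned pages are made up for illustration, and min_words is an assumed threshold because the lesson does not state one.

```python
from collections import Counter
from html.parser import HTMLParser


class Signals(HTMLParser):
    """Collect the handful of on-page signals the gates need."""

    def __init__(self):
        super().__init__()
        self.title = self.canonical = self.jsonld = self.noindex = False
        self.h1 = 0
        self.words = 0
        self._in = None   # "title", "script" or "style" while inside that tag

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "").lower() for k, v in attrs}
        if tag == "h1":
            self.h1 += 1
        elif tag == "link" and "canonical" in a.get("rel", "").split() and a.get("href"):
            self.canonical = True
        elif tag == "meta" and a.get("name") == "robots" and "noindex" in a.get("content", ""):
            self.noindex = True
        elif tag == "script" and a.get("type") == "application/ld+json":
            self.jsonld = True
        if tag in ("title", "script", "style"):
            self._in = tag

    def handle_endtag(self, tag):
        if tag == self._in:
            self._in = None

    def handle_data(self, data):
        if self._in == "title":
            self.title = self.title or bool(data.strip())
        elif self._in is None:            # visible text only
            self.words += len(data.split())


def failed_gates(status, headers, html, min_words=50):
    """Return the labels of every gate this page fails (min_words is assumed)."""
    s = Signals()
    s.feed(html)
    robots = {k.lower(): v for k, v in headers.items()}.get("x-robots-tag", "").lower()
    checks = [
        ("page returns success", 200 <= status < 400),
        ("no noindex directive", not s.noindex and "noindex" not in robots),
        ("has a <title>", s.title),
        ("declares a canonical", s.canonical),
        ("exactly one <h1>", s.h1 == 1),
        ("has JSON-LD structured data", s.jsonld),
        ("enough body text", s.words >= min_words),
    ]
    return [label for label, ok in checks if not ok]


def rollup(per_page):
    """Count, per gate, how many pages failed it."""
    return Counter(label for fails in per_page.values() for label in fails)


# Hypothetical pages: one clean, one missing template tags, one nearly empty.
GOOD = ('<html><head><title>Acme</title><link rel="canonical" href="https://acme.test/">'
        '<script type="application/ld+json">{"@type": "Organization"}</script></head>'
        '<body><h1>Acme</h1><p>We build anvils for discerning coyotes.</p></body></html>')
BARE = ('<html><head><title>Pricing</title></head>'
        '<body><h1>Pricing</h1><p>Plans start at ten dollars.</p></body></html>')
THIN = ('<html><head><meta name="robots" content="noindex"></head>'
        '<body><p>Coming soon.</p></body></html>')

per_page = {
    "https://acme.test/": failed_gates(200, {}, GOOD, min_words=5),
    "https://acme.test/pricing": failed_gates(200, {}, BARE, min_words=5),
    "https://acme.test/blog/x": failed_gates(200, {}, THIN, min_words=5),
}
totals = rollup(per_page)
for label, count in totals.items():
    print(f"{label}: {count}/{len(per_page)} pages failing")
```

The per-page lists answer "what is wrong with this URL", while the Counter answers "which check fails most often". In this canned example the canonical and JSON-LD gates each fail on two of three pages, which is the pattern that points at a shared template.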

Run it

Do this now. The capstone pulls every prior lesson into one command:

  1. Self-check offline (crawls a canned 3-page site): python3 tools/site_audit.py --demo
  2. Audit a small real site you own: python3 tools/site_audit.py https://your-site.example/
  3. Read the rollup, not just the per-page list. The check with the highest “N/total failing” is your first fix, almost always a template, not a page.

$ python3 tools/site_audit.py https://acme.test/

Site audit: https://acme.test/ (3 pages)
──────────────────────────────────────────────
[PASS] clean      · https://acme.test/
[WARN] 2 issue(s) · https://acme.test/pricing
[WARN] 6 issue(s) · https://acme.test/blog/x
──────────────────────────────────────────────
VERDICT: 2/3 pages have at least one gate failing - fix worst first.
$ # …findings rolled up across the site
[PASS] CRAWL    · page returns success        all pages OK
[WARN] INDEX    · no noindex directive        1/3 pages failing
[WARN] INDEX    · has a <title>               1/3 pages failing
[WARN] INDEX    · declares a canonical        2/3 pages failing
[WARN] INDEX    · exactly one <h1>            1/3 pages failing
[WARN] INDEX    · has JSON-LD structured data 2/3 pages failing
[WARN] INDEX    · enough body text            1/3 pages failing
──────────────────────────────────────────────
VERDICT: canonical + JSON-LD fail on 2/3 - one template fix clears both.
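
Reading the rollup "worst first" is just a sort. The sketch below takes the failing counts from the output above and orders them, tagging any check that fails on more than half the pages as a likely template issue. The ranked_fixes function and the 0.5 cutoff are assumptions for illustration, not something the lesson's tool is known to do.

```python
# Failing-page counts copied from the rollup output above, in the same order.
TOTAL_PAGES = 3
failing = {
    "page returns success": 0,
    "no noindex directive": 1,
    "has a <title>": 1,
    "declares a canonical": 2,
    "exactly one <h1>": 1,
    "has JSON-LD structured data": 2,
    "enough body text": 1,
}


def ranked_fixes(failing, total, template_share=0.5):
    """Return (check, count, kind) for every failing check, worst first.

    template_share is an assumed heuristic: a check failing on more than
    this share of pages is flagged as a probable template bug.
    """
    rows = []
    for check, count in failing.items():
        if count == 0:
            continue
        kind = "template?" if count / total > template_share else "page"
        rows.append((check, count, kind))
    # sorted() is stable, so ties keep the rollup's original order.
    return sorted(rows, key=lambda row: row[1], reverse=True)


fixes = ranked_fixes(failing, TOTAL_PAGES)
for check, count, kind in fixes:
    print(f"{count}/{TOTAL_PAGES}  {check}  ({kind})")
```

On a three-page site the cutoff is crude. On a site with hundreds of pages, a check failing on most of them is a much stronger hint that one layout file is responsible.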

Ceilings to know: this is a static fetch (no browser), so JS-rendered content is invisible. That is the render gap from Lesson 0006. It walks only links it can find (orphan pages need your sitemap). And it audits machine-checkable signals, not whether the writing is any good. The map is not the territory; it just tells you which gates are shut.
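
One way to cover the orphan-page ceiling is to compare the crawl against the sitemap. The sketch below parses a sitemap in the standard sitemaps.org format and lists URLs the crawl never reached. The sample sitemap, the /old-landing URL and the function names are hypothetical, and a real sitemap would be fetched rather than inlined.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for the canned site; /old-landing has no inbound links.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://acme.test/</loc></url>
  <url><loc>https://acme.test/pricing</loc></url>
  <url><loc>https://acme.test/blog/x</loc></url>
  <url><loc>https://acme.test/old-landing</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def orphans(xml_text, crawled):
    """URLs the sitemap lists but the link crawl never reached."""
    return sorted(sitemap_urls(xml_text) - set(crawled))


crawled = [
    "https://acme.test/",
    "https://acme.test/pricing",
    "https://acme.test/blog/x",
]
missing = orphans(SITEMAP, crawled)
print(missing)
```

Any URL this returns could be fed back into the audit as an extra seed, so the same gates run on pages no link points to.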

Retrieval practice · no peeking

Audit the whole site, not one page

Answer from memory; that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4
How does the auditor decide which pages to check?

Question 2 / 4
Why does the crawl skip links to other-site.com?

Question 3 / 4
The rollup says canonical: 2/3 pages failing. The builder fix is usually to…

Question 4 / 4
The audit fetches each page with urllib (no browser). What can it miss?

Primary source: read this next (≈15 min)
“SEO Starter Guide: The Basics”, Google Search Central

Google's own end-to-end list of what a site should get right β€” the same gates this capstone audits, from the source that defines them.
