Lesson 0009 · Capstone
The Site Audit
Eight lessons checked one signal on one URL. A real site has hundreds of pages. Crawl it like a bot and run every gate at once.
Recap: Lessons 0002–0008 each took one signal (robots/noindex, title, canonical, structured data, an <h1>, body depth) and checked it on a single URL. That's the unit test. This is the integration test.
A real site is hundreds of pages, and the signal that's broken is rarely broken on the page you happened to look at; it's broken in a template, so it's broken on every page that uses it. To find that, you can't audit one URL. You have to crawl the site the way a bot does and run every gate on every page at once.
Crawl like a bot: BFS from a seed
A crawler doesn't have your sitemap memorized. It starts at one URL, reads the links, queues the ones it hasn't seen, and repeats, breadth-first, staying on the site. Two rules keep it honest:
- One domain. Follow links whose host matches the seed (via seolib.domain); skip off-site links, because other people's sites aren't yours to audit.
- A cap. Stop after N pages. A polite crawl is bounded; an unbounded one hammers the server you're trying to help.
```python
# the heart of it: BFS over internal links, capped
# (domain, fetch and parse are the course helpers; seed and max_pages are set by the caller)
from urllib.parse import urljoin

host = domain(seed)
seen, out, queue = set(), [], [seed]
while queue and len(out) < max_pages:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    resp = fetch(url, ua="Googlebot")
    pg = parse(resp.body)              # one seolib parse -> every signal
    out.append((url, resp, pg))
    for href in pg.links:
        nxt = urljoin(url, href)       # relative -> absolute
        if domain(nxt) == host and nxt not in seen:
            queue.append(nxt)          # internal + unseen -> enqueue
```
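To try the two rules without touching a network, here is a minimal standalone sketch of the same breadth-first walk using only the standard library. It is not the course's `site_audit.py`: the `SITE` dictionary, `LinkParser`, and `crawl` are hypothetical names invented for this illustration, and a dictionary lookup stands in for the real fetch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical canned site: absolute URL -> HTML body (stands in for fetch()).
SITE = {
    "https://acme.test/": '<a href="/pricing">Pricing</a> <a href="https://other-site.com/">Partner</a>',
    "https://acme.test/pricing": '<a href="/">Home</a> <a href="blog/x#top">Post</a>',
    "https://acme.test/blog/x": '<a href="/pricing">Pricing</a>',
}


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, pages, max_pages=50):
    """Breadth-first walk of `pages`, staying on the seed's host, capped."""
    host = urlsplit(seed).hostname
    seen = {seed}               # marked at enqueue time, so the queue never holds duplicates
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:        # stand-in for a failed fetch
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # relative -> absolute, then drop any #fragment
            nxt, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(nxt).hostname == host and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order
```

Calling `crawl("https://acme.test/", SITE)` visits the home page, then `/pricing`, then `/blog/x`, and never enqueues `other-site.com`. One design difference from the snippet above: this sketch marks a URL as seen when it is enqueued rather than when it is dequeued, which keeps repeated links out of the queue; both approaches visit each page once.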
Every gate, every page
For each crawled page, run the whole checklist, the same predicates the per-lesson tools owned, now in one table:
| Gate | Check | First seen in |
|---|---|---|
| CRAWL | page returns success (2xx/3xx) | crawl_audit.py · 0002 |
| INDEX | no noindex (meta or X-Robots-Tag) | 0002 |
| INDEX | has a <title> | 0002 |
| INDEX | declares a canonical | 0002 |
| INDEX | exactly one <h1> | structure · 0004 |
| INDEX | has JSON-LD structured data | schema_tool.py · 0003 |
| INDEX | enough body text to retrieve a chunk | geo_lint.py · 0004 |
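The rows of the table can be read as one list of predicates applied to a fetched page. The sketch below is an assumption-laden illustration with the standard library's `html.parser`, not the seolib implementation: `SignalParser`, `failed_gates`, and the `MIN_WORDS` threshold of 50 are all hypothetical (the lesson does not state how much body text counts as "enough").

```python
from html.parser import HTMLParser

MIN_WORDS = 50  # assumed threshold; the lesson does not give a number


class SignalParser(HTMLParser):
    """Collect the signals the gate table checks, in a single pass."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.canonical = None
        self.h1_count = 0
        self.has_jsonld = False
        self.noindex = False
        self.words = 0
        self._stack = []  # open <title>/<script>/<style> tags (text not counted as body)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            if "noindex" in (a.get("content") or "").lower():
                self.noindex = True
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.has_jsonld = True
        if tag in ("title", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] == "title":
            self.title += data.strip()
        elif not self._stack:
            self.words += len(data.split())  # visible text only


def failed_gates(status, headers, html):
    """Return the names of the gates this page fails (empty list = clean)."""
    p = SignalParser()
    p.feed(html)
    lowered = {k.lower(): v for k, v in headers.items()}
    header_noindex = "noindex" in lowered.get("x-robots-tag", "").lower()
    checks = [
        ("page returns success", 200 <= status < 400),
        ("no noindex directive", not (p.noindex or header_noindex)),
        ("has a <title>", bool(p.title)),
        ("declares a canonical", bool(p.canonical)),
        ("exactly one <h1>", p.h1_count == 1),
        ("has JSON-LD structured data", p.has_jsonld),
        ("enough body text", p.words >= MIN_WORDS),
    ]
    return [name for name, ok in checks if not ok]
```

Keeping the gates in one ordered list means a new check is one more tuple, and every page is judged by exactly the same predicates, which is what makes the site-wide counts comparable.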
Run it
The capstone pulls every prior lesson into one command:
- Self-check offline (crawls a canned 3-page site):
  python3 tools/site_audit.py --demo
- Audit a small real site you own:
  python3 tools/site_audit.py https://your-site.example/
- Read the rollup, not just the per-page list. The check with the highest "N/total failing" is your first fix, and it is almost always a template, not a page.
```
$ python3 tools/site_audit.py https://acme.test/
Site audit: https://acme.test/ (3 pages)
──────────────────────────────────────────────
[PASS] clean · https://acme.test/
[WARN] 2 issue(s) · https://acme.test/pricing
[WARN] 6 issue(s) · https://acme.test/blog/x
──────────────────────────────────────────────
VERDICT: 2/3 pages have at least one gate failing - fix worst first.
```
```
$ # …findings rolled up across the site
[PASS] CRAWL · page returns success          all pages OK
[WARN] INDEX · no noindex directive          1/3 pages failing
[WARN] INDEX · has a <title>                 1/3 pages failing
[WARN] INDEX · declares a canonical          2/3 pages failing
[WARN] INDEX · exactly one <h1>              1/3 pages failing
[WARN] INDEX · has JSON-LD structured data   2/3 pages failing
[WARN] INDEX · enough body text              1/3 pages failing
──────────────────────────────────────────────
VERDICT: canonical + JSON-LD fail on 2/3 - one template fix clears both.
```
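The rollup is a count per gate across pages, sorted worst first. A minimal sketch of that step follows; the `rollup` function and the `FAILURES` mapping are hypothetical, and the per-page failure lists are reconstructed from the counts in the sample run (the home page is clean, `/blog/x` fails all six INDEX gates, so `/pricing` must account for the second canonical and JSON-LD failure).

```python
from collections import Counter

# Per-page failures, reconstructed from the counts in the sample run.
FAILURES = {
    "https://acme.test/": [],
    "https://acme.test/pricing": [
        "declares a canonical",
        "has JSON-LD structured data",
    ],
    "https://acme.test/blog/x": [
        "no noindex directive",
        "has a <title>",
        "declares a canonical",
        "exactly one <h1>",
        "has JSON-LD structured data",
        "enough body text",
    ],
}


def rollup(failures_by_page):
    """Return (gate, pages_failing, total_pages), most widespread failure first."""
    total = len(failures_by_page)
    counts = Counter(
        gate for gates in failures_by_page.values() for gate in set(gates)
    )
    # Sort by count descending, then by name so ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gate, n, total) for gate, n in ranked]


if __name__ == "__main__":
    for gate, n, total in rollup(FAILURES):
        print(f"[WARN] {gate:<30} {n}/{total} pages failing")
```

The first two rows are the canonical and JSON-LD gates at 2/3, which is the signal to look at the shared template before touching any single page.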
Ceilings to know: this is a static fetch (no browser), so JS-rendered content is invisible; that is the render gap from Lesson 0006. It walks only links it can find (orphan pages need your sitemap). And it audits machine-checkable signals, not whether the writing is any good. The map is not the territory; it just tells you which gates are shut.
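The orphan-page ceiling can be narrowed by comparing the crawl against the sitemap: any URL the sitemap lists that the crawl never reached has no internal link pointing to it. The sketch below assumes a standard `sitemap.xml` using the sitemaps.org namespace; `sitemap_urls`, `orphans`, and the `/landing/old-promo` URL are hypothetical and are not part of the course tools.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for the sample site; the last URL is linked from nowhere.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://acme.test/</loc></url>
  <url><loc>https://acme.test/pricing</loc></url>
  <url><loc>https://acme.test/landing/old-promo</loc></url>
</urlset>"""


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def orphans(xml_text, crawled):
    """URLs the sitemap lists but the link-following crawl never reached."""
    return sorted(sitemap_urls(xml_text) - set(crawled))
```

With the three pages from the sample run as `crawled`, `orphans(SAMPLE_SITEMAP, crawled)` returns only the `/landing/old-promo` URL. Those URLs can then be added to the crawl queue as extra seeds so the same gates run on them.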
Retrieval practice · no peeking
Audit the whole site, not one page
Answer from memory; that effort is what makes it stick. One try each; pick before you read the others.
- …other-site.com?
- canonical: 2/3 pages failing. The builder fix is usually to…
- …urllib (no browser). What can it miss?

Google's own end-to-end list of what a site should get right: the same gates this capstone audits, from the source that defines them.