Lesson 0004 · Retrieve + Generate stages

Writing for Retrieval

The unit an answer engine cites is a self-contained passage, not a page. So shape the passage; there's a measured playbook for which edits actually work.

Recap: Lesson 0002 got the page through the crawl and index gates; Lesson 0003 labelled its facts with JSON-LD. Both make a page eligible. Today is about the prose itself: making it the thing an AI answer engine lifts and credits.

An answer engine doesn’t “read your page” and rank it. It chunks the web into passages, embeds them, retrieves the few most relevant, and synthesizes one answer — citing the passages it used.^[1] So your real unit of optimization is the passage: a chunk that answers a question completely, on its own, in liftable form.
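To make the chunk, embed, retrieve loop concrete, here is a minimal sketch. It assumes a toy "embedding" (a bag of word counts) and blank-line chunking; real engines use learned vectors and their own chunkers, and every function name here is hypothetical. What it shows is why the self-contained passage wins: the rambling paragraph shares no words with the question, so it never surfaces.

```python
import math
import re
from collections import Counter


def split_passages(page: str) -> list[str]:
    """Chunk a page into passages at blank lines (a stand-in for real chunking)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts, not a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


page = """There are many things to consider and a lot of people get confused about it.

Is crawlable the same as indexable? No. Crawlable means a bot may fetch the page; indexable means the engine keeps it."""

passages = split_passages(page)
top = retrieve("is crawlable the same as indexable", passages)
print(top[0])
```

The retrieved passage is the answer-first one; the wall of prose scores zero against this query.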

Your win: run geo_lint.py on any draft and get a 6-signal scorecard: statistics, sources, quotations, answer-first, chunking, and no keyword stuffing. The first three are the edits the research measured as lifting citation visibility; the other three check extractability and guard against stuffing.

Same facts, two shapes

The information is identical. Only the shape differs — and only one is easy to retrieve and quote in isolation.

Wall of prose — hard to lift

There are many things to consider with indexing and a lot of people get confused about it, and honestly crawling is also part of the picture, so when you think about whether your page shows up you have to think holistically about all of these interacting factors over time…

No standalone answer. An engine can't quote a sentence here without the whole rambling context.

Answer-first chunk — liftable

Is crawlable the same as indexable? No — they're two separate gates. Crawlable means a bot may fetch the page; indexable means the engine keeps it.

Per Google, blocking a URL in robots.txt doesn't remove it from the index: a bot that isn't allowed to fetch the page never sees its noindex directive.

A complete answer in two sentences, sourced. Cite-ready as-is.

What actually moves citation — the measured playbook

The GEO paper (Aggarwal et al., KDD 2024) A/B-tested content edits against generative engines on a benchmark of real queries. Three edits stood out, with up to ~40% relative visibility lift.^[1] They’re not stylistic — they make a passage more quotable.

Edit	Why it works	Measured effect
Cite sources	A sourced claim is safer for the engine to repeat and attribute.	top tier
Add quotations	A verbatim quote from an authority is a ready-made liftable unit.	top tier
Add statistics	A concrete number is more citable than a vague adjective.	top tier
Keyword stuffing	The old SEO reflex. The study found it didn’t help — and tended to hurt.	negative

Layer extractability on top: answer-first (lead with the answer, then expand — the inverted pyramid), and chunked (headed, self-contained sections an engine can retrieve one at a time). The FAQPage markup from 0003 is this principle made structural — each Q&A pair is a pre-chunked answer.
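A minimal sketch of what "chunked and answer-first" looks like to a machine, assuming a markdown draft with `#`-style headings. The helper names are hypothetical and this is not how any particular engine chunks; it just splits a draft at its headings and pulls each chunk's lead sentence, which is the part most likely to be lifted.

```python
import re


def split_chunks(markdown: str) -> list[tuple[str, str]]:
    """Split a markdown draft into (heading, body) pairs at ATX headings."""
    chunks: list[tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}[ \t]+(.*)", line)
        if match:
            chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    chunks.append((heading, "\n".join(body).strip()))
    # Drop empty leftovers, e.g. blank lines before the first heading.
    return [(h, b) for h, b in chunks if h or b]


def lead_sentence(body: str) -> str:
    """Return the first sentence of a chunk body."""
    return re.split(r"(?<=[.!?])\s+", body.strip(), maxsplit=1)[0]


draft = """## Is crawlable the same as indexable?
No. They are two separate gates. Crawlable means a bot may fetch the page.

## How do I check?
Fetch the page headers and robots.txt, then compare.
"""

chunks = split_chunks(draft)
for heading, body in chunks:
    print(f"{heading} -> {lead_sentence(body)}")
```

If a chunk's lead sentence doesn't answer its own heading, that chunk isn't answer-first yet.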

It's still SEO, not a magic file. Google is on record: optimizing for AI features is the same work as good SEO (be crawlable, indexable, helpful, sourced). There is no special llms.txt or AI-only markup that buys you citations.^[2] Retrieval shape is additive to lessons 0002–0003, not a replacement. A beautifully shaped passage on a noindex page still gets cited zero times.

The skill: lint a draft before you publish

You have tools/geo_lint.py in this workspace. Stdlib only. It scores text/markdown against the three GEO signals plus answer-first, chunking, and a stuffing check — turning “is this written for retrieval?” into a fast, repeatable loop you can drop into a publishing pipeline.

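# the three evidence-backed signals, as code
import re
from pathlib import Path

text  = Path("your-draft.md").read_text(encoding="utf-8")  # the draft under test
pcts  = re.findall(r"\d+%", text)              # STATISTICS: percentages such as 40%
links = re.findall(r"https?://\S+", text)      # CITE SOURCES: bare http(s) URLs
quote = re.findall(r'"[^"]{15,}"', text)       # QUOTATIONS: straight-quoted spans of 15+ characters
# + answer-first (short lead block) + chunked (headings) + anti-stuffing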
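The last comment line names three more checks without showing them. Here is one way they could look: a sketch, not the actual geo_lint.py source. The thresholds, the stopword list, and all function names are assumptions picked for illustration.

```python
import re
from collections import Counter

# Hypothetical thresholds and stopword list, chosen for illustration only.
MAX_LEAD_WORDS = 60
MAX_TERM_SHARE = 0.10
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "it", "on", "with"}


def lead_word_count(text: str) -> int:
    """Count words in the first paragraph, ignoring heading lines (ANSWER-FIRST)."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln for ln in block.splitlines() if not ln.lstrip().startswith("#")]
        words = " ".join(lines).split()
        if words:
            return len(words)
    return 0


def heading_count(text: str) -> int:
    """Count markdown ATX headings, a rough proxy for CHUNKED."""
    return len(re.findall(r"^#{1,6}[ \t]+\S", text, flags=re.MULTILINE))


def top_term_share(text: str) -> tuple[str, float]:
    """Return the most frequent content word and its share of content words (NO STUFFING)."""
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]
    if not words:
        return "", 0.0
    term, count = Counter(words).most_common(1)[0]
    return term, count / len(words)


sample = "SEO tips: seo tools make seo easy, so buy seo now."
term, share = top_term_share(sample)
print(term, f"{share:.0%}", "stuffed" if share > MAX_TERM_SHARE else "ok")
```

Like the three regexes, these only measure presence and shape: a short lead block can still fail to answer the question.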

Do this now and feel the contrast:

Self-check (offline): python3 tools/geo_lint.py --demo
Lint a real draft of yours: python3 tools/geo_lint.py your-draft.md
Fix one WARN — add a statistic with its source, or split a wall paragraph into a headed answer-first chunk — and re-run. Watch the score climb. That edit loop is the lesson.

$ python3 tools/geo_lint.py weak-draft.md

GEO / retrieval-readiness lint
──────────────────────────────────────────────
[WARN] STATISTICS    few/no numbers — add concrete stats
[WARN] CITE SOURCES  thin sourcing — back claims with references
[WARN] QUOTATIONS    no quotations — a verbatim quote is highly liftable
[PASS] ANSWER-FIRST  opens with a 36-word answer block
[WARN] CHUNKED       0 headings — break into self-contained chunks
[WARN] NO STUFFING   'seo' is 42% of content words (10x) — reads as stuffing
──────────────────────────────────────────────
VERDICT: 1/6 signals — weak — restructure before publishing.
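If you want the scorecard to gate a publishing pipeline, one option is to parse the text report. This sketch assumes the `[PASS]`/`[WARN]` line layout shown in the sample output stays stable; that layout is inferred from the sample, not from a documented format. The function names and the minimum of 4 passing signals are hypothetical.

```python
import re

# Assumes each signal line looks like "[PASS] NAME  detail" or "[WARN] NAME  detail",
# with at least two spaces between the name and the detail, as in the sample output.
LINE = re.compile(r"^\[(PASS|WARN)\]\s+(.+?)\s{2,}(.*)$")


def parse_scorecard(report: str) -> dict[str, bool]:
    """Map each signal name to True (PASS) or False (WARN)."""
    results: dict[str, bool] = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            status, name, _detail = match.groups()
            results[name] = status == "PASS"
    return results


def ready_to_publish(results: dict[str, bool], minimum: int = 4) -> bool:
    """Hypothetical gate: require at least `minimum` passing signals."""
    return sum(results.values()) >= minimum


report = """\
[WARN] STATISTICS    few/no numbers
[WARN] CITE SOURCES  thin sourcing
[PASS] ANSWER-FIRST  opens with a 36-word answer block
"""

results = parse_scorecard(report)
print(results, ready_to_publish(results))
```

Parsing human-readable output is brittle; treat it as a stopgap until you confirm whether the tool offers an exit code or structured output.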

Ceiling to know: these are presence heuristics, not a quality judge. A linter can confirm you cited a source, not that the source is good or the claim true. It scores shape; you still own the substance. A high score improves citation odds; it doesn’t guarantee them.

Retrieval practice · no peeking

Shape for the engine

Answer from memory — that effort is what makes it stick. One try each; pick before you read the others.

Question 1 / 4

What is the unit an answer engine actually retrieves and cites?