Lesson 0004 · Retrieve + Generate stages
Writing for Retrieval
The unit an answer engine cites is a self-contained <em>passage</em>, not a page. So shape the passage; there's a measured playbook for which edits actually work.
Recap from Lesson 0003: 0002 got the page through the crawl + index gates; 0003 labelled its facts with JSON-LD. Both make a page eligible. Today is about the prose itself: making it the thing an AI answer engine lifts and credits.
An answer engine doesn't "read your page" and rank it. It chunks the web into passages, embeds them, retrieves the few most relevant, and synthesizes one answer, citing the passages it used.[1] So your real unit of optimization is the passage: a chunk that answers a question completely, on its own, in liftable form.
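To make the chunk, embed, retrieve loop concrete, here is a minimal sketch. It is a toy under stated assumptions, not how any real engine works: passages are split on blank lines, the "embedding" is a plain bag-of-words count instead of a dense vector, and the helper names (`split_passages`, `embed`, `cosine`, `retrieve`) are made up for illustration.

```python
import math
import re
from collections import Counter


def split_passages(page: str) -> list[str]:
    """Chunk a page on blank lines (a stand-in for real passage chunking)."""
    return [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Real engines use dense vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]


page = """There are many things to consider and a lot of people get confused about it.

Is crawlable the same as indexable? No. Crawlable means a bot may fetch the page; indexable means the engine keeps it."""

top = retrieve("is crawlable the same as indexable", split_passages(page))
print(top[0])
```

Even in this toy, the rambling passage shares no terms with the question and scores zero, while the self-contained passage is retrieved whole and can be quoted without its neighbours.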
Run geo_lint.py on any draft and get a 6-signal scorecard (statistics, sources, quotations, answer-first, chunking, no keyword-stuffing), each one an edit the research says moves citation visibility.
Same facts, two shapes
The information is identical. Only the shape differs, and only one is easy to retrieve and quote in isolation.
There are many things to consider with indexing and a lot of people get confused about it, and honestly crawling is also part of the picture, so when you think about whether your page shows up you have to think holistically about all of these interacting factors over time…
No standalone answer. An engine can't quote a sentence here without the whole rambling context.
Is crawlable the same as indexable? No. They're two separate gates. Crawlable means a bot may fetch the page; indexable means the engine keeps it.
Per Google, blocking a URL in robots.txt doesn't remove it from the index: the bot never reads the noindex.
What actually moves citation: the measured playbook
The GEO paper (Aggarwal et al., KDD 2024) A/B-tested content edits against generative engines on a benchmark of real queries. Three edits stood out, with up to ~40% relative visibility lift.[1] They're not stylistic; they make a passage more quotable.
| Edit | Why it works | Measured |
|---|---|---|
| Cite sources | A sourced claim is safer for the engine to repeat and attribute. | top tier |
| Add quotations | A verbatim quote from an authority is a ready-made liftable unit. | top tier |
| Add statistics | A concrete number is more citable than a vague adjective. | top tier |
| Keyword stuffing | The old SEO reflex. The study found it didn't help and tended to hurt. | negative |
Layer extractability on top: answer-first (lead with the answer, then expand, as in the inverted pyramid), and chunked (headed, self-contained sections an engine can retrieve one at a time). The FAQPage markup from 0003 is this principle made structural: each Q&A pair is a pre-chunked answer.
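The two extractability checks can be sketched as simple presence heuristics. This is an illustration only, not the code inside geo_lint.py: the function names, the 60-word ceiling for a "short lead", and the choice to count markdown `#` headings are all assumptions.

```python
import re


def lead_block(markdown: str) -> str:
    """Return the first paragraph that is not a heading."""
    for para in re.split(r"\n\s*\n", markdown.strip()):
        if para.strip() and not para.lstrip().startswith("#"):
            return para.strip()
    return ""


def is_answer_first(markdown: str, max_words: int = 60) -> bool:
    """Assumed rule: the lead paragraph is short enough to lift whole."""
    return 0 < len(lead_block(markdown).split()) <= max_words


def heading_count(markdown: str) -> int:
    """Count markdown headings (# to ######), one per retrievable chunk."""
    return len(re.findall(r"^#{1,6}[ \t]+\S", markdown, flags=re.MULTILINE))


draft = """# Is crawlable the same as indexable?

No. They are two separate gates: crawlable means a bot may fetch the page; indexable means the engine keeps it.

## Why robots.txt is not noindex

Blocking a URL in robots.txt does not remove it from the index.
"""

wall = "word " * 200  # one long unheaded paragraph

print(is_answer_first(draft), heading_count(draft))
print(is_answer_first(wall), heading_count(wall))
```

A word-count ceiling is a crude proxy: it confirms the lead is short, not that it actually answers the question.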
There is no llms.txt or AI-only markup that buys you citations.[2] Retrieval shape is additive to lessons 0002-0003, not a replacement. A beautifully shaped passage on a noindex page still gets cited zero times.
The skill: lint a draft before you publish
You have tools/geo_lint.py in this workspace. Stdlib only. It scores text/markdown against the three GEO signals plus answer-first, chunking, and a stuffing check, turning "is this written for retrieval?" into a fast, repeatable loop you can drop into a publishing pipeline.
# the three evidence-backed signals, as code
# (text is the draft's contents as a string)
import re
pcts = re.findall(r"\d+%", text)           # STATISTICS: percentages such as 40%
links = re.findall(r"https?://\S+", text)  # CITE SOURCES: http(s) URLs
quote = re.findall(r'"[^"]{15,}"', text)   # QUOTATIONS: straight-quoted spans of 15+ characters
# + answer-first (short lead block) + chunked (headings) + anti-stuffing
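To see those three patterns behave on real strings, here is a small runnable sketch. The regexes are the ones quoted above; the wrapper function `evidence_signals`, the sample drafts, and the example.com URL are hypothetical, and the linter's own pass/fail thresholds are not shown here.

```python
import re


def evidence_signals(text: str) -> dict[str, int]:
    """Count matches for the three evidence patterns quoted in the lesson."""
    return {
        "statistics": len(re.findall(r"\d+%", text)),
        "sources": len(re.findall(r"https?://\S+", text)),
        "quotations": len(re.findall(r'"[^"]{15,}"', text)),
    }


weak = "Indexing matters a lot and many people find it confusing."

strong = (
    "Up to 40% relative visibility lift was measured. "
    "See https://example.com/geo-paper for the numbers. "  # placeholder URL
    'The lesson says "crawlable means a bot may fetch the page".'
)

print(evidence_signals(weak))
print(evidence_signals(strong))
```

Note what the patterns cannot see: a statistic written without a percent sign, a source named without a URL, or a quotation set in curly quotes would all count as zero under these exact regexes.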
Run the linter and feel the contrast:
- Self-check (offline): python3 tools/geo_lint.py --demo
- Lint a real draft of yours: python3 tools/geo_lint.py your-draft.md
- Fix one WARN (add a statistic with its source, or split a wall paragraph into a headed answer-first chunk) and re-run. Watch the score climb. That edit loop is the lesson.
$ python3 tools/geo_lint.py weak-draft.md
GEO / retrieval-readiness lint
──────────────────────────────────────────────
[WARN] STATISTICS    few/no numbers - add concrete stats
[WARN] CITE SOURCES  thin sourcing - back claims with references
[WARN] QUOTATIONS    no quotations - a verbatim quote is highly liftable
[PASS] ANSWER-FIRST  opens with a 36-word answer block
[WARN] CHUNKED       0 headings - break into self-contained chunks
[WARN] NO STUFFING   'seo' is 42% of content words (10x) - reads as stuffing
──────────────────────────────────────────────
VERDICT: 1/6 signals - weak - restructure before publishing.
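The NO STUFFING line reports one term's share of the content words. A check in that spirit might look like the sketch below. It is a guess at the idea, not geo_lint.py's implementation: the stopword list, the 15% share ceiling, the minimum count of 3, and the function names are all assumptions.

```python
import re
from collections import Counter

# Assumed stopword list; a real linter would likely use a longer one.
STOPWORDS = {
    "a", "an", "and", "are", "as", "be", "for", "in", "is", "it",
    "of", "on", "or", "that", "the", "this", "to", "with", "you", "your",
}


def top_term_share(text: str) -> tuple[str, float, int]:
    """Return (most frequent content word, its share of content words, its count)."""
    words = [w for w in re.findall(r"[a-z][a-z'-]*", text.lower()) if w not in STOPWORDS]
    if not words:
        return ("", 0.0, 0)
    term, count = Counter(words).most_common(1)[0]
    return (term, count / len(words), count)


def looks_stuffed(text: str, max_share: float = 0.15, min_count: int = 3) -> bool:
    """Flag a draft when one content word dominates (thresholds are assumptions)."""
    _, share, count = top_term_share(text)
    return share > max_share and count >= min_count


stuffed = "SEO tips for SEO experts: best SEO, cheap SEO, SEO tools, SEO now."
natural = "Crawlable means a bot may fetch the page; indexable means the engine keeps it."

print(top_term_share(stuffed), looks_stuffed(stuffed))
print(top_term_share(natural), looks_stuffed(natural))
```

The minimum-count guard matters on short passages, where a word used twice can exceed any share ceiling without reading as stuffing.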
Ceiling to know: these are presence heuristics, not a quality judge. A linter can confirm you cited a source, not that the source is good or the claim true. It scores shape; you still own the substance. And a high score helps citation odds; it doesn't guarantee them.
Retrieval practice · no peeking
Shape for the engine
Answer from memory: that effort is what makes it stick. One try each; pick before you read the others.
Read the abstract + the methods table: it's where the three signals and the ~40% number come from, and the neutral baseline beneath the vendor noise. Pair with Google's <a href="https://developers.google.com/search/docs/fundamentals/ai-optimization-guide">"Optimizing for generative AI features"</a> (the "it's still SEO" doc).