How AI Reads Your Website
Search engines are no longer the only machines reading your pages. Here's how AI answer engines consume a site — and how to make sure yours is legible.
For twenty years, “being found online” meant one thing: ranking on Google. That’s no longer the whole game. A growing share of people get their answers from AI — Google’s own AI Overviews, ChatGPT, Perplexity, Claude — without ever clicking a blue link. If those systems can’t read and trust your site, you’re invisible to them, no matter how well you rank.
This is the final part of a three-part series. It assumes the groundwork from Semantic HTML Is a Superpower and Technical SEO for Developers — because, happily, most of what makes a site legible to AI is the same work that makes it good for everything else.
How answer engines actually work
Most AI search features run on retrieval-augmented generation (RAG): the engine retrieves relevant content from an index or fetches it live, then a language model synthesizes that material into an answer — often with citations. Two kinds of crawlers feed this:
- Training crawlers scrape the web to build the corpora that future models learn from.
- Answer / citation crawlers fetch content now to decide whether you get cited in a live answer.
The second kind is the one that affects today’s traffic. And here’s the unsettling part: ranking #1 on Google no longer guarantees you appear in AI answers. Industry analyses suggest the overlap between top organic links and AI-cited sources has fallen sharply, from roughly 70% to under 20% by some measures. AI visibility has become its own discipline, sometimes called GEO (Generative Engine Optimization) or AEO (Answer Engine Optimization). The term GEO entered the literature with the 2023 research paper “GEO: Generative Engine Optimization”, which found that adding citations, quotations, and statistics measurably increased a source’s visibility in generated answers.
Appearing in AI answers and ranking on page one are now two different games. You have to play both.
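To make the retrieve-then-synthesize loop concrete, here is a minimal Python sketch. It is a toy under loud assumptions: word overlap stands in for a real search index or embedding model, the example.com pages and the Page, retrieve, and build_prompt names are hypothetical, and no language model is actually called. It only illustrates why a page has to be retrievable and readable before it can be cited.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for a real index or embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, pages: list[Page], k: int = 2) -> list[Page]:
    """Rank pages by word overlap with the query; keep the top k that overlap at all."""
    q = tokenize(query)
    scored = [(len(q & tokenize(p.text)), p) for p in pages]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query: str, sources: list[Page]) -> str:
    """Assemble the text a language model would be asked to synthesize from."""
    cited = "\n".join(f"[{i}] {p.url}: {p.text}" for i, p in enumerate(sources, 1))
    return (
        "Answer using only these sources and cite them by number.\n"
        f"{cited}\n\nQuestion: {query}"
    )


# Hypothetical pages, for illustration only.
pages = [
    Page("https://example.com/pricing", "Our plans start at 10 dollars per month."),
    Page("https://example.com/about", "We are a small team building invoicing software."),
    Page("https://example.com/blog", "Ten tips for faster invoicing."),
]
question = "how much do plans cost per month"
sources = retrieve(question, pages)
prompt = build_prompt(question, sources)
print(prompt)
```

With these sample pages, only the pricing page shares words with the question, so it is the only source that reaches the prompt. A page the retriever cannot read or match never gets the chance to be cited.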
Know your crawlers
You can’t manage what you can’t name. The major AI crawlers identify themselves by user-agent, and you control their access in robots.txt exactly like any other bot:
- GPTBot: OpenAI’s training crawler. OAI-SearchBot powers ChatGPT search; ChatGPT-User fetches a page live when a user’s prompt references it.
- ClaudeBot: Anthropic’s crawler.
- PerplexityBot: Perplexity’s indexer.
- Google-Extended: a robots.txt token that controls whether your content trains Gemini and feeds AI experiences, separately from normal Google Search crawling.
# robots.txt — allow the answer crawlers you want citing you
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
The strategic point: blocking these is a real choice with real consequences. Block the training crawlers and you keep your content out of future training sets; block the answer crawlers too and you also opt out of being cited. Decide deliberately; don’t let a copy-pasted Disallow: / make the call for you. (OpenAI documents its crawlers in its bots and crawlers reference; Google documents Google-Extended in its crawlers overview. Both are listed under Further reading.)
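One way to see what a given robots.txt actually decides is to parse it and ask, crawler by crawler. The sketch below uses Python’s standard urllib.robotparser. The sample rules, the example.com URL, and the crawler_access helper are assumptions for illustration, and the standard-library parser uses simple prefix matching, so treat the result as a sanity check rather than a guarantee of how each vendor interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow the answer crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]


def crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per crawler user-agent, whether these rules allow fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


access = crawler_access(ROBOTS_TXT, "https://example.com/pricing")
for agent, allowed in access.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

In this sample, GPTBot is blocked and every other crawler is allowed, including the ones the file never mentions: with no matching group and no wildcard group, the parser defaults to allowing the fetch.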
Making your site legible to AI
The good news: there’s no secret markup. Google has been explicit that no special structured data is required to appear in AI experiences — the fundamentals are the strategy. In rough order of leverage:
- Render content server-side. If your words only exist after JavaScript runs, you’re betting an AI crawler executes it — many don’t reliably. Ship real HTML. (A static site with no client-side rendering step is ideal here, but server rendering of any kind works.)
- Don’t gate the content. Paywalls, login walls, and aggressive bot-blocking make you unreadable. Check that your robots.txt doesn’t block the AI crawlers you actually want citing you.
- Use semantic HTML. A model extracting your content leans on the same landmarks and heading hierarchy a screen reader does. Clean structure is clean signal.
- Ship structured data. The same JSON-LD that earns rich snippets helps AI verify entities: who you are, what you’re an authority on. The knowsAbout property on your Organization schema is a direct, machine-readable claim of expertise.
- Write extractable answers. GEO rewards content that’s easy to lift and recombine: clear questions answered directly, short summaries near the top, real headings, lists, and tables. Bury the answer under 800 words of preamble and a model will skip you for a source that gets to the point.
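As an illustration of the structured-data item, here is a small Python sketch that builds an Organization object with knowsAbout and wraps it in the script tag that belongs in the page’s head. The organization name, URL, and topics are hypothetical, and json_ld_script is a helper invented for this example, not part of any library.

```python
import json

# Hypothetical organization details, for illustration only.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Studio",
    "url": "https://example.com",
    "knowsAbout": ["Semantic HTML", "Technical SEO", "Web accessibility"],
}


def json_ld_script(data: dict) -> str:
    """Wrap a schema.org object in a JSON-LD script element."""
    # Escape "</" so a string value can never close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


snippet = json_ld_script(organization)
print(snippet)
```

Generating the tag at build or render time keeps it in the server-rendered HTML, which matters for the same reason as the first item in the list: a crawler that never runs JavaScript still sees it.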
Authority is a signal models actually use
RAG engines don’t just retrieve content — they weigh how trustworthy it looks before repeating it. That maps closely to the E-E-A-T framework Google uses for human-quality assessment: experience, expertise, authoritativeness, trustworthiness. In practice, that means named authors with real credentials, citations to primary sources, specific facts and figures over vague claims, and content that’s kept current. The original GEO research found these exact moves — adding statistics, quotations, and cited sources — raised visibility in generated answers. It’s not a trick; it’s that the things which make content credible to a careful human also make it citable to a model.
A word on llms.txt
You’ll hear about llms.txt — and it’s widely misunderstood, so let’s be precise. Proposed by Jeremy Howard in September 2024, it is not a robots.txt-style permissions file. It’s a curated markdown file at /llms.txt that gives an LLM a concise, expert-level map of your most important content — an H1, a summary blockquote, and lists of links to clean markdown versions of key pages — sized for a model’s context window at inference time. (The companion convention: serve a .md version of each page at the same path plus .md.)
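To show the shape of the file, here is a minimal Python sketch that renders one. It follows the layout described above (an H1, a summary blockquote, link lists) and groups the links under H2 headings with an optional note after each link, which is how the proposal structures its file lists as far as I understand it. The site name, URLs, and the build_llms_txt helper are hypothetical.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body: H1, blockquote summary, then H2 sections of links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site and URLs, for illustration only.
llms_txt = build_llms_txt(
    title="Example Studio",
    summary="A small studio writing about semantic HTML and technical SEO.",
    sections={
        "Guides": [
            ("Semantic HTML", "https://example.com/semantic-html.md", "Landmarks and heading hierarchy"),
            ("Technical SEO", "https://example.com/technical-seo.md", "Crawling, rendering, structured data"),
        ],
    },
)
print(llms_txt)
```

The links point at .md paths to match the companion convention of serving a markdown version of each page.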
It’s a genuinely nice idea. But be honest about where it stands in 2026:
- Adoption sits around 10% of sites, and it remains a community proposal, not a ratified standard.
- The major AI crawlers don’t request it in meaningful volume.
- A large 2025 study by SE Ranking (roughly 300,000 domains, covered by Search Engine Journal) found no measurable relationship between having an llms.txt and how often a domain gets cited in AI answers.
So: low cost, low risk, unproven upside. Worth a few minutes if you enjoy being early; not worth losing sleep over. Spend your effort on the fundamentals above — they help humans, search engines, and AI, which llms.txt alone does not.
The AI-readiness checklist
- Important content is in the server-rendered HTML, not JS-only.
- No paywall/login wall on content you want cited; the AI crawlers you want aren’t blocked in robots.txt.
- Semantic HTML gives the page a clean, extractable structure.
- Structured data, including knowsAbout, declares your entities and expertise.
- Clear authorship and cited, specific, current facts.
- Answers are direct, near the top, and skimmable — headings, lists, tables.
- (Optional) An llms.txt index, with realistic expectations.
Notice how little of that is AI-specific. The throughline of this whole series is that there’s no separate “AI website” to build. A site that’s structured for meaning, fast and crawlable, and explicit about what it is — that’s a site humans, Google, and AI can all read. Build it once, build it right, and you’re legible to whatever reads the web next.
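A few of the checklist items can be spot-checked mechanically. The sketch below uses Python’s standard html.parser to look at raw HTML the way a crawler that runs no JavaScript would: is the key phrase in the markup, is there a main landmark and an h1, is there a JSON-LD block. The two sample pages, the ReadinessAudit class, and the audit function are all hypothetical, and passing these checks says nothing about authorship, accuracy, or how any particular engine ranks you.

```python
from html.parser import HTMLParser


class ReadinessAudit(HTMLParser):
    """Collect a few signals from raw HTML, with no JavaScript executed."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: set[str] = set()
        self.has_json_ld = False
        self.text_parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)
        if tag in ("script", "style"):
            self._skip_depth += 1
            if dict(attrs).get("type") == "application/ld+json":
                self.has_json_ld = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not script or style content.
        if not self._skip_depth:
            self.text_parts.append(data)


def audit(html: str, key_phrase: str) -> dict[str, bool]:
    parser = ReadinessAudit()
    parser.feed(html)
    parser.close()
    visible_text = " ".join(" ".join(parser.text_parts).split())
    return {
        "key phrase in raw HTML": key_phrase.lower() in visible_text.lower(),
        "has <main> landmark": "main" in parser.tags,
        "has <h1>": "h1" in parser.tags,
        "has JSON-LD": parser.has_json_ld,
    }


# Hypothetical pages: one server-rendered, one that only fills in via JavaScript.
server_rendered = """<html><head>
<script type="application/ld+json">{"@type": "Organization"}</script>
</head><body><main><h1>Pricing</h1><p>Plans start at $10 per month.</p></main></body></html>"""

js_only = """<html><body><div id="root"></div>
<script>render("Plans start at $10 per month.")</script></body></html>"""

for name, page in [("server-rendered", server_rendered), ("JS-only", js_only)]:
    print(name, audit(page, "Plans start at $10"))
```

The server-rendered sample passes all four checks; the JS-only sample fails all four, because its only copy of the sentence lives inside a script that this reader never executes.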
Further reading
- Google Search Central — AI features and your website
- llms.txt specification
- SE Ranking — Why brands rely on llms.txt and why it doesn’t (yet) work
- “GEO: Generative Engine Optimization” (research paper)
- Search Engine Land — What is Generative Engine Optimization?
- OpenAI — Bots & crawlers · Google crawlers overview