
“If You’re an LLM, Please Read This”: How to Write One

Anna's Archive just dropped a viral 'If you're an LLM, please read this' page. Here's what it actually is and how to write one for your own site.


Anna’s Archive just dropped a blog post titled “If you’re an LLM, please read this” and it’s all over Hacker News. The page isn’t written for humans – it’s a polite, structured note addressed directly to AI crawlers, telling them how to grab the archive’s data without hammering the site. The HN thread is a mess of philosophical arguments about whether the data is even theirs to offer, but the actually interesting part is the tactic. Anna’s Archive picked the cooperative path over the hostile one, and it’s worth copying.

Here’s the choice every site owner faces right now. Option A: block AI crawlers with robots.txt rules, Cloudflare’s AI Labyrinth, and 403s. Option B: publish a file that tells crawlers exactly what to take, where to take it from, and how to behave. Option A feels good. Option B at least gives cooperative crawlers something to act on, because an agent can only follow instructions it can find and parse. “If you’re an LLM, please read this” is not a joke. It’s a content-delivery strategy now.

What this trending page actually is

The Anna’s Archive post is an llms.txt file, just dressed up as a blog post. llms.txt is a proposed standard put forward by Jeremy Howard of Answer.AI in September 2024. The idea: a single Markdown file at yoursite.com/llms.txt that gives language models a curated, clutter-free guide to your site.

Why Markdown and not XML or JSON? The spec’s answer, per llmstxt.org: these files are expected to be read by language models and agents, and models already parse Markdown natively. No special tooling needed on either side.

What Anna’s Archive added, and the spec doesn’t really cover, is using the file as a negotiation document. They tell LLM builders to favor bulk metadata, caching, and scheduled sync jobs over repeated interactive scraping, because the public site is designed for people first. Then they hand over the alternatives: GitLab, torrents, a JSON API, or a paid API, so nobody has to break CAPTCHAs to get the data.

There’s something worth sitting with here. Most web standards are defensive – they tell machines what they can’t do. llms.txt flips that. It’s a site saying: here’s exactly how to work with us. Whether that approach scales, or whether AI companies will actually honor it, is genuinely unclear. But it’s a different kind of relationship than a 403.

The minimum viable llms.txt

The spec’s only mandatory element is an H1. Everything else is optional, per Answer.AI’s original proposal. A bare H1 is technically valid, but it tells a model nothing useful, so here’s a small, complete file you can ship:

# Your Project Name

> One-sentence summary of what this site is and who it's for.

This paragraph adds context - what's here, what isn't, and any quirks an LLM should know before crawling.

## Core docs
- [Quickstart](https://yoursite.com/quickstart.md): Get running in 5 minutes
- [API reference](https://yoursite.com/api.md): Full endpoint list

## Optional
- [Changelog](https://yoursite.com/changelog.md): Version history

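Full spec rules, in order: an H1 with the project name (the only required element), a blockquote with a short summary containing key information, optional free-form Markdown (paragraphs or lists, but no headings) for extra context, then zero or more sections introduced by H2 headings, each holding a list of links. A section titled ## Optional has a special meaning: its links are lower priority, and a model can skip them when it needs a shorter context.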
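To make those rules concrete, here is a minimal sketch of a structural check in Python. It is not an official validator: the function name `parse_llms_txt` and the returned dictionary shape are illustrative assumptions, and it only covers the H1, the blockquote summary, and H2 link lists described above.

```python
import re

# Matches "- [Title](url): optional note" list items.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Parse an llms.txt string; raise ValueError if the required H1 is missing.

    Hypothetical helper for illustration, not part of any published tooling.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("# "):
        raise ValueError("llms.txt must start with an H1 (# Project Name)")

    result = {"title": lines[0][2:].strip(), "summary": None, "sections": {}}
    current = None  # name of the H2 section we are inside, if any
    for ln in lines[1:]:
        if ln.startswith("## "):
            current = ln[3:].strip()
            result["sections"][current] = []
        elif ln.startswith("> ") and current is None and result["summary"] is None:
            result["summary"] = ln[2:].strip()
        elif current is not None:
            m = LINK_RE.match(ln)
            if m:
                result["sections"][current].append(
                    (m["title"], m["url"], m["note"] or "")
                )
    return result
```

Running it over the sample file above should yield the title, the summary, and two sections named Core docs and Optional. Free-form text before the first H2 is accepted and ignored.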

Writing it like Anna’s Archive did

Most tutorials treat llms.txt as a documentation index. That’s underselling it. Treat the file as a memo to a non-human reader who has the power to send you traffic, scrape you into oblivion, or quote you in a chat answer. Be direct.

  1. State who you are in one sentence. Not marketing copy. “We’re a non-profit shadow library” is more useful to a model than “empowering knowledge access globally.”
  2. Tell the model what it should and shouldn’t do. Want bulk downloads instead of page scrapes? Say so. Want attribution when quoted? Say so. These aren’t legally binding – but well-behaved agents follow them.
  3. List the actual machine-friendly endpoints. APIs, data dumps, torrents, sitemaps. A link to data.json beats a link to a styled HTML page every time.
  4. Mention preferred refresh behavior. “Sync weekly” tells a cooperative crawler to back off between runs.
  5. Add an Optional section for stuff that’s useful context but shouldn’t clutter the model’s primary pass through your content.
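The five steps above can be sketched as a small generator that keeps the memo in version control as data and renders the file from it. Everything here is an assumption for illustration: the function name `render_llms_txt`, the example.org URLs, and the wording of the guidance lines are placeholders, not anything Anna’s Archive or the spec prescribes.

```python
def render_llms_txt(name, summary, guidance, sections):
    """Render an llms.txt string.

    guidance: list of plain-language requests (what to do, refresh cadence).
    sections: dict mapping an H2 title to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    # Behavioral requests go before the first H2, as a plain Markdown list.
    for rule in guidance:
        lines.append(f"- {rule}")
    if guidance:
        lines.append("")
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site used only to show the shape of the output.
example = render_llms_txt(
    name="Example Library",
    summary="A non-profit catalog of public-domain books.",
    guidance=[
        "Prefer the bulk dump over scraping HTML pages.",
        "Sync weekly; the dump is rebuilt once a week.",
        "Please attribute quotes to Example Library.",
    ],
    sections={
        "Data": [
            ("Metadata dump", "https://example.org/data.json", "Full catalog as JSON"),
        ],
        "Optional": [
            ("About", "https://example.org/about.md", "Project history"),
        ],
    },
)
print(example)
```

Keeping the guidance as short imperative lines makes it easy to review in a diff, which matters once you treat the file as something machines may act on.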

There is also a second file. Mintlify and Anthropic co-developed /llms-full.txt alongside the original proposal: it compiles all of your site’s text into one Markdown file, so a user can paste a single URL to load your entire docs into an AI tool’s context window. That structure was subsequently folded into the official llms.txt proposal. If you publish only one of the two files, you’re shipping half of what the proposal now covers.
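If your docs already live as Markdown files, building /llms-full.txt can be a simple concatenation. The sketch below assumes a local folder of .md files and uses a hypothetical `build_llms_full` function; the "Source:" separator line is my own convention, since the proposal does not dictate how pages are delimited.

```python
from pathlib import Path


def build_llms_full(docs_dir: Path, title: str) -> str:
    """Concatenate every .md file under docs_dir into one Markdown string.

    Files are visited in sorted path order so the output is stable between runs.
    """
    parts = [f"# {title}", ""]
    for path in sorted(docs_dir.rglob("*.md")):
        rel = path.relative_to(docs_dir).as_posix()
        body = path.read_text(encoding="utf-8").strip()
        # Assumed convention: a rule plus a Source line marks each page boundary.
        parts.extend(["---", f"Source: {rel}", "", body, ""])
    return "\n".join(parts)
```

For example, `Path("llms-full.txt").write_text(build_llms_full(Path("docs"), "Acme"), encoding="utf-8")` would write the combined file next to your docs folder. Check the result’s size before publishing: a very large file may not fit in a model’s context window.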

The skepticism nobody links to

Before you spend an afternoon on this, hear the counter-argument. Google’s John Mueller said publicly – and the quote has circulated widely in SEO communities – “AFAIK, none of the AI services have said they’re using LLMs.TXT (and you can tell when you look at your server logs that they don’t even check for it)”. That’s a real problem with the whole movement. The file exists. Adoption on the publishing side is real. Adoption on the consuming side is murky.

Pro tip: After publishing, grep your access logs for requests to /llms.txt. If you see hits from known AI user-agents (GPTBot, ClaudeBot, PerplexityBot), the file is being fetched. Don’t look for Google-Extended: it is a robots.txt control token, not a user-agent string that shows up in request logs. If you see zero hits after a month, you’ve still done useful work, because your llms-full.txt is now a single URL you can paste into ChatGPT or Claude yourself.
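Here is a rough sketch of that log check in Python, assuming your server writes the common "combined" access log format (request line, status, size, referrer, user-agent, each in that order). The function name `count_llms_txt_hits` is hypothetical, and the bot list only covers the crawler names mentioned in the tip.

```python
import re
from collections import Counter

# Assumes combined log format: "METHOD path proto" status size "referrer" "user-agent"
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
# Substrings to look for in the user-agent header (request-time crawler names only).
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_llms_txt_hits(log_lines):
    """Count requests for /llms.txt and /llms-full.txt, grouped by AI crawler."""
    hits = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if not m:
            continue  # line is not in the assumed format
        path = m["path"].split("?", 1)[0]
        if path not in ("/llms.txt", "/llms-full.txt"):
            continue
        ua = m["ua"].lower()
        bot = next((b for b in AI_BOTS if b.lower() in ua), "other")
        hits[bot] += 1
    return hits
```

You could feed it a log with `count_llms_txt_hits(open("access.log", encoding="utf-8", errors="replace"))`, where access.log stands in for wherever your server writes its logs. User-agent strings can be spoofed, so treat the counts as a hint rather than proof.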

Thousands of sites serve /llms.txt as of early 2026, and companies including Anthropic, Cloudflare, and Vercel have adopted the format, according to Mintlify data from March 2026. Even so, no major AI provider has publicly confirmed they’re ingesting it at crawl time. The file is cheap to publish and immediately useful as a context dump for humans pasting into chatbots. That alone justifies the 15 minutes it takes.

The pitfall nobody mentions: your llms.txt is an attack surface

The whole reason llms.txt works is that models treat the file as instructions. That’s also the mechanism behind indirect prompt injection.

A recent arXiv empirical study (2604.27202) ran 5,200 trials across 13 models and found that prompt injections are predominantly hidden, machine-targeted, and placed in early ingestion channels – with crawlers and data scrapers as the dominant target class. Your llms.txt sits in exactly that position: early, authoritative, machine-read.

Two attack paths follow from this. If your llms.txt is auto-generated from user content – a comments section, a wiki, a reviews page – an attacker can inject instructions into it directly. If it’s hand-written, attackers can still try to shadow it by injecting hostile instructions into pages it links to. Per Palo Alto’s Unit 42 research on indirect prompt injection: adversaries embed hidden instructions within web content that is later ingested by an LLM, which then acts on those instructions as if they were legitimate. Treat the file like a Content Security Policy header – review it, version it, and don’t let it get auto-generated from untrusted sources.
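As part of that review step, a pre-publish lint can catch the crudest injection attempts. The sketch below is a heuristic only: the patterns and the `find_suspicious_lines` name are my own assumptions, it will miss anything phrased differently, and it is no substitute for keeping untrusted content out of the file in the first place.

```python
import re

# Illustrative patterns only; real attacks will not limit themselves to these.
SUSPICIOUS = [
    (
        re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
        "override phrase",
    ),
    (re.compile(r"<!--.*?-->"), "HTML comment (invisible when rendered)"),
    (re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]"), "zero-width character"),
]


def find_suspicious_lines(text):
    """Return (line_number, reason) pairs for lines matching a known-bad pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, reason in SUSPICIOUS:
            if pattern.search(line):
                findings.append((lineno, reason))
    return findings
```

Wiring this into a CI job that fails when the list is non-empty gives you a cheap gate on every change to llms.txt. Pages the file links to need their own review, since this check never follows links.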

llms.txt vs the alternatives

Most posts pretend the choice is obvious. It isn’t. Here’s the honest comparison:

| Approach | What it does | Honest verdict |
| --- | --- | --- |
| robots.txt with AI blocks | Tells crawlers what they can’t fetch | Defensive only. Doesn’t shape behavior of crawlers that ignore it. |
| llms.txt | Tells crawlers what they should fetch and how | Cooperative. Useful even if zero crawlers read it – humans paste it into chats. |
| RAG endpoints / public API | Forces crawlers through a controlled, rate-limited interface | Strongest control, highest engineering cost. |
| Cloudflare AI blocking | Hard 403 on known AI user-agents | Loud signal, easily bypassed by rotating user-agent strings. |

Anna’s Archive combines three things at once: llms.txt-style messaging, public APIs, and bulk data dumps via torrents and GitLab. Only the first two appear in the table, and the dumps are what make the request to stop scraping realistic. That’s the actual lesson from their viral post. The file isn’t the strategy. It’s the README for a strategy.

FAQ

Do any LLMs actually read llms.txt right now?

No major provider has confirmed it publicly, as of mid-2026. Publish it anyway – it takes 15 minutes and doubles as a context file you can paste into any chatbot yourself.

Where do I put it and what should I name the H1?

Put it at the root path: https://yourdomain.com/llms.txt. The proposal also allows a subpath, but the root is the default location it names, so don’t rely on a subfolder or subdomain alone. The H1 should be your project name exactly as you want a model to refer to it. If your site is “Acme Docs” but you want Claude to call you “Acme,” write # Acme. Avoid putting taglines or superlatives in the H1: “Acme – The World’s #1 Platform” may get echoed back verbatim in AI-generated answers about you, which looks worse than just your name.

What’s the difference between llms.txt and llms-full.txt?

The first is a curated navigation file – links to important Markdown pages, roughly equivalent to a sitemap for LLMs. The second is the kitchen sink: every page on your site concatenated into one Markdown file, ready to paste into a context window. If you only have time for one, ship llms.txt – it’s the canonical file in the spec. But the two work better together, especially if your documentation is the main thing people ask AI tools about.

Next action: Open a terminal, run curl https://annas-archive.org/llms.txt (or visit the URL in a browser), read the actual file, then draft your own version in a text editor in the next 15 minutes. Publish it at /llms.txt before you close this tab.