I fed GPT-4o a complete OpenAPI spec for a payment API – 142 endpoints, authentication flows, error codes, the works. It started strong. Endpoint descriptions rolled out. Parameter tables looked clean. Example requests: syntactically correct.
Then it stopped. Mid-sentence. The last endpoint description just… ended.
Not a crash. Not an error message. Just silence where 80 more endpoints should’ve been documented.
The Token Limit Nobody Mentions
GPT-4o has a hard output cap at 16,384 tokens. Claude Sonnet? 4,096 tokens for the 3.5 version, 8,192 for Sonnet 4 (as of early 2026, per OpenAI’s model docs and Anthropic’s documentation). Nobody puts this in the tutorial headlines.
16K tokens sounds like a lot. A single complete API endpoint – description, parameters, request body schema, response examples, error codes – eats 300-500 tokens. Do the math: maybe 30-50 endpoints before the model hits the ceiling. Your docs get guillotined.
The AI doesn’t warn you. It just stops generating. You get half a doc, and if you’re not counting tokens manually, you won’t know why.
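You can sanity-check that budget before sending anything. A back-of-envelope sketch using the common four-characters-per-token approximation (a real tokenizer like tiktoken gives exact counts; the 400-tokens-per-endpoint figure is the midpoint of the range above):

```python
# Rough token budget: ~4 characters per token is a common approximation
# for English text; use a real tokenizer (e.g. tiktoken) for exact counts.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def max_endpoints(output_cap: int, tokens_per_endpoint: int = 400) -> int:
    """How many fully documented endpoints fit under a model's output cap?"""
    return output_cap // tokens_per_endpoint

print(max_endpoints(16_384))  # 16K cap: ~40 endpoints at 400 tokens each
print(max_endpoints(8_192))   # 8K cap: ~20 endpoints
```

Run those numbers against your endpoint count before you prompt, not after the output truncates.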
What Actually Happens When You Point AI at Your API
You’re documenting a REST API with 80 endpoints. Standard stuff: authentication, user management, data retrieval, webhooks. You’ve got an OpenAPI 3.0 spec file sitting in your repo – maybe Swagger generated it, maybe your backend team maintains it manually.
Upload the spec. Prompt the model (“Generate complete API documentation covering all endpoints, authentication, and error handling”). Wait for output. Copy-paste into your docs platform.
Except.
The Spec Goes In, But Not Everything Comes Out
LLMs recall the beginning and end of long inputs clearly. Middle sections? That’s where things get lost. Researchers call it the “lost in the middle” problem – a model’s effective attention degrades across the middle of massive context windows.
Your authentication section? Documented perfectly – it’s at the top. Webhook configuration buried on page 87? The AI might hallucinate parameter names. Skip entire optional fields. Not because the model is bad. Because 200,000+ tokens of spec is too much to track accurately.
| Model | Context Window | Output Limit | Real-World Capacity |
|---|---|---|---|
| GPT-4o | 128,000 tokens | 16,384 tokens | ~30-50 detailed endpoints |
| Claude Sonnet 3.5 | 200,000 tokens | 4,096 tokens | ~15-25 detailed endpoints |
| Claude Sonnet 4 | 200,000 tokens (1M in beta) | 8,192 tokens | ~25-40 detailed endpoints |
The workflow isn’t “upload spec, get docs.” It’s: upload spec → get partial docs → manually verify every endpoint → regenerate missing sections in separate prompts → stitch everything together → verify again.
The Hallucination Tax
AI doesn’t just miss things. It invents them.
A study in npj Digital Medicine found a 1.47% hallucination rate in LLM-generated clinical documentation. Apply that rate to a 100-endpoint API and you should expect at least one fabricated parameter name. One non-existent status code. One imaginary authentication method.
I’ve seen it. The model documented a force_refresh query parameter that didn’t exist in the spec. Plausible name. Logical use case. Completely fictional. A developer relying on that doc would’ve wasted an hour debugging why their API calls kept failing.
Pro tip: Run a diff between your OpenAPI spec and AI-generated docs using a script. Extract every parameter name and endpoint from both. Compare them programmatically. Catches hallucinations that manual review misses because they *sound* correct.
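Here’s a minimal version of that diff script. It assumes a JSON-format spec already loaded into a dict (`json.load` your real file) and Markdown docs with pipe-table parameter rows – adjust the extraction regex to your own doc layout:

```python
import re

def spec_params(spec: dict) -> set:
    """Every parameter name declared in an OpenAPI spec dict."""
    names = set()
    for methods in spec.get("paths", {}).values():
        for op in methods.values():
            if isinstance(op, dict):
                for p in op.get("parameters", []):
                    names.add(p["name"])
    return names

def doc_params(markdown: str) -> set:
    """First column of Markdown pipe tables (a heuristic; tune the regex)."""
    rows = re.findall(r"^\|\s*`?(\w+)`?\s*\|", markdown, flags=re.M)
    return {r for r in rows if r.lower() not in {"name", "parameter"}}

def hallucinated(spec: dict, markdown: str) -> set:
    """Parameters the docs mention that the spec never declares."""
    return doc_params(markdown) - spec_params(spec)

spec = {"paths": {"/users": {"get": {"parameters": [{"name": "limit", "in": "query"}]}}}}
docs = "| Name | Type |\n|---|---|\n| limit | int |\n| force_refresh | bool |\n"
print(hallucinated(spec, docs))  # {'force_refresh'}
```

That `force_refresh` case from above is exactly what this catches: plausible name, present in the docs, absent from the spec.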
Why does this happen? LLMs are pattern matchers, not truth engines. Gap in the spec – maybe your backend team forgot to document a deprecated field – the model fills it with statistically likely tokens. “Statistically likely” ≠ “factually accurate.”
A Workflow That Actually Works
After breaking this process about a dozen times, here’s what holds up in production use.
Step 1: Chunk Your Spec Before You Feed It
Split it by resource or domain. Don’t dump the entire OpenAPI file into the prompt.
# Split by path prefix (escape the slashes; match your file's indentation –
# an awk range also includes the line matching the end pattern)
awk '/^  \/users/,/^  \/orders/' openapi.yaml > users-spec.yaml
awk '/^  \/orders/,/^  \/payments/' openapi.yaml > orders-spec.yaml
Now: 15-20 endpoints per prompt instead of 100+. The model stays focused. Output limits become manageable. The “lost in the middle” problem shrinks.
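If awk feels brittle – YAML indentation varies between generators – a structural split is safer. A sketch assuming your spec is loaded as a dict (`json.load` for JSON, PyYAML’s `safe_load` for YAML) that groups paths by their first tag:

```python
from collections import defaultdict

def chunk_by_tag(spec: dict) -> dict:
    """Split an OpenAPI spec into one mini-spec per tag.

    Shared sections (components, servers, security) are copied into
    every chunk so each prompt is self-contained.
    """
    shared = {k: v for k, v in spec.items() if k != "paths"}
    chunks = defaultdict(lambda: {**shared, "paths": {}})
    for path, methods in spec["paths"].items():
        first_op = next(iter(methods.values()))      # e.g. the GET operation
        tag = first_op.get("tags", ["untagged"])[0]  # first tag wins
        chunks[tag]["paths"][path] = methods
    return dict(chunks)

spec = {
    "openapi": "3.0.0",
    "paths": {
        "/users": {"get": {"tags": ["users"]}},
        "/orders": {"get": {"tags": ["orders"]}},
    },
}
for tag, chunk in chunk_by_tag(spec).items():
    print(tag, list(chunk["paths"]))
```

Copying the shared sections into every chunk matters: a chunk without its `components` schemas forces the model to guess types, which is exactly the hallucination path you’re trying to close.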
Step 2: Use Structured Prompts With Explicit Constraints
Generic prompts (“document this API”) produce generic, error-prone output. Be surgical.
Generate API documentation for the following endpoints from the OpenAPI spec below.
For each endpoint, include:
- Endpoint path and HTTP method
- One-sentence description
- Table of parameters (name, type, required/optional, description)
- Example request with realistic data
- Example response (200 and one error case)
Do NOT invent parameters not present in the spec.
If a field is unclear, write "[NEEDS CLARIFICATION]" instead of guessing.
[Paste spec chunk here]
That “Do NOT invent” instruction won’t eliminate hallucinations – models don’t follow rules perfectly – but it reduces them. The “[NEEDS CLARIFICATION]” directive creates explicit markers for human review instead of letting fabricated content blend in.
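Saving that prompt as a reusable template takes a few lines of standard library. A sketch with `string.Template` – the `$spec_chunk` placeholder name is mine, not a convention:

```python
from string import Template

DOC_PROMPT = Template("""\
Generate API documentation for the following endpoints from the OpenAPI spec below.
For each endpoint, include:
- Endpoint path and HTTP method
- One-sentence description
- Table of parameters (name, type, required/optional, description)
- Example request with realistic data
- Example response (200 and one error case)
Do NOT invent parameters not present in the spec.
If a field is unclear, write "[NEEDS CLARIFICATION]" instead of guessing.

$spec_chunk
""")

def build_prompt(spec_chunk: str) -> str:
    return DOC_PROMPT.substitute(spec_chunk=spec_chunk)

prompt = build_prompt("paths:\n  /users:\n    get: ...")
```

One template, one variable, one prompt per chunk – the constraints never drift between runs the way they do with hand-edited prompts.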
Step 3: Verify With a Second Pass
Generate docs for a chunk. Then run a verification pass with a *different* model or a fresh prompt.
Compare the following API documentation against the OpenAPI spec.
Identify any discrepancies:
- Parameters in the docs but not in the spec
- Missing required parameters
- Incorrect data types
- Mismatched endpoint paths
Output a list of errors only. No explanations.
[Paste generated docs]
[Paste original spec chunk]
Catches most hallucinations. Not all – LLMs can hallucinate consistently across prompts if the spec has ambiguity – but most.
Step 4: Automate the Stitching
You’ll end up with 5-10 separate doc chunks. Write a script to merge them into a single Markdown or HTML file with a consistent structure. Pandoc works well for this.
pandoc users-docs.md orders-docs.md payments-docs.md \
  -o complete-api-docs.html \
  --toc --standalone
Why automate this step? You’ll re-run the process every time your API changes, and manual stitching introduces copy-paste errors.
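If you’d rather skip the Pandoc dependency for Markdown-only output, a stdlib merge works too. A sketch – the filenames are illustrative, and the fixed chunk order is deliberate:

```python
from pathlib import Path

def stitch(chunk_files: list, out_file: str) -> None:
    """Concatenate doc chunks in a fixed order with separators.
    A stable order keeps git diffs readable across regenerations."""
    parts = [Path(f).read_text().strip() for f in chunk_files]
    Path(out_file).write_text("\n\n---\n\n".join(parts) + "\n")

# Demo with throwaway files; in practice these are your generated chunks.
Path("users-docs.md").write_text("# Users\n")
Path("orders-docs.md").write_text("# Orders\n")
stitch(["users-docs.md", "orders-docs.md"], "complete-api-docs.md")
```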
When Specialized Tools Are Worth It
The workflow above works with any LLM – ChatGPT, Claude, even local models. If you’d rather not build and maintain it yourself, platforms like Mintlify, Apidog, and Levo.ai solve the chunking and verification problems automatically.
Upload your OpenAPI spec. They parse it into logical sections. Generate docs per section. Validate output against the spec schema. Publish a navigable docs site.
Mintlify integrates AI at the system level – doesn’t just generate docs, keeps them synchronized with your codebase via Git hooks. Change an endpoint, commit, docs regenerate. No manual re-prompting.
Pricing: Mintlify has enterprise pricing (not public), Apidog starts at $12/user/month as of early 2026 (per vendor comparisons), Levo.ai charges custom rates. Compare that to running your own workflow with Claude API calls: processing a 150-endpoint API costs $2-5 in tokens per generation if you’re efficient. More if you hit the long-context premium (2x input, 1.5x output above 200K tokens per Claude’s pricing docs).
One-time project? DIY. Product with frequent API changes? The $50-100/month tool subscription pays for itself in saved verification hours.
The Limits You Can’t Code Around
Even with perfect chunking and verification, AI-generated API docs have a ceiling.
They’re great at the *what*: endpoint paths, parameter types, status codes. Bad at the *why*: architectural decisions, edge-case handling, business logic constraints.
Your OpenAPI spec says /users/{id} returns a 404 if the user doesn’t exist. Doesn’t explain that deleted users return 410 instead of 404. Or that admins querying soft-deleted users get a 200 with a deleted_at timestamp. That context lives in your team’s Slack threads and design docs, not the spec file.
AI can’t bridge that gap unless you explicitly feed it. Someone still has to write the “gotchas” sections by hand.
Your API changes. You regenerate docs. The new output reformats existing sections – reorders parameters, rephrases descriptions – creating noisy git diffs. Hard to spot actual content changes. You’ll spend time manually reviewing diffs to distinguish “AI rewrote the same info” from “AI documented a new field.”
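One partial mitigation: normalize the regenerated Markdown before committing, so cosmetic churn doesn’t drown out real changes. A minimal sketch that strips trailing whitespace and collapses blank-line runs (it won’t catch rephrased descriptions – nothing mechanical will):

```python
def normalize(md: str) -> str:
    """Reduce cosmetic diff noise in regenerated Markdown:
    strip trailing whitespace, collapse runs of blank lines."""
    out, prev_blank = [], False
    for line in md.splitlines():
        line = line.rstrip()
        if line == "":
            if not prev_blank:
                out.append(line)
            prev_blank = True
        else:
            out.append(line)
            prev_blank = False
    return "\n".join(out).strip() + "\n"

clean = normalize("# API  \n\n\n\nSame content.\n")
```

Run every generated chunk through this before the git diff; what’s left in the diff is closer to actual content change.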
A Setup You Can Actually Use Tomorrow
- OpenAPI spec. If you don’t have one, generate it from your code (Swagger for Java, FastAPI auto-generates for Python, `go-swagger` for Go). Don’t try to write docs without a structured spec – you’ll spend more time on boilerplate than the AI saves you.
- Splitting script. Quick Python or bash script to chunk your spec by resource. 20 endpoints per chunk.
- Prompt template. Save your structured prompt as a reusable template. Pass in the spec chunk as a variable.
- LLM access. Claude Sonnet 4 or GPT-4o via API. Budget $5-10/month for a mid-sized API (50-100 endpoints) if you regenerate docs quarterly.
- Verification script. Extract all parameter names and endpoint paths from both spec and generated docs. Compare them. Output discrepancies. Run this on every generation.
- Docs platform. Markdown + static site generator (Docusaurus, MkDocs) if you’re self-hosting. Mintlify or ReadMe if you want SaaS.
First-time setup: 4-6 hours. Subsequent regenerations: 20-30 minutes plus human review time.
What You Can Safely Ignore
The hype cycle claims AI will eliminate technical writing work entirely. It won’t.
What it *does* eliminate: manually typing out 80 parameter tables. Copy-pasting curl examples. Reformatting JSON response schemas into readable HTML. That’s 60-70% of the grunt work. The other 30% – explaining nuance, documenting undocumented behavior, writing getting-started guides – still requires a human.
Marketing around “AI keeps docs in sync with code automatically”? If you use a platform with Git integration, yes. But “in sync” ≠ “correct” – the AI regenerates based on the spec. Your spec is incomplete or outdated? The docs inherit those flaws.
“AI-generated docs are production-ready out of the box”? They’re draft-ready. You still need a human to verify accuracy. Add context. Fix hallucinations. Handle edge cases. Budget 30-40% of the time you’d spend writing from scratch for review and refinement.
FAQ
Can AI document APIs without an OpenAPI spec?
You can feed it raw code (Flask routes, Express.js handlers) and it’ll generate documentation. Accuracy drops to 40-50% in my testing. The model has to infer parameter types, required vs. optional fields, response structures from code context. Doing this more than once? Spend the hour generating an OpenAPI spec. Swagger automates most of it.
Which AI model is best for API documentation?
Claude Sonnet 4 and GPT-4o are roughly tied for accuracy on structured tasks. Claude has a larger context window (200K vs. 128K) but smaller output limit (8K vs. 16K). For chunked workflows processing 15-20 endpoints per prompt? Output limit matters more than context size. GPT-4o wins. For massive single-shot attempts – not recommended – Claude’s context window gives it an edge. One catch: Claude costs more above 200K input tokens (2x base rate). For most projects, that’s irrelevant because you’re chunking anyway.
How do I prevent AI from inventing API parameters that don’t exist?
Three layers. First: explicit instruction in your prompt telling it not to guess. Second: verification pass with a second prompt comparing generated docs to the spec. Third: automated script that extracts parameter names from both spec and docs and flags mismatches. Even with all three, expect a 1-2% false positive rate – you’ll catch most hallucinations, not all. Budget time for manual spot-checks on the 5-10 most important endpoints in your API.
Next step: Grab your OpenAPI spec. Run it through the chunking workflow with one section. Time how long verification takes. That’s your real cost per endpoint – use it to decide whether AI acceleration is worth it for your API’s scale.