You’re paying Claude to say “I’d be happy to help you with that.” That preamble costs 8 tokens. The actual answer? Another 12. The ratio is backwards.
A Reddit user just proved you can cut Claude’s output tokens by 75% by making it talk like a caveman. Normal Claude uses ~180 tokens for a web search task. Caveman Claude? 45 tokens. Same result.
This week, that meme became a production tool. The “caveman” Claude Code skill is blowing up on GitHub, Hacker News, and developer Twitter. But what nobody’s saying: it doesn’t work the way you think.
The Output Token Trap: Why Most LLM Costs Hide in the Wrong Place
Every Claude response has two token buckets: thinking tokens (internal reasoning) and output tokens (what you see). Caveman compression only touches output – thinking and reasoning stay untouched.
This matters. A March 2026 paper found that forcing large models to give brief responses improved accuracy by 26 percentage points on certain benchmarks. Brevity didn’t make Claude dumber. Made it focus.
The mechanics: Claude processes your question with full reasoning depth. Then strips filler from the answer. “The reason your React component is re-rendering is likely because you’re creating a new object reference on each render cycle” → “New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo.”
Technical substance? Identical. Token count: 69 versus 19. That’s a 72% drop.
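The drop is simple to compute for your own before/after pairs. A minimal sketch (plain arithmetic, no tokenizer assumed; plug in counts from whatever tokenizer you trust):

```python
def percent_drop(before_tokens: int, after_tokens: int) -> int:
    """Percentage of output tokens removed by compression."""
    return round((before_tokens - after_tokens) / before_tokens * 100)

# The React example above: 69 tokens down to 19
print(percent_drop(69, 19))   # 72
# The web-search example from the intro: 180 down to 45
print(percent_drop(180, 45))  # 75
```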
When Caveman Mode Fails (and Nobody Tells You)
Agentic workflows break the savings model.
One developer ran caveman mode in a real Claude Code session with tool use. Minimal savings. In agentic workflows, intermediate agent turns far outweigh the final answer, so a 75% cut to the last message barely dents total token spend.
Think of it this way: Claude makes 8 tool calls (file reads, searches, executions) before answering. Each tool call = input token event. Your final compressed answer saves 100 output tokens, about $0.0015 at Sonnet output pricing. But the session already burned 15,000 input tokens, about $0.045. Savings as a share of total cost? Low single digits.
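Running those numbers (prices are the per-million Sonnet rates quoted later in this article; everything else is the example's assumptions):

```python
INPUT_PRICE = 3 / 1_000_000    # $ per input token (Sonnet-class rate)
OUTPUT_PRICE = 15 / 1_000_000  # $ per output token

session_input_cost = 15_000 * INPUT_PRICE  # tokens burned across tool calls
caveman_savings = 100 * OUTPUT_PRICE       # compressed final answer

share = caveman_savings / (session_input_cost + caveman_savings)
print(f"{share:.1%} of total session cost")  # a small slice of total spend
```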
Caveman mode shines: single-shot responses (code review, function summaries, error explanations). Struggles: multi-turn, tool-heavy workflows.
Pro tip: Pair caveman mode with Claude’s prompt caching. Cache reads cost 10% of the standard input price (writes cost 25% more), so caching pays for itself after a single cache hit within the 5-minute window. Compress the output, cache the input. That’s where real savings stack.
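The break-even is easy to see in relative terms. A quick sketch; the multipliers are assumptions from Anthropic’s published pricing (writes 1.25x base input, reads 0.10x):

```python
BASE = 1.0          # relative cost of re-sending the context uncached
CACHE_WRITE = 1.25  # first send, with caching enabled
CACHE_READ = 0.10   # each subsequent hit within the cache TTL

def relative_cost(reads: int, cached: bool) -> float:
    """Total relative input cost: one initial send plus `reads` re-sends."""
    if cached:
        return CACHE_WRITE + reads * CACHE_READ
    return BASE + reads * BASE

# One cache read already beats re-sending the full context: 1.35 vs 2.0
print(relative_cost(1, cached=True), relative_cost(1, cached=False))
```

The write premium means caching a context you never re-read is a net loss; it wins from the first re-read onward.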
Install Caveman Mode in Under 60 Seconds
Fastest route: prebuilt skill. Terminal:
npx skills add JuliusBrussee/caveman
Done. Installs globally across all Claude Code sessions.
Activate it mid-conversation:
- Type /caveman or $caveman
- Say “caveman mode” or “less tokens please”
- Persists until you type “stop caveman” or “normal mode”
Three intensity levels: lite (drop filler, keep grammar), full (drop articles, use fragments), ultra (maximum compression, telegraphic). Default = full. Switch: “caveman lite” or “caveman ultra.”
Manual control without the skill? Add this to your system prompt:
You are in caveman mode. Rules:
- Drop articles (a, an, the)
- Drop filler (just, really, actually, simply)
- Drop pleasantries (sure, certainly, happy to)
- Short synonyms only (fix not "implement solution")
- No hedging (skip "might be worth considering")
- Fragments fine. Technical terms stay exact.
- Code blocks unchanged. Caveman speak around code, not in it.
Save as a Claude Code custom output style. Toggle on demand. (Or just paste it into the web UI – works there too, minus the three intensity levels.)
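If you want to preview the effect locally, here’s a toy filter implementing the first three rules. This is not the actual skill, just a rough approximation for experimenting with your own text:

```python
import re

# Toy local approximation of the caveman rules above -- NOT the skill.
ARTICLES = r"\b(a|an|the)\b"
FILLER = r"\b(just|really|actually|simply)\b"
PLEASANTRIES = r"\b(sure|certainly|happy to help)\b"

def cavemanize(text: str) -> str:
    """Strip pleasantries, filler, and articles, then collapse whitespace."""
    for pattern in (PLEASANTRIES, FILLER, ARTICLES):
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(cavemanize("I'd really just wrap the object in useMemo."))
# -> "I'd wrap object in useMemo."
```

The real skill is smarter about grammar and leaves code blocks alone; this regex version will happily mangle sentence-initial capitals.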
Real Benchmarks: What 75% Actually Means
Actual token savings: 22-87% across different prompts. The 75% figure is real. It’s an average, not a guarantee.
Highest compression tasks:
- Function explanations: 80-87% savings. Claude normally writes a paragraph. Caveman gives you two sentences.
- Error debugging: 70-75%. Drops the “I see the issue” preamble, jumps to the fix.
- Code reviews: 65-70%. Strips reasoning narrative, keeps findings.
Lowest compression tasks:
- Short refactors: 22-30%. Not much filler to remove in a 40-token response.
- Yes/no questions: 25-35%. Already brief. Caveman just removes “Yes,”.
Financial impact depends on volume. Claude 3.5 Sonnet pricing (as of early 2026): $3 input / $15 output per million tokens. Running 10,000 API calls/day with 200 avg output tokens costs ~$30/day. Caveman mode at 70% reduction: $9/day. ~$7,665 annual savings.
| Scenario | Daily Output Tokens | Standard Cost | Caveman Cost (70% reduction) | Annual Savings |
|---|---|---|---|---|
| Light API use | 500K | $7.50 | $2.25 | $1,916 |
| Moderate pipeline | 2M | $30 | $9 | $7,665 |
| Heavy automation | 10M | $150 | $45 | $38,325 |
Input token costs not included. For complete cost modeling, check Anthropic’s official pricing.
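The table rows fall out of one formula. A quick sketch to model your own volume (output price is the Sonnet rate quoted above; reduction rate is the assumed 70%):

```python
OUTPUT_PRICE_PER_M = 15.0  # $ per million output tokens (Sonnet-class rate)

def annual_savings(daily_output_tokens: int, reduction: float = 0.70) -> int:
    """Yearly dollars saved by compressing output volume by `reduction`."""
    daily_cost = daily_output_tokens / 1e6 * OUTPUT_PRICE_PER_M
    return round(daily_cost * reduction * 365)

for daily in (500_000, 2_000_000, 10_000_000):
    print(annual_savings(daily))  # 1916, 7665, 38325 -- the table rows above
```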
The Auto-Clarity Gotcha Nobody Mentions
A feature that changes the ROI calculation: the caveman skill includes Auto-Clarity, which automatically drops compression for security warnings and destructive operations.
Result: if you ask Claude to delete files, modify authentication logic, or perform any high-risk operation, caveman mode turns itself off for that response. You get full, verbose warnings.
Your actual savings fluctuate based on task mix. A session with 30% destructive operations won’t hit 75% average compression. More like 50-55% because Auto-Clarity kicked in.
Most tutorials skip this. The skill isn’t a flat 75% discount. It’s adaptive. And the adaptation is silent.
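A blended estimate is simple to model. Pure arithmetic on the article’s figures; the 30% destructive-operation share is illustrative:

```python
def effective_compression(caveman_rate: float, verbose_share: float) -> float:
    """Blended savings when Auto-Clarity forces a share of responses verbose."""
    return caveman_rate * (1 - verbose_share)

# 75% compression, but 30% of responses trip Auto-Clarity (0% compression):
blended = effective_compression(0.75, 0.30)
print(round(blended * 100, 1))  # lands inside the 50-55% range above
```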
Caveman + Caching + Batch: The Untested Triple Stack
Theory says you can combine three cost levers: caveman output compression (75%), prompt caching (90% savings on repeated context), and Batch API (50% discount on all tokens for async workloads).
Anthropic’s docs claim caching + batch together can hit 95% savings on eligible workloads. Add caveman? Theoretically 97-98% total reduction.
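The stacked number depends heavily on your input/output token mix. A back-of-envelope model under assumed multipliers (cache read = 0.10x input, batch = 0.50x everything, caveman = 0.25x output):

```python
def stacked_cost(input_share: float) -> float:
    """Relative cost after stacking all three levers, for a given token mix."""
    output_share = 1 - input_share
    cached_batched_input = input_share * 0.10 * 0.50
    caveman_batched_output = output_share * 0.25 * 0.50
    return cached_batched_input + caveman_batched_output

# Input-heavy RAG-style workload vs. an even input/output split:
for mix in (0.95, 0.50):
    print(round((1 - stacked_cost(mix)) * 100, 1))  # % total reduction
```

Under these simplified multipliers the stack caps near 95% even with near-total cache coverage, which is one more reason to measure rather than trust a headline number.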
Catch: nobody has documented a production case running all three. Interaction effects = unknown. Does batch processing interfere with cache hit rates? Does caveman mode affect cache key consistency? Open questions.
If you’re running high-volume pipelines, this is the frontier. Test it, measure it, publish the results. Right now? Pure theory.
How This Compares to Academic Prompt Compression
Caveman: system prompt hack. Microsoft’s LLMLingua: actual compression infrastructure.
LLMLingua-2 achieves 2-5x compression ratios, is 3-6x faster than other methods, reduces latency by 1.6-2.9x. Uses a small language model (GPT-2 or BERT-based) to identify and remove non-essential tokens before sending the prompt to the target LLM.
Key difference: LLMLingua compresses input (your prompts). Caveman compresses output (Claude’s responses). For RAG pipelines stuffing 50,000-token documents into context? LLMLingua. For API automation where responses are verbose? Caveman.
They’re complementary. Use LLMLingua to shrink your context, caveman to shrink Claude’s replies. Combined impact: you’re looking at double-digit percentage reductions in both directions.
Start with This Exact Test
Don’t trust the 75% claim. Measure your own tasks.
Run this:
- Pick your 5 most common Claude tasks. Code review, function summary, debugging – whatever you run daily.
- Run each task in normal mode. Copy response. Count tokens using OpenAI’s tokenizer.
- Activate caveman mode (/caveman). Run same 5 tasks. Count tokens again.
- Calculate actual savings percentage per task.
- Multiply by your monthly API call volume. That’s your real savings estimate.
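A throwaway harness for steps 2-4. The 4-characters-per-token heuristic is a rough assumption for English prose; use a real tokenizer for billing-grade numbers:

```python
def approx_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def savings_pct(normal: str, caveman: str) -> float:
    """Percentage of estimated tokens removed between two responses."""
    n, c = approx_tokens(normal), approx_tokens(caveman)
    return round((n - c) / n * 100, 1)

normal = ("The reason your React component is re-rendering is likely because "
          "you're creating a new object reference on each render cycle.")
caveman = "New object ref each render. Wrap in useMemo."
print(savings_pct(normal, caveman))
```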
Average above 60%? Caveman mode is production-ready. Below 40%? Your tasks might already be brief, or Auto-Clarity is intervening too often.
The key metric: dollars saved per month. If that number exceeds the time cost of setup and testing, ship it.
FAQ
Does caveman mode make Claude less accurate?
No. Caveman compression only affects output tokens – thinking and reasoning tokens remain untouched, so Claude’s internal reasoning quality is unchanged. For straightforward tasks (summarizing functions, identifying errors), accuracy is unaffected. For complex or nuanced explanations, compression can occasionally strip out qualifications or context, but the core technical information stays intact. One edge case: if you’re asking Claude to explain a subtle distinction (e.g., “What’s the difference between useMemo and useCallback?”), caveman mode might compress the answer so much that the nuance gets lost. In those cases, toggle back to normal mode for that one question.
Can I use caveman mode with Claude’s web interface, or is it Claude Code only?
Works anywhere. The official skill installs via Claude Code, but the technique is just a system prompt. Add the caveman rules to your conversation: “Drop articles, filler, and pleasantries. Use short synonyms. Fragments fine. Keep technical terms exact.” Claude will adapt. You won’t get the three intensity levels or Auto-Clarity, but you’ll get compression.
If I’m already on a Claude subscription (Pro or Max), does token compression even matter?
Yes. Subscriptions have usage caps that reset on rolling 5-hour windows. One developer tracked 10 billion tokens over 8 months – estimated API cost was $15,000, but Max subscription cost $800 total, a 93% saving. Caveman mode extends how far you get before hitting rate limits. Fewer tokens per response = more responses per window. Translates to more productivity per dollar on fixed-price plans.