Claude Code Extended Thinking: It’s a Summary, Not the Truth

Claude Code's extended thinking is a summary, not the actual reasoning. Here's what that means for your debugging – and three workarounds that work.

Jamie Lin · 2026-06-22 · 8 min read · Beginner

So is what you see in Claude Code’s extended thinking pane the model’s actual reasoning, or a polished retelling? That’s the question blowing up on Hacker News after a blog post arguing the Ctrl+O view is a summary, not the real chain of thought. The comments got loud fast. People who’d been quoting those traces in PR reviews, bug reports, and audit logs suddenly weren’t sure what they’d been quoting.

Here’s the short answer: the post is right. And it matters more than the hot-take headlines made it sound – but probably not in the way most people assumed. This tutorial walks through what’s actually happening, how to verify it on your own machine, and how to adjust your Claude Code workflow so you stop treating the summary like a transcript.

What Claude Code’s extended thinking actually returns

The trending claim isn’t a leak or a reverse-engineered secret. It’s right there in Anthropic’s own documentation, written in cautious language that’s easy to skim past. Per the official docs, with extended thinking enabled, the Messages API for Claude 4 models (as of mid-2025) returns a summary of the full thinking process. Not the process itself.

The mechanics get weirder when you read the next paragraph. Summarization is processed by a different model than the one you target in your requests, and the thinking model never sees the summary. So what you read in Ctrl+O is one model’s recap of another model’s private reasoning – and the reasoning model has no idea what’s being said on its behalf.

The original Anthropic launch pitch was different. In the February 2025 announcement, the team said they’d deliberately chosen to make Claude’s thought process visible in raw form – for trust, for alignment research, and because it was “fascinating to watch Claude think.” That was Claude 3.7 Sonnet. With Claude 4 (and especially Opus 4.6 and Sonnet 4.6, as of mid-2025), the default shifted to summarized output, and getting the raw stream now requires contacting Anthropic sales.

Prove it to yourself in three minutes

This is the part most posts skip. If you’re on a Max or Pro plan running Claude Code locally, you can verify the gap with no MITM proxy, no dev tools – just your shell.

1. Open a fresh Claude Code session and ask it something non-trivial: “review this function for SQL injection and explain your reasoning step by step.”
2. Hit Ctrl+O and read what shows up in the transcript view.
3. Now look at the raw session file. Session files live in ~/.claude/projects/<your-project>/. Open the most recent .jsonl file.
4. Search for "type": "thinking". If nothing matches, search again without the space after the colon, since compact JSON omits it. You’ll find the thinking blocks – each with a thinking text field and an opaque signature field.

On Sonnet 4.6 you’ll typically see a populated summary. On Opus 4.6, community testing has caught a stranger pattern: blocks where the thinking field is empty but the signature is valid – proof the model reasoned, proof you got billed for it, and zero content delivered. (GitHub issue #31143 tracks this.)
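The manual check above can also be scripted. What follows is a minimal sketch, not an official tool. It assumes each line of the session file is a JSON object and that thinking blocks are dicts with "type": "thinking" plus thinking and signature fields, as described above. The exact nesting of the session format is an assumption I have not pinned down, so the sketch walks every nested value instead of relying on a fixed path. Both function names are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_blocks(node: Any, block_type: str) -> Iterator[dict]:
    """Yield every nested dict whose "type" equals block_type."""
    if isinstance(node, dict):
        if node.get("type") == block_type:
            yield node
        for value in node.values():
            yield from iter_blocks(value, block_type)
    elif isinstance(node, list):
        for item in node:
            yield from iter_blocks(item, block_type)


def summarize_thinking(jsonl_text: str) -> dict:
    """Count thinking blocks and flag those with a signature but no text."""
    stats = {"blocks": 0, "empty_with_signature": 0, "visible_chars": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a valid JSON line
        for block in iter_blocks(record, "thinking"):
            text = block.get("thinking")
            if not isinstance(text, str):
                text = ""  # treat a missing or non-string field as empty
            stats["blocks"] += 1
            stats["visible_chars"] += len(text)
            if not text.strip() and block.get("signature"):
                stats["empty_with_signature"] += 1
    return stats
```

To try it on a real session, read the file with pathlib.Path(path).read_text() and pass the result in. A non-zero empty_with_signature count is the pattern described above: a signed block with nothing to read.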

Pro tip: Diff what Ctrl+O shows you against the raw .jsonl after every meaningful session. If you’re keeping the trace as part of a PR review or compliance log, the file is the artifact – not the terminal scrollback. Terminal output has its own bugs too: on Claude Code 2.1.119 (as of mid-2025), the Ctrl+O transcript view doesn’t tail new content, so it’s literally a frozen snapshot until you toggle it off.

The billing detail that surprised even seasoned users

Here’s where the summary-vs-real-thinking debate stops being philosophical and starts costing money. You pay for the hidden tokens, not the ones you read.

What you see in the response	What you actually pay for
The summarized thinking text	The full internal reasoning the model generated – can be substantially larger on complex tasks
The final answer tokens	The final answer tokens

The docs are direct about it: billed output tokens won’t match the visible count in the response. Community proxy logs confirm the gap exists – exact ratios depend on task complexity, but the API doesn’t return a separate thinking_tokens field; it’s all lumped into output_tokens. You’re flying partially blind on cost.
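To get a feel for the gap, the sketch below puts the billed output_tokens next to a rough estimate of the tokens you can actually read. It assumes a Messages API response already parsed into a dict, with a content list and a usage.output_tokens field. The four-characters-per-token ratio is a crude assumption for illustration, not Anthropic's tokenizer, and tool-call inputs are ignored, so treat the result as an order-of-magnitude hint. The function name is hypothetical.

```python
CHARS_PER_TOKEN = 4.0  # rough assumption, NOT the real tokenizer


def estimate_hidden_tokens(response: dict) -> dict:
    """Compare billed output tokens with a rough estimate of visible ones."""
    visible_chars = 0
    for block in response.get("content", []):
        if block.get("type") == "thinking":
            visible_chars += len(block.get("thinking") or "")
        elif block.get("type") == "text":
            visible_chars += len(block.get("text") or "")
        # tool_use inputs also count as output but are ignored in this sketch

    billed = response.get("usage", {}).get("output_tokens", 0)
    visible_estimate = round(visible_chars / CHARS_PER_TOKEN)
    return {
        "billed_output_tokens": billed,
        "visible_tokens_estimate": visible_estimate,
        # Clamp at zero: the heuristic can overshoot on short responses.
        "unseen_tokens_estimate": max(billed - visible_estimate, 0),
    }
```

The only hard number here is billed_output_tokens; the other two fields are estimates and should be labeled as such wherever you log them.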

When your Max plan limits evaporate on a Tuesday afternoon, part of the missing quota is invisible reasoning you can’t audit, sample, or even count.

How to adjust your workflow today

The quality boost from extended thinking is real – that’s worth saying upfront. But the workflow most tutorials teach – “hit Ctrl+O, watch Claude think, intervene with Ctrl+C if you see it going wrong” – is built on a premise that no longer holds for Claude 4 models. You’re not watching it think. You’re reading a recap written by a separate model that your target model never saw.

Three concrete changes worth making:

Stop quoting thinking traces as evidence. If you’ve been pasting Ctrl+O output into PR descriptions or post-mortems to explain why Claude did X, label it for what it is: a summary by a separate model. The original reasoning is encrypted in the signature and you don’t have the key.
Use the trace as a smoke alarm, not a transcript. The summary is still useful for catching obviously wrong directions early – if the recap says “I’ll assume the database is Postgres” and you’re on MySQL, that’s a real signal. Just don’t expect it to faithfully report every tool call or branch the model considered.
If you need true auditability, escalate the plan. Per Anthropic’s docs, full thinking output for Claude 4 models is available only through an enterprise sales conversation. Regulated industries running Claude in production should probably already be having that conversation.

Common pitfalls

The biggest one: assuming “ultrathink” or high effort gives you more visible thinking. It gives the model more internal budget, but the summarizer still compresses what you see. More tokens burned, not proportionally more transparency.

Second one, subtler: people round-tripping conversations through the API and filtering content blocks on block.type == "thinking". The docs warn that this silently drops redacted_thinking blocks and breaks the multi-turn protocol. If you’ve got a script doing this, multi-turn tool-use sessions will behave inconsistently and you won’t know why.
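The failure mode is easier to see side by side. In this sketch the content blocks are plain dicts shaped like Messages API blocks, the sample values are made up, and the helper names are hypothetical. The first function reproduces the exact-match filter, the second widens it, and the third sidesteps the problem by passing the assistant turn back untouched.

```python
def thinking_only(blocks: list[dict]) -> list[dict]:
    """Anti-pattern: an exact type match silently skips redacted blocks."""
    return [b for b in blocks if b.get("type") == "thinking"]


def reasoning_blocks(blocks: list[dict]) -> list[dict]:
    """Keep both visible and redacted reasoning blocks, in order."""
    wanted = {"thinking", "redacted_thinking"}
    return [b for b in blocks if b.get("type") in wanted]


def assistant_turn_for_history(blocks: list[dict]) -> list[dict]:
    """Simplest safe option: send the content back unmodified (as a copy)."""
    return list(blocks)


# Made-up sample content for one assistant turn.
turn = [
    {"type": "thinking", "thinking": "Summary text", "signature": "sig-1"},
    {"type": "redacted_thinking", "data": "opaque-bytes"},
    {"type": "tool_use", "id": "call-1", "name": "Read", "input": {"file_path": "a.py"}},
]
```

If you want to hide reasoning from your own UI, filter at display time and keep the stored history complete.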

Third one: trusting the summary in interleaved-thinking-plus-tool-use sessions. A model can call functions during the hidden reasoning phase, and the summary may not surface every call. That’s the angle the Hacker News commenters got most worked up about – and fairly. If an agent has shell access and you’re reading a tidy recap, the recap is not your security audit.
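If the recap is not the audit, the recorded tool calls are a better place to start. The sketch below pulls tool_use blocks out of one assistant message's content list. It assumes blocks shaped like the Messages API's tool_use blocks (type, name, input); whether your session files nest them the same way is something to check against your own .jsonl. The function name is hypothetical.

```python
import json


def list_tool_calls(blocks: list[dict]) -> list[str]:
    """Return one line per tool_use block, in the order the blocks appear."""
    calls = []
    for block in blocks:
        if block.get("type") != "tool_use":
            continue
        name = block.get("name", "<unknown>")
        # sort_keys keeps the rendering stable across runs, which helps diffs
        args = json.dumps(block.get("input", {}), sort_keys=True)
        calls.append(f"{name} {args}")
    return calls
```

Comparing this list with what the summary mentions shows quickly which calls the recap left out.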

How this compares to OpenAI and DeepSeek

DeepSeek’s R1 exposes the full chain-of-thought as plain text in the response – no summarizer, no encryption, no enterprise gate. That’s the transparent end of the spectrum. OpenAI’s o-series models are at the other end: reasoning tokens hidden completely by default, billed via a separate labeled field on the API response, no summary offered unless you request one. Claude sits between the two – more visible than OpenAI’s approach, but the visibility comes with an asterisk the size of a summarizer model.
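In code, the difference shows up in where you have to look. This sketch reads already-parsed response dicts. The field names reflect my understanding of each API (reasoning_content on DeepSeek's chat message, usage.completion_tokens_details.reasoning_tokens on OpenAI's Chat Completions, thinking content blocks on Anthropic's Messages API) and should be checked against the current docs. The function and its return shape are hypothetical.

```python
def reasoning_view(provider: str, response: dict) -> dict:
    """Return the reasoning text or token count this sketch can read."""
    if provider == "deepseek":
        message = response["choices"][0]["message"]
        # Plain-text chain of thought; token counts are not read here.
        return {"text": message.get("reasoning_content"), "tokens": None}

    if provider == "openai":
        details = response.get("usage", {}).get("completion_tokens_details", {})
        # No reasoning text by default, only a labeled token count.
        return {"text": None, "tokens": details.get("reasoning_tokens")}

    if provider == "anthropic":
        parts = [
            block.get("thinking") or ""
            for block in response.get("content", [])
            if block.get("type") == "thinking"
        ]
        # Summary text only; thinking tokens are folded into output_tokens.
        return {"text": "\n".join(parts) or None, "tokens": None}

    raise ValueError(f"unknown provider: {provider}")
```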

That middle position is the source of the confusion. The feature looks transparent, which makes the gap between appearance and reality feel like a bait-and-switch – even though the docs spell it out. OpenAI never implied you’d see the reasoning. Claude did, in February 2025, and then quietly changed what “seeing” meant.

FAQ

Is Anthropic hiding the real thinking on purpose?

Yes – and they say why in the docs: summarization “prevents misuse,” mostly meaning it makes it harder for competitors to distill their own models from raw reasoning traces. Whether that justification holds up is a separate debate.

Does this mean extended thinking is useless for debugging?

No, just narrower than people think. Imagine you’re debugging why Claude rewrote a function the wrong way. The summary will usually tell you the high-level wrong assumption it made – “I treated this as async when it’s sync” – and that’s enough to correct course on the next turn. What the summary won’t reliably tell you is the exact order of tool calls, the alternatives it considered and discarded, or whether it second-guessed itself mid-way. For root-cause analysis at that depth, you’d need the raw thinking, which means an enterprise agreement.

Should I keep using ultrathink / high effort?

Use it for hard architecture decisions and complex multi-file debugging. For everything else, medium effort is fine – and watch your session settings. A forgotten high-effort configuration can persist across sessions and quietly drain your quota for days before you notice.

Try this next: open your most recent Claude Code session file in ~/.claude/projects/, grep for "type": "thinking", and compare the contents to whatever you remember from the terminal. That five-minute exercise will recalibrate how much weight you give the Ctrl+O view from now on.
