You just upgraded to OpenAI’s o3-mini. Your code runs. No errors. But your math bot suddenly can’t solve high school algebra.
The model works – it’s just not thinking anymore.
This is the reasoning effort level: the parameter that made "does it even work?" a trick question.
The Hidden Default That Breaks Upgrades
GPT-5.1 defaults reasoning_effort to none. Zero reasoning. Your reasoning model became a regular GPT.
Migrating from o3-mini or o1? Code that worked last month now silently stops doing the hard thinking you paid for. Microsoft’s docs warn: “When upgrading to gpt-5.1 keep in mind that you may need to update your code to explicitly pass a reasoning_effort level.”
Not a bug. Design choice – GPT-5 is a hybrid that can reason but doesn’t by default. No parameter? You get fast GPT that skips internal chain-of-thought.
```python
from openai import OpenAI

client = OpenAI()

# This DOES reason (o3-mini, as of Jan 2025)
response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Solve for x: 2x + 5 = 13"}]
)

# This DOES NOT reason (gpt-5.1) unless you add reasoning_effort
response = client.responses.create(
    model="gpt-5.1",
    reasoning={"effort": "medium"},  # ← Without this, no reasoning happens
    input=[{"role": "user", "content": "Solve for x: 2x + 5 = 13"}]
)
```
Check your migrations. Swapped model names but didn’t add reasoning? You downgraded.
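One way to audit a migration is a crude source sweep for Responses API calls that target a gpt-5 model but never mention reasoning. This is a sketch, not a parser: the 400-character window and the string matching are rough assumptions, good for a first pass only.

```python
def flag_missing_reasoning(source: str) -> list[str]:
    """Flag responses.create(...) calls on gpt-5 models with no reasoning kwarg.

    Crude regex-free audit: scans a fixed window after each call site,
    so deeply nested or very long calls may be missed. Quick sweep only.
    """
    hits = []
    for i, chunk in enumerate(source.split("responses.create(")[1:], 1):
        call = chunk[:400]  # rough window over the call's keyword arguments
        if "gpt-5" in call and "reasoning" not in call:
            hits.append(f"call #{i} targets gpt-5 without reasoning_effort")
    return hits

code = '''
resp = client.responses.create(
    model="gpt-5.1",
    input="hello",
)
'''
print(flag_missing_reasoning(code))  # one flagged call
```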
What Reasoning Effort Actually Controls
When you set reasoning_effort, you’re telling the model how many hidden tokens to spend on internal chain-of-thought before it writes the final answer.
Three levels exist across most models (o3-mini, o3, o4-mini, GPT-5 as of early 2025):
- Low – Minimal internal reasoning. Fast, cheap, good enough for straightforward problems.
- Medium – Balanced. Default for ChatGPT (when reasoning is enabled). Suitable for most coding, math, technical questions.
- High – Maximum thinking. Searches multiple reasoning paths, backtracks, verifies its work. Slower and pricier, but catches edge cases.
Newer models (GPT-5.1 and later) add two extremes:
- None – Disables reasoning entirely. Behaves like standard GPT.
- Xhigh – Even deeper than high. Available only on specific variants like gpt-5.1-codex-max.
This isn’t prompt engineering – it’s compute allocation. OpenAI’s API docs say higher effort “guides the model on how many reasoning tokens to generate before creating a response.”
Those reasoning tokens? Invisible. Don’t appear in message.content. But they show up in usage.completion_tokens and you pay for them.
When to Use Each Level (The Real Answer)
Forget the toy examples.
Use low:
- Task is straightforward and context-rich (“Fix this typo in line 34”)
- Speed matters (real-time chat, autocomplete)
- You’re batch-processing hundreds of requests and cost adds up
Use medium:
- You’re not sure (seriously – it’s the Goldilocks default)
- Problem needs multiple logical steps but not deep exploration
- Code generation, data transformations, moderately complex Q&A
Use high:
- Correctness matters more than speed (competitive programming, research-level math)
- Problem has subtle gotchas or needs proof-checking
- You’ve tried medium and the answer was wrong or incomplete
Community testing found: medium takes ~3x longer than low, high takes ~3x longer than medium. 5-second timeout? High effort might not finish.
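The guidelines above can be collapsed into a rough selection heuristic. Everything here is illustrative: the function name is hypothetical, and the latency thresholds simply assume the community-reported ~3x-per-level ratio with a ~1-second baseline for low.

```python
def pick_effort(needs_multi_step: bool, correctness_critical: bool,
                latency_budget_s: float) -> str:
    """Map task traits to an effort level.

    Thresholds assume the ~3x latency ratio per level reported by
    community testing, with low taking roughly 1 second. Tune for
    your own workload; these numbers are not official.
    """
    if correctness_critical and latency_budget_s >= 9:  # ~1s * 3 * 3
        return "high"
    if needs_multi_step and latency_budget_s >= 3:
        return "medium"
    return "low"

print(pick_effort(False, False, 1.0))   # simple task, tight budget
print(pick_effort(True, False, 5.0))    # multi-step, moderate budget
print(pick_effort(True, True, 30.0))    # correctness-critical, generous budget
```

Note the fallback: if the latency budget can't absorb high effort, the heuristic degrades to a cheaper level rather than risking a timeout mid-reasoning.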
Pro tip: (OpenAI’s own guidance) Reasoning models are “like a senior coworker – give them a goal and trust them to work out the details.” That means less hand-holding in your prompt. Skip “think step-by-step” instructions – the model already does that internally when reasoning_effort is set.
The Token Bill You’re Not Seeing
Reasoning tokens count toward your completion_tokens total. They don’t show up in the response. Microsoft’s docs: “These are hidden tokens that aren’t returned as part of the message response content but are used by the model to help generate a final answer.”
This means:
- Counting words in the output to estimate cost? You’re undercounting by 2-10x
- Caching responses based on visible content? You’re missing the bulk of the work
- Set max_tokens based on expected output length? Model might hit the limit during reasoning and never produce an answer
Check usage.completion_tokens_details.reasoning_tokens in the response object. That’s what you’re actually paying for.
```python
response = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Prove the Pythagorean theorem"}],
    # Note: for reasoning models, use max_completion_tokens, not max_tokens
    max_completion_tokens=5000
)

details = response.usage.completion_tokens_details
print(f"Visible tokens: {response.usage.completion_tokens - details.reasoning_tokens}")
print(f"Hidden reasoning tokens: {details.reasoning_tokens}")
print(f"Total billed: {response.usage.completion_tokens}")
```
Early 2025 pricing: o3-mini costs $1.10/$4.40 per million tokens. 63% cheaper than o1-mini, 93% cheaper than o1. But reasoning tokens balloon at high effort – savings shrink fast.
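To see how reasoning tokens dominate the bill at high effort, here is a back-of-envelope cost sketch using the early-2025 o3-mini rates quoted above ($1.10 input / $4.40 output per million tokens). The function is hypothetical and the rates will drift; check the current pricing page before relying on this.

```python
# Illustrative rates from early-2025 o3-mini pricing; verify before use.
INPUT_RATE = 1.10 / 1_000_000   # $ per input token
OUTPUT_RATE = 4.40 / 1_000_000  # $ per output token (hidden reasoning included)

def estimate_cost(prompt_tokens: int, visible_tokens: int,
                  reasoning_tokens: int) -> float:
    # Reasoning tokens bill at the output rate even though you never see them.
    return (prompt_tokens * INPUT_RATE
            + (visible_tokens + reasoning_tokens) * OUTPUT_RATE)

# Same visible answer, different effort: reasoning tokens drive the difference.
low = estimate_cost(prompt_tokens=200, visible_tokens=500, reasoning_tokens=800)
high = estimate_cost(prompt_tokens=200, visible_tokens=500, reasoning_tokens=8000)
print(f"low effort:  ${low:.5f}")
print(f"high effort: ${high:.5f}")
```

Ten times the reasoning tokens means roughly six times the total cost here, with the visible answer unchanged.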
The Verification Trap (API Users Only)
See this error?
```json
{
  "error": {
    "message": "Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization.",
    "type": "invalid_request_error",
    "param": "reasoning.summary",
    "code": "unsupported_value"
  }
}
```
Some libraries (like LiteLLM) automatically add reasoning.summary when you set reasoning_effort. This parameter needs a verified OpenAI organization – which most individual developers don’t have.
Fix: verify your org (if you’re a business) or manually strip the summary parameter if your library allows it. No workaround – this is an OpenAI policy gate.
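If your library builds the request payload as a plain dict before sending, one option is a small scrubbing step. This helper is hypothetical (not part of any SDK) and assumes a Responses API-shaped payload where reasoning settings live under a reasoning key.

```python
def strip_unverified_params(request_kwargs: dict) -> dict:
    """Remove reasoning.summary from a Responses API-style payload.

    Hypothetical helper for when a wrapper injects reasoning.summary
    but your org isn't verified. Leaves reasoning.effort (and every
    other field) intact, and never mutates the input dict.
    """
    cleaned = dict(request_kwargs)
    reasoning = dict(cleaned.get("reasoning") or {})
    reasoning.pop("summary", None)
    if reasoning:
        cleaned["reasoning"] = reasoning
    else:
        cleaned.pop("reasoning", None)
    return cleaned

kwargs = {"model": "gpt-5.1",
          "reasoning": {"effort": "medium", "summary": "auto"},
          "input": "Solve for x: 2x + 5 = 13"}
print(strip_unverified_params(kwargs))
```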
One More Thing About Prompting
Standard GPT prompting advice doesn’t all apply here.
You don’t need “Let’s think step by step” or “Explain your reasoning.” The model’s already doing that – invisibly – when reasoning_effort is set. Extra meta-prompts just waste tokens.
You do want to be clear about the end goal. Think: reasoning models are senior coworkers. You say “Build a function that validates email addresses and handles edge cases,” not “First, import regex. Then, define a function. Then, write a pattern for…”
One gotcha: temperature, top_p, frequency_penalty are not supported on reasoning models. Pass them? You’ll get an error. The model controls its own sampling during the reasoning phase.
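If one call site serves both GPT-4-class and reasoning models, a defensive filter avoids that error. A minimal sketch, with two stated assumptions: the model-prefix list is mine (not an official registry), and only the three parameters named above are dropped.

```python
# Illustrative prefix list; maintain your own as models ship.
REASONING_MODEL_PREFIXES = ("o1", "o3", "o4", "gpt-5")
# Sampling params the article notes are rejected by reasoning models.
UNSUPPORTED = ("temperature", "top_p", "frequency_penalty")

def sanitize_for_reasoning(model: str, kwargs: dict) -> dict:
    """Drop sampling parameters that reasoning models reject.

    Defensive sketch for shared call sites; non-reasoning models
    pass through untouched.
    """
    if not model.startswith(REASONING_MODEL_PREFIXES):
        return kwargs
    return {k: v for k, v in kwargs.items() if k not in UNSUPPORTED}

print(sanitize_for_reasoning("o3-mini", {"temperature": 0.7, "max_completion_tokens": 500}))
print(sanitize_for_reasoning("gpt-4o", {"temperature": 0.7}))
```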
What Actually Changed (And Why This Matters Now)
Reasoning models flip the traditional LLM script. Instead of pouring all compute into pre-training and hoping the model memorized enough patterns, they allocate compute during inference – letting the model “think longer” when a problem is hard.
An arXiv tutorial on o1 reasoning frames this as moving from System 1 (fast, automatic) to System 2 (deliberate, analytical) thinking – the same distinction psychologists use for human cognition.
Practical result: o3-mini scores 77% on GPQA Diamond (PhD-level science questions) and hits 2073 Elo on Codeforces (competitive programming). Higher than o1 on several benchmarks, despite being smaller and cheaper.
But it only works if you set reasoning_effort. Otherwise? Just a regular LLM with a higher price tag.
| Model | Default Reasoning | Supported Levels | Note |
|---|---|---|---|
| o3-mini | medium (ChatGPT) | low, medium, high | Launched Jan 2025, free tier access |
| o3 / o4-mini | Varies | low, medium, high | Released Apr 2025, full tool access |
| GPT-5.1+ | none | none, minimal, low, medium, high, xhigh | Must explicitly enable reasoning |
| o1 / o1-mini | medium | low, medium, high (o1 only) | Legacy models, no API control initially |
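The table above can double as a fail-fast guard before a request ever leaves your process. The mapping below is transcribed from the table and is an assumption about current support, not an SDK feature; note the table's caveat that xhigh is limited to specific variants, so verify against the docs for your exact model string.

```python
# Effort levels per model family, transcribed from the table above.
# Verify against current docs: support varies by exact model variant.
SUPPORTED_EFFORT = {
    "o3-mini": {"low", "medium", "high"},
    "o3": {"low", "medium", "high"},
    "o4-mini": {"low", "medium", "high"},
    "gpt-5.1": {"none", "minimal", "low", "medium", "high", "xhigh"},
}

def validate_effort(model: str, effort: str) -> None:
    """Hypothetical pre-flight check: fail fast before the API does."""
    allowed = SUPPORTED_EFFORT.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model}")
    if effort not in allowed:
        raise ValueError(f"{model} does not support effort={effort!r}; "
                         f"choose from {sorted(allowed)}")

validate_effort("gpt-5.1", "none")      # passes silently
try:
    validate_effort("o3-mini", "xhigh") # raises: o3-mini tops out at high
except ValueError as e:
    print(e)
```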
Start Here
ChatGPT users: it’s already handled. Free users get o3-mini at medium effort. Paid users can toggle o3-mini-high in the model picker for tougher problems.
API users: add reasoning={"effort": "medium"} to your client.responses.create() call. Monitor reasoning_tokens in your usage stats for a week. Adjust up or down based on what you see.
Migrating from o1 to GPT-5? Don’t just swap model names. GPT-5.1+ defaults to none. You’ll lose reasoning unless you explicitly set the parameter.
One more thing. OpenAI’s best practice guide says: “Treat reasoning.effort as a tuning knob, not the primary way to recover quality.” Medium isn’t working? Fix your prompt first. Still not working? Bump to high.
Now go check if your reasoning models are actually reasoning.
FAQ
Does higher reasoning effort always give better answers?
Not always. High effort helps when the problem is complex and has multiple valid approaches. But for straightforward questions? Medium often performs just as well and finishes 3x faster. Seeing diminishing returns even at high? You've hit the ceiling of what reasoning can fix – look at your prompt or data quality instead.
Can I use reasoning effort with GPT-4 or older models?
No. reasoning_effort only works with models explicitly trained for test-time compute scaling: o1, o1-mini, o3, o3-mini, o4-mini, GPT-5+. Pass it to gpt-4 or gpt-4-turbo? You’ll get an error. Those models don’t have the internal chain-of-thought architecture that makes reasoning levels meaningful. One dev on Reddit tried forcing it via a custom wrapper – API rejected it with “unsupported parameter.”
Why does my response cut off mid-sentence even though I set max_tokens high?
You probably set max_tokens when you should’ve set max_completion_tokens. Common misconception: “completion tokens” means “visible output.” Nope. Reasoning models count hidden reasoning tokens toward the limit. Model spends 8,000 tokens on internal chain-of-thought, you capped it at 10,000 total? You get 2,000 tokens of visible output – sometimes not enough to finish the answer. Use max_completion_tokens (or max_output_tokens in the Responses API) to control the full budget, reasoning included. Had a client debugging this for 3 days before realizing the model was “thinking” perfectly fine – just running out of budget before it could write the conclusion.