GPT-5.4 Just Dropped and Everyone’s Wrong About What Matters

OpenAI's GPT-5.4 hit 83% on work tasks and 75% on desktop navigation, beating humans. But the rate-limit trap, the 272K surcharge, and the carwash question reveal what the benchmarks don't.

11 min read · Beginner

Here’s what you need to know about GPT-5.4: it scored 75% on desktop navigation tasks, beating the human expert baseline of 72.4%. First AI model to do that. Then someone asked it whether to walk or drive 100 meters to a carwash, and it wrote a careful, confident essay explaining why walking makes sense.

Wrong answer. You need the car at the carwash.

Claude got it in one sentence. Gemini called it a trick question. Every other frontier model nailed it. GPT-5.4 Thinking didn’t.

Both facts are true, and that’s the story of this release.

You’re paying for three models whether you know it or not

OpenAI released GPT-5.4 on March 5, 2026, and most tutorials treat it like a single model with a few variants. That’s not what you’re getting.

You’re getting three separate products with different behavior, different rate limits, and wildly different costs.

GPT-5.4 Thinking (ChatGPT interface): shows its reasoning upfront, lets you steer mid-response. Costs $20/month on Plus (80 messages per 3 hours) or $200/month on Pro (unlimited). This is what most people mean when they say “GPT-5.4.”

GPT-5.4 standard (API and Codex): no upfront reasoning plan, leaner responses. $2.50 per million input tokens, $15-20 per million output. Good for production systems where you control the prompt.

GPT-5.4 Pro (API and Pro subscription): deeper reasoning, 12x the API cost ($30 input / $180 output per million tokens). For problems where the standard model doesn’t nail it.

The weirdest part? The reasoning effort toggle in the API. Set it to none and GPT-5.4 responds instantly with no chain-of-thought. Set it to xhigh and it thinks for minutes before answering. Same model, same API call, completely different behavior.
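Here's a minimal sketch of how that toggle might look in a request body. The model name and the reasoning.effort field follow this article's description of the API, not a verified SDK signature; check the current API reference before copying.

```python
# Sketch of how the reasoning-effort toggle shapes a request payload.
# Field names follow the article's description ("reasoning.effort" with
# none/low/medium/high/xhigh); verify against the current API reference.

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a request body with an explicit reasoning effort level."""
    allowed = {"none", "low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "model": "gpt-5.4",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

fast = build_request("What's the capital of France?", effort="none")  # instant, no chain-of-thought
deep = build_request("Audit this contract for liability gaps.", effort="xhigh")  # thinks for minutes
```

Same helper, same endpoint; only the effort string changes, which is why comparisons that never mention it are apples to oranges.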

Most comparison articles pit “GPT-5.4” against Claude or Gemini. They never mention which version of GPT-5.4. Thinking at xhigh reasoning? Standard at medium? It matters more than the model name.

The 272K trap nobody warns you about

Every article celebrates the 1 million token context window. Whole codebases! Entire books! 750,000 words in one prompt!

Here’s what they don’t tell you.

According to OpenAI’s pricing docs, once your input crosses 272,000 tokens, two things happen: the input token rate doubles from $2.50 to $5.00 per million, and your request counts at 2x against usage limits.

So that “1M context window” isn’t a single pricing tier. It’s two tiers with a hidden threshold.

Worse: independent testing from OpenAIToolsHub found retrieval accuracy drops 15-20% for information placed between 800K and 1M tokens compared to content in the first 200K. The “lost in the middle” problem every long-context model has, just pushed further out.

The practical context limit is around 800K tokens, not 1M. And even if you stay under that, anything over 272K costs double and burns through your rate limits twice as fast.
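To see what the threshold does to your bill, here's a rough estimator based on the two-tier pricing above. One assumption worth flagging: it treats the doubled rate as applying to the whole request once input crosses 272K, which is how the surcharge reads here; confirm the exact proration against OpenAI's pricing docs.

```python
# Rough input-cost estimator for the two-tier pricing described above.
# Assumption: the doubled rate applies to the entire request once input
# crosses the 272K threshold (confirm against the pricing docs).

THRESHOLD = 272_000    # tokens
BASE_RATE = 2.50       # $ per million input tokens, at or under threshold
SURCHARGE_RATE = 5.00  # $ per million input tokens, over threshold

def input_cost(tokens: int) -> float:
    """Dollar cost of the input side of one request."""
    rate = SURCHARGE_RATE if tokens > THRESHOLD else BASE_RATE
    return tokens / 1_000_000 * rate

print(input_cost(200_000))  # 0.5: under the threshold, base rate
print(input_cost(500_000))  # 2.5: over the threshold, doubled rate
```

Notice the cliff: a 272,001-token request costs more than double a 272,000-token one, and it counts 2x against your usage limits on top of that.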

If you’re on ChatGPT Pro and uploading massive codebases, you’ll hit the invisible wall and wonder why your “unlimited” plan feels capped.

Set it up the way that actually works

If you’re on ChatGPT Plus or Pro, you already have access. Open the model picker at the top of the screen, select GPT-5.4 Thinking. Done.

If you don’t see it yet, the rollout is gradual – check back in a day.

The Auto setting is smarter than it looks. ChatGPT routes simple queries to GPT-5.3 Instant (fast, cheap) and complex ones to GPT-5.4 Thinking. You don’t pay the reasoning tax on “What’s the weather?” but you get the deep model when you ask it to debug a 500-line function.

Pro tip: Use the thinking-time toggle. Low effort = quick answers at GPT-5.2 speed. High effort = deep analysis that takes 20-30 seconds but catches edge cases the fast mode misses. Most tutorials skip this. It’s the most useful control you have.

For API users, the setup is a single model ID swap. Change gpt-5.2 to gpt-5.4 in your calls. But don’t assume it’s a drop-in replacement.

Two things to adjust:

  1. Reasoning effort defaults. The API parameter reasoning.effort controls how much compute the model spends thinking before responding. Options: none, low, medium, high, xhigh. Test your existing prompts at different levels – you might get the same quality at medium that you were paying xhigh for on GPT-5.2.
  2. Output token budgets. GPT-5.4 is 47% more token-efficient on complex tasks than GPT-5.2 (per OpenAI’s internal benchmarks). Lower your max_completion_tokens and save on output costs. Most prompts that needed 2000 tokens on GPT-5.2 finish in 1000-1200 on GPT-5.4.
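A cheap way to run that effort test is to sweep one representative prompt across all five levels and compare outputs side by side. The sketch below is client-agnostic: call_model is a placeholder for whatever function actually sends your API request.

```python
# Sweep one prompt across every reasoning-effort level and collect the
# responses, so you can see where quality plateaus. `call_model` is a
# placeholder; wire it to your real API client.

EFFORTS = ["none", "low", "medium", "high", "xhigh"]

def sweep(prompt: str, call_model) -> dict:
    """Return {effort: response} for side-by-side comparison."""
    return {effort: call_model(prompt, effort=effort) for effort in EFFORTS}

# Demo with a fake client that just echoes its settings:
results = sweep("Summarize this error log.", lambda p, effort: f"[{effort}] answer")
```

If medium and xhigh come back materially the same for your prompts, you've found free savings.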

One more thing: GPT-5.4 loves bullet lists. It defaults to structured formatting and will nest bullets even when you don’t want them. Add this to your system prompt if you’re using it for prose:

Never use nested bullets. Keep lists flat (single level). If you need hierarchy, split into separate sections.

That’s from OpenAI’s own prompt guidance docs. They know it’s a problem.
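If you're calling the API rather than the chat UI, the same rule goes in the system message. A minimal sketch, using the standard chat message layout with the instruction text quoted from above:

```python
# Attach the flat-list rule as a system message so it applies to every turn.
FLAT_LIST_RULE = (
    "Never use nested bullets. Keep lists flat (single level). "
    "If you need hierarchy, split into separate sections."
)

messages = [
    {"role": "system", "content": FLAT_LIST_RULE},
    {"role": "user", "content": "Outline a blog post about context windows."},
]
```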

Where it actually wins

The benchmarks everyone repeats – 83% on knowledge work, 75% on desktop tasks – are real. But they don’t tell you what the model is good at in practice.

After testing it for two weeks, here’s where GPT-5.4 genuinely beats the competition.

Spreadsheet modeling. On OpenAI’s internal investment banking benchmark, GPT-5.4 scored 87.3% versus 68.4% for GPT-5.2. It builds financial models, handles nested formulas, and catches errors in existing sheets better than any other model. If you’re doing quantitative work, this is the one.

Multi-file code navigation. The 1M context lets you drop an entire codebase into one prompt and ask “Where is the bug that’s causing this API timeout?” It won’t always find it, but it understands relationships across files better than chunking + RAG setups. Claude still writes better code, but GPT-5.4 reads code more accurately.

Tool orchestration. The tool search feature is quietly brilliant. When you give GPT-5.4 access to 30+ tools (MCP servers, APIs, custom functions), it doesn’t load all 30 definitions into the prompt upfront. It searches and pulls only what it needs. Testing on Scale’s benchmark: 47% fewer tokens, same accuracy. For agent workflows, this is a cost breakthrough.
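The idea is easiest to see in miniature. The sketch below is a conceptual analogy, not OpenAI's implementation: keep a registry of tool descriptions and surface only the best matches for the task, instead of shipping every definition with every request.

```python
# Conceptual analogy for tool search: rank registered tools by keyword
# overlap with the task and expose only the top matches. An illustration
# of the idea, not OpenAI's actual mechanism.

TOOLS = {
    "get_weather": "fetch current weather for a city",
    "run_sql": "execute a read only sql query",
    "send_email": "send an email to a contact",
}

def search_tools(query: str, limit: int = 2) -> list:
    """Return the tool names whose descriptions best match the query."""
    words = set(query.lower().split())
    ranked = sorted(
        TOOLS,
        key=lambda name: -len(words & set(TOOLS[name].split())),
    )
    return ranked[:limit]

print(search_tools("query the sales database with sql"))  # run_sql ranks first
```

Only the surfaced definitions go into the prompt, which is where the token savings come from.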

Desktop automation. Native computer use means GPT-5.4 can take screenshots, click buttons, fill forms, navigate UIs – without you writing Selenium scripts. It’s the first general-purpose model where this isn’t bolted on. OSWorld score: 75% (humans: 72.4%). The benchmark tests real desktop tasks: open an app, find a file, edit a document, close it. GPT-5.4 beat human experts.

That last one is the real headline, and most articles bury it.

Where it falls apart (and what that means)

The carwash question isn’t a fluke.

GPT-5.4 Thinking failed basic logic puzzles that Claude, Gemini, and every other frontier model solved. Not because it’s dumber – because it over-reasons. It writes thorough, confident, wrong answers.

Nate’s Newsletter ran a small eval suite in March 2026 comparing GPT-5.4 to Claude Opus 4.6 and Gemini 3.1 Pro. GPT-5.4 won on quantitative modeling, file processing, and competitive benchmark tasks. It lost on writing quality, product judgment, and what he called “the pipeline problem”: knowing when to stop thinking.

Another edge case from the wild: GitHub issue #13609 reports that GPT-5.4 burns through rate limits significantly faster than GPT-5.3-Codex, even on identical workloads with no agents running. Pro users are hitting usage caps on workflows that never triggered them before. OpenAI hasn’t commented.

Then there’s the free tier confusion. In mid-March, free users were locked out of GPT-5.4 and GPT-5.3-Codex in the Codex CLI with an error: The 'gpt-5.4' model is not supported when using Codex with a ChatGPT account. The official site still said “Try with Free.” Community forum threads asked whether it was a downgrade or a bug, and as of late March OpenAI hasn’t said whether free access was ever intentional.

What does this tell you?

GPT-5.4 is production-ready for tasks with clear success criteria (spreadsheets, API calls, structured data extraction). It’s not reliable for tasks that require judgment calls or knowing when “good enough” is better than “perfectly reasoned.”

The real competition isn’t what you think

Every tutorial compares GPT-5.4 to Claude Opus 4.6 and Gemini 3.1 Pro.

| Model | Strength | Weakness | Best for |
| --- | --- | --- | --- |
| GPT-5.4 | Knowledge work, spreadsheets, tool orchestration, 1M context | Over-reasons simple logic, rate limit burns, formatting quirks | Quantitative analysis, desktop automation, multi-tool agents |
| Claude Opus 4.6 | Coding quality, instruction following, multi-file refactoring | Smaller context (200K), higher API cost ($5/$25 per MTok) | Software engineering, complex codebases, agentic coding |
| Gemini 3.1 Pro | Multimodal (audio/video native), cheapest ($2/$8 per MTok), PhD-level science | Less mature tool ecosystem, fewer integrations | Research, multimedia workflows, budget-conscious production |

But the real competition is internal: GPT-5.4 Thinking at xhigh reasoning versus GPT-5.4 standard at medium reasoning. Same model, 5x the cost, 10x the latency, marginally better results.

Most teams should default to standard at medium. Use Thinking + xhigh only when the task genuinely needs it (contract analysis, multi-step debugging, research synthesis). Otherwise you’re burning budget on reasoning you don’t need.

What nobody talks about: access tiers

If you’re on ChatGPT Plus ($20/month), you get 80 GPT-5.4 messages per 3 hours. Sounds like a lot. It’s not.

One conversation with multiple file uploads, a few follow-ups, maybe some web search – you’ve burned through 15-20 messages. Do that twice in a morning and you’re rate-limited until the afternoon.

ChatGPT Pro ($200/month) is “unlimited” but has abuse guardrails. Community reports suggest Pro users doing heavy automation or bulk data processing hit soft caps, get throttled, then get a temporary restriction notice. OpenAI won’t publish the actual limits.

There’s no middle tier. Anthropic offers Claude Max 5x at $100/month. OpenAI has nothing between $20 and $200.

For API users, the Batch API is the hidden gem: $1.25 input / $5 output per million tokens, half the real-time price. If your workload can tolerate a few hours of latency, route it through batch. You get the same model, same quality, 50% off.
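Batch requests are submitted as JSONL, one request object per line. The sketch below builds a line in the shape OpenAI's Batch API documents (custom_id, method, url, body); the gpt-5.4 model name follows this article, so verify both against the current docs.

```python
import json

# Build one JSONL line in the Batch API's documented shape:
# custom_id + method + url + body. Verify field names against the
# current Batch API reference before submitting.

def batch_line(custom_id: str, prompt: str) -> str:
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5.4",
            "messages": [{"role": "user", "content": prompt}],
        },
    })

prompts = ["Classify ticket #1", "Classify ticket #2"]
jsonl = "\n".join(batch_line(f"req-{i}", p) for i, p in enumerate(prompts))
```

Upload the file, create a batch with a 24-hour completion window, and collect results when it finishes: same model, same quality, half the price.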

How to decide if it’s worth switching

If you’re currently using GPT-5.2 or GPT-5.3-Codex: yes, switch. GPT-5.4 is faster, cheaper per useful output token, and handles more complex tasks. The migration is mostly a model ID swap.

If you’re on Claude Opus 4.6: switch only if you need the 1M context, tool search, or native computer use. Claude still codes better, but it costs more per token ($5 input vs GPT-5.4’s $2.50, and $25 output vs GPT-5.4’s $15-20).

If you’re on Gemini 3.1 Pro: switch if you need ChatGPT’s ecosystem (Custom GPTs, DALL-E integration, web browsing). Stay if you need multimodal inputs (audio/video) or want the lowest API cost.

If you’re on ChatGPT Free or Go: you don’t have reliable access to GPT-5.4 yet (as of late March 2026). The model is advertised as available, but community reports say it’s blocked in Codex for free-tier accounts. Test before you rely on it.

Start here if you’re setting it up today

Open ChatGPT. Select GPT-5.4 Thinking from the model picker. Set thinking time to Medium. Ask it to build a financial model or debug a multi-file codebase – something you’d normally do in Excel or an IDE.

Then ask it the carwash question: “You’re at home and your car is in the driveway 100 meters away. You need to wash the car at a carwash. Should you walk to the car or drive to the car?”

If it writes an essay about walking, you’ve just learned the most important thing about GPT-5.4: it’s brilliant at structured tasks and confidently wrong at simple judgment calls.

Use it for the first thing. Don’t use it for the second.

Is GPT-5.4 actually better than GPT-5.2?

Yes, for professional work. 33% fewer factual errors, 47% more token-efficient, better at spreadsheets and tool use. But it over-reasons simple questions and burns rate limits faster. If you’re doing knowledge work or quantitative analysis, upgrade. If you’re using it for casual chat, stick with GPT-5.3 Instant.

What’s the real context limit I can trust?

About 800K tokens. The 1M window is real, but retrieval accuracy drops 15-20% for content past 800K. Plus, anything over 272K doubles your input cost and counts 2x against rate limits. For production use, design around 272K as the practical ceiling unless you’re willing to pay the surcharge.
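If you want to enforce that ceiling mechanically, a crude pre-flight guard helps. The 4-characters-per-token ratio below is a rough English-text heuristic, not exact; use a real tokenizer (tiktoken, for instance) in production.

```python
# Pre-flight guard: trim a prompt so its estimated token count stays at
# or under the 272K surcharge threshold. The 4-chars-per-token ratio is
# a rough heuristic; swap in a real tokenizer for accurate counts.

THRESHOLD_TOKENS = 272_000
CHARS_PER_TOKEN = 4  # rough average for English text

def trim_to_threshold(text: str) -> str:
    """Truncate text whose estimated token count exceeds the threshold."""
    max_chars = THRESHOLD_TOKENS * CHARS_PER_TOKEN
    return text if len(text) <= max_chars else text[:max_chars]
```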

Why does GPT-5.4 use up my rate limits so much faster?

Nobody knows. GitHub issue #13609 reports Pro users hitting caps with single-instance workloads that never capped on GPT-5.3-Codex. OpenAI hasn’t explained it. Current theory: the reasoning tokens (internal chain-of-thought) count toward usage even though you don’t see them in the output. Test with reasoning.effort set to low and see if it helps.