The goal: a ChatGPT response that’s varied enough to feel intelligent, but stable enough that you can build something on top of it. A legal summary that doesn’t invent clauses. A brainstorm that doesn’t loop on the same three ideas. A code snippet that compiles the second time you run the prompt.
That stability lives inside two sliders most people never touch: temperature and top_p. This guide walks backwards from that tuned output – through the math, the defaults, the lies the docs tell, and the edge cases that will burn you in production.
The scenario: you’re already past prompt engineering
You’ve written a prompt that mostly works. Mostly. Roughly one call in five drifts – too verbose, too florid, or a JSON field reshuffled in a way that breaks your parser. You’re not going to fix this by adding another “be concise” to the system message.
This is the moment ChatGPT temperature and top_p settings start to matter. They’re not creativity dials. They’re the gates that decide which tokens the model is allowed to consider at each step of generation – and how that probability mass gets reshaped before sampling.
What each parameter actually does (skip if you know softmax)
Here’s the core of temperature: it’s a scalar divisor applied to the model’s raw logits before softmax. The formula is softmax(x / T). Turn T down and the distribution sharpens – the single most likely token pulls away from the field. Turn T up and the curve flattens out, giving long-shot tokens a real shot at being picked. At T=2, those long shots include tokens that have no business appearing in a coherent sentence – more on that in the limitations section.
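To make the softmax(x / T) formula concrete, here is a minimal sketch in plain Python. The three logits are made-up values for illustration, and the function name is hypothetical; it is not how any particular model server implements sampling.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # T=0 cannot be divided by; servers typically treat it as greedy argmax.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it should show the top token's share shrinking as T rises, which is the flattening described above.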
Top_p works differently, and the distinction matters. Rather than reshaping the whole distribution, it draws a cutoff line. The model ranks tokens by probability from highest to lowest, accumulates their probabilities, and stops as soon as the running total reaches the threshold – say, 0.9. Only the tokens inside that cutoff are in play; everything below the line gets zeroed out, and the survivors are renormalized before sampling. The technique comes from Holtzman et al., “The Curious Case of Neural Text Degeneration” (ICLR 2020), whose argument was that sampling from a dynamic nucleus – rather than the full vocabulary – kills the weird tail of tokens that high temperature alone keeps inviting in, yielding diversity without sacrificing coherence. One edge case worth understanding: if the top token already sits at 0.95 probability, then top_p=0.9 means only that one token survives the cut. So top_p=0.9 doesn’t mean “the top 90% of options” – it means the smallest set of tokens covering 90% of the probability mass, and sometimes that set is one token, full stop.
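The cutoff rule is easier to see on toy numbers. The sketch below is one straightforward reading of nucleus filtering, with made-up token probabilities and a hypothetical function name; production implementations work on tensors and may handle ties and thresholds slightly differently.

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize the survivors.

    `probs` maps token -> probability.
    """
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up distributions: one flat, one sharply peaked.
flat = {"cat": 0.5, "dog": 0.25, "fox": 0.125, "emu": 0.125}
peaked = {"the": 0.95, "a": 0.03, "an": 0.02}

print(nucleus_filter(flat, 0.9))    # all four tokens are needed to reach 0.9
print(nucleus_filter(peaked, 0.9))  # only "the" survives
```

The same top_p=0.9 keeps four candidates in one case and one in the other, which is why the setting is better read as a probability-mass budget than as a count of options.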
Setting them up (and where you actually can)
First, the unglamorous truth: as of mid-2025, the ChatGPT app at chatgpt.com (formerly chat.openai.com) does not expose these settings. No slider, no menu. You need the OpenAI Playground or the API.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this contract clause in one sentence."}],
    temperature=0.2,  # tight, factual
    top_p=1.0,  # leave nucleus open
)
print(resp.choices[0].message.content)
```
Two things to notice. First, I’m tuning one parameter and leaving the other at its neutral default – OpenAI’s own documentation recommends adjusting either temperature or top_p, not both simultaneously. Second, the model matters. See the next section before you copy this verbatim.
The model trap: GPT-5 and o-series reject your settings outright
If you send a temperature value to a reasoning model, you don’t get a slightly different answer. You get a 400. The exact error from the API (as reported in the OpenAI developer community, August 2025): “Unsupported value: ‘temperature’ does not support 0.2 with this model. Only the default (1) value is supported.” Same story for o1, o3, and the rest of the reasoning lineup – per OpenAI’s reasoning guide, as of late 2025, temperature, top_p and n are fixed at 1, while presence_penalty and frequency_penalty are fixed at 0.
So any code path that hands a prompt to GPT-5 with your carefully tuned temperature=0.3 needs to either strip the parameter or branch on model name. There’s no graceful degradation built in.
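One way to branch on model name is a small request builder that drops the locked parameters before the call. This is a sketch under assumptions: the prefix list and the helper name are hypothetical and hand-maintained, so check them against OpenAI's current model documentation before relying on them.

```python
# Assumption: a hand-maintained prefix list, not an official registry.
REASONING_MODEL_PREFIXES = ("o1", "o3", "gpt-5")

# Parameters the text above describes as fixed on reasoning models.
LOCKED_PARAMS = ("temperature", "top_p", "n", "presence_penalty", "frequency_penalty")

def build_request(model, messages, **sampling):
    """Return keyword arguments for a chat completion call, omitting
    sampling parameters that reasoning models are reported to reject."""
    request = {"model": model, "messages": messages}
    is_reasoning = model.startswith(REASONING_MODEL_PREFIXES)
    for name, value in sampling.items():
        if is_reasoning and name in LOCKED_PARAMS:
            continue  # strip rather than send and get a 400
        request[name] = value
    return request

messages = [{"role": "user", "content": "Summarize this contract clause in one sentence."}]
print(build_request("gpt-4o-mini", messages, temperature=0.3))
print(build_request("gpt-5", messages, temperature=0.3))
```

The returned dict can be unpacked into the call from the earlier example, for instance `client.chat.completions.create(**request)`. Stripping silently is a design choice; logging the dropped parameters may be preferable if a tuned value disappearing would surprise someone.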
The determinism myth
The most repeated piece of advice in every other tutorial: “set temperature=0 if you need deterministic output.” It’s wrong. Or at best, it’s a 2022-era half-truth that everyone copy-pasted.
Pro tip: If you need byte-for-byte reproducibility – for evals, regression tests, or audit logs – log the raw output hash on every call. Don’t trust that temperature=0 + seed=42 gives you the same string twice. On large hosted models it often doesn’t.
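Hash logging can be as small as the sketch below. The function names are hypothetical, and the sample outputs are invented strings standing in for real responses.

```python
import hashlib
from collections import Counter

def output_fingerprint(text):
    """SHA-256 hex digest of the raw response text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def summarize_runs(outputs):
    """Count how many distinct outputs a batch of supposedly identical calls produced."""
    counts = Counter(output_fingerprint(o) for o in outputs)
    return {
        "runs": len(outputs),
        "distinct": len(counts),
        "identical": len(counts) == 1,
    }

# Invented outputs: two identical, one differing by a trailing period.
outputs = ["The clause limits liability.", "The clause limits liability.", "The clause limits liability"]
print(summarize_runs(outputs))
```

Storing the digest next to the request parameters gives you something to diff later without keeping every full response in the hot path.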
The number that should kill this myth for good: 12.5%. That’s how often a 120-billion-parameter model produced identical outputs across ten deterministic runs, according to a 2025 arXiv study quantifying nondeterministic drift in large language models. For comparison, 7-8B models were fully consistent under the same settings. The gap isn’t a bug – it’s a structural property of large models running across many GPUs.
The causes are mundane and unfixable from your side: floating-point non-associativity across GPUs, batch-dependent routing in Mixture-of-Experts architectures, and the fact that even at T=0, ties between equally likely tokens still need to be broken somehow. The bigger the model, the more ties.
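Floating-point non-associativity is easy to reproduce on a laptop. The sketch below is a toy: the "logit" values and the strict-greater-than pick rule are assumptions chosen to show the effect, not a model of any real inference stack.

```python
# Three hypothetical partial sums that add up to one token's logit.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c  # one possible reduction order
right_to_left = a + (b + c)  # another order, mathematically identical

def greedy_pick(candidate_logit, rival_logit=0.6):
    """Pick 'yes' only if its logit strictly beats the rival's."""
    return "yes" if candidate_logit > rival_logit else "no"

print(left_to_right == right_to_left)  # False
print(greedy_pick(left_to_right), greedy_pick(right_to_left))  # yes no
```

The two sums differ only in the last bit, yet that is enough to flip a greedy choice when two candidates are nearly tied. Once one token flips, every later token is conditioned on a different prefix.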
The advice nobody gives you: top_p might matter more
Every tutorial leads with “tune temperature first.” There is now empirical evidence pushing the opposite direction, at least for code. A 2025 study testing ChatGPT on 548 Java method generations found temperature had a marginal impact compared to the more prominent influence of top_p on output correctness – the opposite of the standard ranking.
That doesn’t mean temperature is useless. It means the standard ranking – temperature is the main knob, top_p is the backup – is probably backwards for structured-output tasks. If you’ve been sweeping temperatures and seeing nothing change, try fixing temperature at 1.0 and sweeping top_p between 0.5 and 0.95 instead.
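A sweep like that is just a list of parameter sets with temperature pinned. The helper below is a hypothetical sketch that only builds the grid; the step size of 0.05 is an assumption, and each resulting dict would be passed to whatever call you already make.

```python
def top_p_sweep(start=0.5, stop=0.95, step=0.05, temperature=1.0):
    """Build sampling configs with temperature fixed and top_p stepping
    from start to stop inclusive."""
    steps = round((stop - start) / step)
    return [
        {"temperature": temperature, "top_p": round(start + i * step, 2)}
        for i in range(steps + 1)
    ]

for config in top_p_sweep():
    print(config)
```

Rounding top_p to two decimals keeps the logged values readable and avoids artifacts like 0.6000000000000001 showing up in result tables.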
A starting matrix that won’t waste your tokens
| Task | Temperature | Top_p | Why |
|---|---|---|---|
| Structured extraction (JSON, SQL) | 0.0-0.2 | 1.0 | Sharpen distribution, leave nucleus open so rare-but-correct tokens stay reachable |
| Summarization with citations | 0.3 | 1.0 | Slight variation; low temperature keeps unlikely tail tokens rare |
| Code generation | 1.0 | 0.6-0.8 | Per the 548-method study – top_p does the real work here |
| Marketing copy / brainstorm | 0.9-1.1 | 1.0 | Want surprise but not gibberish |
| Pure ideation, no quality bar | 1.2+ | 0.95 | Cap the tail or you’ll get nonsense words |
These are starting points, not gospel. Tune one variable, hold the other steady, and measure on at least 20 outputs per cell – single-shot evaluation is meaningless given the determinism issue above.
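Measuring a cell means turning 20 outputs into one number. For the structured-extraction row, a simple metric is the share of outputs that parse as JSON. The sketch below assumes that metric and uses invented outputs; swap in whatever check matches your task, such as "compiles" or "passes the schema".

```python
import json

def json_valid_rate(outputs):
    """Fraction of outputs that parse as JSON."""
    if not outputs:
        raise ValueError("need at least one output to score")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # count as a failure and keep going
    return valid / len(outputs)

# Invented sample: three well-formed outputs and one truncated one.
sample = ['{"party": "Acme"}', '{"party": "Acme"}', '{"party": "Acme"', '{"party": "Beta"}']
print(json_valid_rate(sample))  # 0.75
```

Comparing this rate across cells, with one parameter varied at a time, is what makes the matrix above testable rather than folklore.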
Honest limitations
Before you ship anything that depends on these settings:
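- Streaming may change behavior. Some community reports claim that calls with stream=True produce different outputs than non-streaming calls with identical parameters. Temperature is applied per token in both modes, so any difference is more plausibly the run-to-run nondeterminism described above than a streaming-specific sampling rule. Test both modes if you switch between them in production.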
- Temperature=2 is mostly a curiosity. At the top of the range (0-2, as of mid-2025 per the OpenAI API docs), outputs frequently break down – learnprompting.org documents examples like “Start a sponge-ball baseball home run contest near Becksmith Stein Man Beach.” The usable range is roughly 0-1.3 for most models.
- Top_p has counterintuitive behavior on peaked distributions. If the single most likely token already has a probability of 0.95, then top_p=0.9 leaves only that token in the nucleus. So top_p=0.9 doesn’t always leave the model a real choice – sometimes exactly one token survives, and sampling becomes greedy for that step.
And one philosophical limitation worth sitting with: these parameters control sampling, not reasoning. A wrong answer at temperature 0.0 will be a wrong answer at temperature 1.5 – just phrased differently. If your output is bad in kind, no slider will save you. Fix the prompt first.
FAQ
Can I change temperature in the regular ChatGPT app?
No. The consumer ChatGPT interface at chatgpt.com (formerly chat.openai.com) doesn’t expose temperature or top_p. You need the Playground or the API.
Should I set temperature=0 and top_p=0 together for the most deterministic output possible?
Setting both to extreme values doesn’t help and can actively confuse things. Pick one to tune and leave the other at its neutral default (temperature=1 if you’re using top_p, top_p=1 if you’re using temperature). Stacking both at zero also won’t give you true determinism – that’s a stack-level property of how the model is served, not something you control from the API. If reproducibility actually matters, log outputs and compare, don’t assume.
Why does temperature work on GPT-4o but throw an error on GPT-5?
GPT-5 and the o-series are reasoning models, and OpenAI locks their sampling parameters to fixed values, presumably so the reasoning process behaves consistently. The fix is to detect the model server-side and strip the temperature parameter before the call when targeting a reasoning model. Sending 1.0 explicitly is not a safe workaround either – the error above implies the default value is accepted, but some endpoints reportedly reject the field’s presence entirely, so omitting it is the safer path.
Next step: open the Playground, pick one prompt you actually care about, and run it 10 times at temperature 0.2 and 10 times at temperature 0.8 with top_p locked at 1.0. Read all 20 outputs. The intuition you build from that single hour beats anything you can learn from reading more articles.
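If you would rather script that experiment than click through it, a small runner is enough. The sketch below takes a `generate` callable as a parameter so it stays independent of any client library; the function names and the stub are hypothetical, and in real use `generate` would wrap your own API call.

```python
def run_experiment(generate, temperatures=(0.2, 0.8), runs=10, top_p=1.0):
    """Call generate(temperature=..., top_p=...) `runs` times per temperature
    and return the outputs grouped by temperature."""
    results = {}
    for temperature in temperatures:
        results[temperature] = [
            generate(temperature=temperature, top_p=top_p) for _ in range(runs)
        ]
    return results

def distinct_counts(results):
    """Number of distinct outputs observed at each temperature."""
    return {temperature: len(set(outputs)) for temperature, outputs in results.items()}

# Stub generator for a dry run; replace with a function that calls your model.
def fake_generate(temperature, top_p):
    return f"output at T={temperature}, top_p={top_p}"

results = run_experiment(fake_generate)
print(distinct_counts(results))
```

The distinct count is a crude first signal. Reading the 20 outputs is still the point of the exercise.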