If you’ve been paying for ChatGPT Plus and wondering whether you actually need to, GPT-OSS is the open-weight GPT alternative that finally makes the question worth asking. OpenAI released it on August 5, 2025 – their first open-weight model since GPT-2, licensed under Apache 2.0, and small enough that the 20B variant runs on a laptop with 16GB of memory.
The catch nobody warns you about: it works on the first try only if you understand the harmony response format. Skip that and you’ll get junk JSON, broken tool calls, and a model that fails the kinds of puzzles a 7B Qwen would solve in its sleep. This guide gets you running on the latest stack and points out the exact traps that cost me an afternoon.
What you’re actually installing (and what “open” means here)
There are two models. The official repo ships gpt-oss-20b (21B parameters, MoE with 32 experts) and gpt-oss-120b (117B parameters, 128 experts). Both are mixture-of-experts and use a 4-bit quantization scheme (MXFP4), enabling fast inference while keeping resource usage low. Context window: up to 128k tokens. Reasoning effort is configurable – low, medium, or high.
One thing competitor tutorials skip: gpt-oss is an open-weight model, not a fully open-source model. The trained weights are released so you can run inference locally, but the training data and the recipe are not provided, meaning it is not fully reproducible. If reproducibility matters to your compliance team, write that down now.
Does that distinction matter in practice? For most local deployments, no. But if you’re building a product where auditability of the full training pipeline is a legal requirement, you’re still dependent on OpenAI’s choices – you just don’t have to pay per token. Worth sitting with that before you architect anything around it.
Hardware requirements that actually reflect reality
The official numbers from OpenAI: gpt-oss-120b runs on a single 80GB GPU, and gpt-oss-20b runs on edge devices with 16GB of memory. Those are floors, not comfortable specs.
| Setup | Minimum | What actually works |
|---|---|---|
| gpt-oss-20b on laptop | 16GB unified memory / VRAM | 32GB if you keep a browser open (per community Windows 11 reports, 2025) |
| gpt-oss-120b on workstation | 1× 80GB GPU (H100) | H100/H200 or 2× 48GB GPUs via accelerate |
| Reference torch implementation | A non-optimized PyTorch implementation for educational purposes only. Requires at least 4× H100 GPUs due to lack of optimization (per openai/gpt-oss README). | |
RAM is where the spec sheet lies. For gpt-oss-20b, 8GB is the absolute minimum, 16GB is a real improvement – but the moment Chrome is open alongside it, you’ll feel it. Community testing on Windows 11 (2025) puts 32GB as the threshold where multitasking stops being painful. Translation: 16GB is the spec sheet number, 32GB is the “I can actually work while it runs” number.
Install with Ollama (the path that won’t fight you)
You have four serious options: Ollama, LM Studio, Hugging Face Transformers, and vLLM. For getting a working local instance in under ten minutes, Ollama wins. The rest are better for production servers or research notebooks.
Step 1 – Install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download OllamaSetup.exe from ollama.com
# Verify
ollama --version
Step 2 – Pull the model
For production and high reasoning tasks use ollama pull gpt-oss:120b; for lower latency and local deployment, start with ollama pull gpt-oss:20b (approximately 11GB download). Start with 20b unless you have a real GPU.
ollama pull gpt-oss:20b
ollama run gpt-oss:20b # drops you into a chat
Step 3 – Verify it’s actually working
Don’t trust the chat prompt. Hit the API directly so you know the server is up:
curl http://localhost:11434/api/generate -d '{
"model": "gpt-oss:20b",
"prompt": "Reasoning: highnnReturn the number 42 and nothing else.",
"stream": false
}'
If you get a JSON response containing “42”, you’re done. If you get an HTML error page, Ollama isn’t bound to 11434 – restart the service.
The harmony format trap that breaks production
Every tutorial skips this. If you use Transformers’ chat template, it applies the harmony response format automatically. If you call model.generate directly, you need to apply harmony manually via the chat template or the openai-harmony package – otherwise the output is malformed. Ollama handles it in chat mode, but the moment you try structured output, things break.
Before you wire GPT-OSS into a LangChain pipeline, test JSON output with a trivial schema first. gpt-oss:20b frequently produces extra commentary or incomplete JSON objects instead of schema-compliant output. LangChain and the OpenAI SDK will throw parsing errors. The fix isn’t in your code – it’s in how you prompt around harmony’s reasoning channel.
The root cause: harmony format introduces reasoning traces even when not requested, complicating schema parsing compared to other models such as Qwen3 (glukhov.org, October 2025). The model returns its chain-of-thought in an analysis channel and the final answer in a final channel. Naive parsers grab everything and choke.
A documented failure mode worth knowing: Ollama issue #11800 reports that gpt-oss returns invalid JSON in tool calls, Ollama fails to parse it, throws a 500 error to the client – and the client retries the same message in a loop. Watch your token bill if you’re proxying through a paid provider.
The reasoning_effort knob nobody mentions
Default Ollama prompts use medium reasoning. That’s why early reviewers called the model dumb.
Running the classic river-crossing puzzle twice: with default settings GPT-OSS stranded the goat. With Reasoning: high as the first line of the system prompt, it planned a flawless seven-move solution in under six seconds on an M3 laptop (binaryverseai testing, 2025). The difference is stark. Add Reasoning: high for math, code review, and multi-step logic. For chat-style replies, leave it on medium – you don’t want to burn tokens on internal monologue for a “summarize this email” task.
Per the openai/gpt-oss README: recommended sampling is temperature=1.0 and top_p=1.0. Yes, temperature 1.0 – counterintuitive, but it’s what the model was tuned for.
Common errors and fixes
- “out of memory” on a 16GB Mac – Close Chrome. The 20B model needs most of that 16GB, leaving little for anything else. macOS will swap aggressively and inference drops to one token every few seconds.
- Ollama 500 errors on tool calls – See issue #11800. Workaround: skip Ollama’s tool API for now; format tools manually in the prompt and parse the response yourself.
- JSON parsing fails downstream – Strip the
analysischannel before passing to your parser. Or use vLLM with the responses_api server, which handles harmony channels natively. - “model requires more memory than available” on 120B – You’re trying to run F16 instead of MXFP4. Ollama pulls the quantized version by default; if you grabbed GGUFs manually, check the suffix.
- Garbled output / repeated tokens – Almost always means you bypassed the chat template and didn’t apply harmony format. Re-route through the official template.
Upgrade and uninstall
ollama pull gpt-oss:20b # re-pulls if a newer manifest exists
ollama list # confirm digest changed
For Hugging Face Transformers users: GPT-OSS integration landed in version 4.55.0 (as of late 2025) – pin to transformers>=4.55 or you’ll get cryptic tokenizer errors.
ollama rm gpt-oss:20b
ollama rm gpt-oss:120b
# Then remove Ollama itself:
# macOS: drag Ollama.app to Trash, then rm -rf ~/.ollama
# Linux: sudo systemctl stop ollama && sudo rm /usr/local/bin/ollama && rm -rf ~/.ollama
The ~/.ollama directory holds your model weights. The 20b model alone is roughly 11GB; the 120b is substantially larger. Worth reclaiming if you’re done experimenting.
FAQ
Is GPT-OSS really free to use commercially?
Yes. Apache 2.0, no asterisks on commercial use. The model card is published as arXiv:2508.10925 if your legal team wants a citation.
How does it compare to closed OpenAI models?
The 120B hits near-parity with o4-mini on core reasoning tasks. The 20B is in o3-mini territory. Both figures come from OpenAI’s own benchmarks, so take them as an optimistic ceiling. In practice: strong enough for code review, data analysis, and agentic workflows. Where it falls short is open-ended creative writing and trivia – GPT-4o is noticeably better there. Pick GPT-OSS when ownership, privacy, or cost trumps the last 10% of quality.
Can I run gpt-oss-120b on two 48GB GPUs instead of one 80GB?
Yes – use Transformers with device_map="auto" or vLLM with tensor_parallel_size=2. Slightly worse throughput than a single H100 due to inter-GPU communication overhead, but it works and it’s cheaper – sometimes half the cost of an 80GB H100 rental.
Next step: open a terminal, run ollama pull gpt-oss:20b, and send the curl command from Step 3. If you get “42” back, you have a private GPT running on your machine. Then add Reasoning: high to your next real prompt and see how much smarter it gets.