Deploy Qwen 3 Locally: Install Guide for the Alibaba LLM

Install Qwen 3 locally with Ollama or vLLM. Real VRAM numbers, the CUDA 12.9 gotcha, and fixes for the OOM error every guide skips.

Almost everyone asks the same question before installing Qwen 3: which size will actually fit on my GPU, and what do I have to install to make it run? This guide answers that with real VRAM numbers, exact commands, and the specific errors you’ll hit if you copy-paste from the README without checking your driver version.

Qwen 3 is Alibaba Cloud’s open-weight LLM family, announced April 29, 2025. The lineup has since grown: the Qwen3-2507 series – including Qwen3-235B-A22B-Instruct-2507, Qwen3-30B-A3B Instruct/Thinking variants, and Qwen3-4B variants – landed by August 2025, with 256K native context extendable to 1M tokens (per the official GitHub README). Everything ships under Apache 2.0, so commercial use is fine.

Pick the right Qwen 3 size for your hardware

Before installing anything, match the model to your GPU. Qwen3 ships in six dense sizes – 0.6B, 1.7B, 4B, 8B, 14B, 32B – plus two MoE models, Qwen3-30B-A3B and Qwen3-235B-A22B. The VRAM numbers below come from Spheron’s real deployment benchmarks (as of mid-2025), not the model card marketing copy.

Model              Quantization   Runtime VRAM          Recommended GPU
Qwen3-8B           FP8            ~9.2-9.6 GB           RTX 4090 (24 GB)
Qwen3-14B          FP16           see Spheron guide     L40S (48 GB)
Qwen3-32B          FP8            ~37-38 GB             H100 (80 GB)
Qwen3-30B-A3B      FP8            see Spheron guide     Single H100
Qwen3-235B-A22B    FP8            235B total in VRAM    8× H100 cluster

Source: Spheron Network’s vLLM deployment benchmarks (RTX 4090 hits ~9.2-9.6 GB at runtime with framework overhead; H100 at ~37-38 GB for Qwen3-32B FP8). For the 14B and 30B-A3B rows, check the Spheron guide directly – exact figures depend on your framework version and batch size.

One thing the marketing copy doesn’t spell out: on the 235B-A22B MoE, “22B active” means 22B parameters fire per forward pass – the full 235B still has to sit in VRAM. You plan hardware around the total parameter count, not the active one. The “active” number tells you about compute cost, not memory cost. It’s a meaningful distinction that gets quietly skipped in most write-ups.
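To put numbers on that distinction, here is a minimal sketch of the weight-only arithmetic. It assumes 1 byte per parameter at FP8 and 2 bytes at FP16, and it ignores KV cache, activations, and framework overhead, so treat the result as a floor rather than a sizing guarantee. The function name is made up for this example.

```python
# Rough weight-memory estimate: every parameter must be resident in VRAM,
# whether or not it is active on a given forward pass.
# Assumption: 2 bytes/param at FP16, 1 byte/param at FP8; weights only.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in decimal GB (billions of params x bytes each)."""
    return params_billion * BYTES_PER_PARAM[dtype]


total_gb = weight_memory_gb(235, "fp8")   # what actually has to fit
active_gb = weight_memory_gb(22, "fp8")   # what the "22B active" label suggests
print(f"total: {total_gb:.0f} GB, active only: {active_gb:.0f} GB")
```

The gap between the two numbers is why the 235B row in the table lands on a multi-GPU cluster rather than a single card.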

Install path 1: vLLM (the production path)

Most teams serving Qwen 3 in production use vLLM – it handles batching properly and exposes an OpenAI-compatible API out of the box. But there’s a catch before you even run pip install: as of mid-2025, vLLM 0.17.x pre-built wheels require CUDA 12.9 (based on PyTorch 2.10.0). Older CUDA drivers force a source build, which takes 20-30 minutes and isn’t what anyone wants on a first install. Check first:

nvidia-smi # confirm CUDA 12.9 driver
python --version # 3.10 or 3.11

# Install vLLM
pip install -U vllm

# Qwen3 requires transformers >=4.51.0 (as of mid-2025)
pip install -U "transformers>=4.51.0"

# Serve Qwen3-8B with FP8 on a single GPU
vllm serve Qwen/Qwen3-8B \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --port 8000

That last command exposes an OpenAI-compatible endpoint at http://localhost:8000/v1. The transformers version pin matters – Qwen3 model loading fails on older versions. If you’re on an A100, skip the --quantization fp8 flag (the error section below explains why).
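The driver and Python checks from the start of this section can be scripted so they run before every install or upgrade. The sketch below is one possible way to do it, not part of vLLM: the helper names are hypothetical, the 12.9 minimum is taken from the wheel requirement quoted above, and the sample string is a hand-written imitation of the nvidia-smi header rather than captured output.

```python
import re
import sys

MIN_CUDA = (12, 9)  # assumption: the wheel requirement quoted in this guide


def parse_cuda_version(smi_output: str):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_ok(smi_output: str, minimum=MIN_CUDA) -> bool:
    """True if a CUDA version was found and it is at least `minimum`."""
    version = parse_cuda_version(smi_output)
    return version is not None and version >= minimum


def python_ok(version_info=sys.version_info) -> bool:
    """True for the Python versions this guide lists (3.10 or 3.11)."""
    return (version_info[0], version_info[1]) in {(3, 10), (3, 11)}


# Hand-written sample header line, for illustration only.
sample = "| NVIDIA-SMI 550.54.15   Driver Version: 550.54.15   CUDA Version: 12.4 |"
print(parse_cuda_version(sample), cuda_ok(sample))
```

On a real machine, pass in `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the sample string.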

Install path 2: Ollama (the 5-minute path)

Use Ollama if you want something running in five minutes. As of mid-2025, v0.9.0 or higher is recommended. Start the service, pull a model, done:

# Linux/macOS install
curl -fsSL https://ollama.com/install.sh | sh
ollama --version # verify

# Run Qwen3 (downloads on first call)
ollama run qwen3:8b

# Inside the session:
# /set think (enable thinking mode for 2504 models)
# /set nothink (disable)
# /bye (exit)

Type /set parameter num_ctx 40960 inside the prompt to push the context window past the default. Ollama also exposes an OpenAI-compatible API on localhost:11434/v1 – useful if you’re plugging Qwen 3 into an existing app that already speaks that protocol.
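If you are calling Ollama from code rather than the interactive prompt, the context window can be set per request instead. The sketch below uses Ollama’s native /api/chat endpoint with an options.num_ctx field; the helper names are invented for this example, and the 40960 value simply mirrors the one above. It only builds and prints the payload, so it runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local port


def build_chat_request(prompt: str, model: str = "qwen3:8b", num_ctx: int = 40960) -> dict:
    """Build a non-streaming chat payload with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }


def send(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


payload = build_chat_request("Reply with the word OK only.")
print(json.dumps(payload, indent=2))
```

With the service running, `send(payload)["message"]["content"]` should hold the reply text.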

Verify the install works

Don’t trust the boot logs. Actually fire a request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Reply with the word OK only."}],
    "max_tokens": 10
  }'

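A JSON response whose content is “OK” means you’re live. If the reply instead opens with a <think> block and gets cut off, thinking mode has eaten the 10-token budget: raise max_tokens or append /no_think to the prompt. A 500 means it’s time to scroll the vLLM logs; the error is almost always one of the three below.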
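To automate that check, the response can be inspected in a few lines. This is a sketch that assumes the standard OpenAI chat-completions response shape and that any thinking text arrives inline between <think> tags; whether the tags show up in content at all depends on how the server is configured. The sample response is hand-written, not captured from a real server.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_reply(response: dict) -> str:
    """Return the assistant text with any inline <think>...</think> block removed."""
    content = response["choices"][0]["message"]["content"] or ""
    return THINK_BLOCK.sub("", content).strip()


def smoke_test_passed(response: dict) -> bool:
    """True if the visible reply is 'OK', ignoring case and a trailing period."""
    return extract_reply(response).rstrip(".").upper() == "OK"


# Hand-written example of the expected shape, for illustration only.
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "<think>\n\n</think>\n\nOK"}}
    ]
}
print(extract_reply(sample), smoke_test_passed(sample))
```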

Common install errors and the actual fixes

Every other tutorial waves its hands and says “check your CUDA”. Here are specific errors from GitHub issues, with the real fix.

Error 1: CUDA out of memory on model load. Lower --max-model-len first – it directly controls KV cache pre-allocation, and the default is often too high for the card you’re on. Still OOM? Switch to a pre-quantized AWQ or GPTQ variant (search Hugging Face for Qwen3-8B-AWQ) and use --quantization awq or --quantization gptq. One thing to watch: --quantization int4 is not a valid vLLM value – valid options as of vLLM 0.17.x include awq, gptq, fp8, and bitsandbytes.

Error 2: FP8 runs slowly on A100. A100 has no hardware FP8 Tensor Cores, so passing --quantization fp8 works but cannot use native FP8 compute and runs slowly. Use INT8 on A100 instead. RTX 4090 does have FP8 support, but its lower memory bandwidth limits throughput compared to H100.

Error 3: The INT4 variant uses MORE memory than FP8. Counterintuitive but real. A documented vLLM issue shows Qwen3-FP8 loading fine on L40S, while the GPTQ-Int4 variant of the same model – using the moe_wna16 kernel – crashes with OOM under identical parameters (see vLLM issue #37080). If you’re sizing for a smaller card and assuming “smaller quantization = less VRAM,” test the FP8 variant first.

Pro tip: Run nvidia-smi in a separate terminal during your first request. If GPU utilization stays below 30%, your quantization flag probably didn’t take effect – vLLM falls back silently rather than crashing.

It’s worth pausing here to think about what quantization actually is and isn’t. Choosing a lower bit width (INT4 over FP8 over FP16) doesn’t automatically mean less memory – the overhead from specialized kernels, MoE routing buffers, and KV cache allocation can easily flip that assumption. The right quantization depends on your specific GPU’s hardware capabilities and the model’s architecture. There’s no universal “smallest is cheapest” rule.
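The KV cache term is the one the --max-model-len fix in Error 1 targets, and it is easy to estimate by hand. The sketch below uses the usual formula (keys plus values, per layer, per KV head, per token). The Qwen3-8B shape values are assumptions for illustration, so confirm them against the model’s config.json before relying on the numbers, and note that vLLM’s real allocation also depends on --gpu-memory-utilization and batch size.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for `tokens` tokens (2 = one key + one value tensor per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


# Assumed Qwen3-8B attention shape; verify in the model's config.json.
QWEN3_8B = dict(num_layers=36, num_kv_heads=8, head_dim=128)

for max_len in (32768, 16384, 8192):
    gib = kv_cache_bytes(max_len, **QWEN3_8B) / 2**30
    print(f"--max-model-len {max_len}: ~{gib:.2f} GiB for one full-length sequence")
```

Under these assumptions, halving --max-model-len halves the cache a single full-length sequence needs, which is why it is the first knob to turn.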

The thinking-mode trap nobody mentions

Qwen 3’s headline feature is the dual-mode design. The Qwen3 technical report (arXiv:2505.09388) describes it as a unified framework integrating thinking mode for multi-step reasoning and non-thinking mode for rapid responses, with dynamic switching based on user queries or chat templates.

Here’s the trap if you’re using the Alibaba Cloud API rather than self-hosting: the open-source Qwen3 models don’t support non-streaming output in thinking mode. If thinking is enabled but the response carries no thinking trace, billing applies at the non-thinking rate. Per Alibaba Cloud’s Model Studio docs, you have to stream the response to actually see the thinking output. Worth knowing before you spend an afternoon debugging “why isn’t the model thinking”.
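When you do stream, the thinking text and the answer text arrive as separate pieces and have to be stitched together on the client. The sketch below assumes OpenAI-style server-sent events (data: lines ending in [DONE]) and assumes the thinking text arrives in a delta field named reasoning_content. Both are assumptions to verify against the Model Studio docs, and the sample lines are hand-written.

```python
import json


def split_stream(sse_lines):
    """Separate thinking text from answer text in OpenAI-style streaming chunks."""
    thinking, answer = [], []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        delta = choices[0].get("delta", {})
        # Assumption: thinking text is delivered under "reasoning_content".
        thinking.append(delta.get("reasoning_content") or "")
        answer.append(delta.get("content") or "")
    return "".join(thinking), "".join(answer)


# Hand-written sample stream, for illustration only.
sample_lines = [
    'data: {"choices": [{"delta": {"reasoning_content": "User wants OK. "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "OK"}}]}',
    "data: [DONE]",
]
print(split_stream(sample_lines))
```

An empty first element in the returned pair is the symptom described above: thinking was requested, but no trace came back.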

Upgrade and uninstall

vLLM upgrades: pip install -U vllm, then restart the server. Re-check CUDA compatibility each time – vLLM bumps minor versions aggressively, and a new wheel may pin a higher CUDA than your current driver. Model weights live at ~/.cache/huggingface/hub; run huggingface-cli scan-cache to see what’s there and how much disk it’s using.

For Ollama: ollama rm qwen3:8b removes a specific model. Full removal on Linux – stop the service, delete the binary, clear the share directory:

sudo systemctl stop ollama && sudo rm -rf /usr/local/bin/ollama /usr/share/ollama

Either way, don’t forget the model cache – it can hit 100+ GB if you’ve pulled multiple sizes.
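To see how much disk the cache is actually holding before you delete anything, a short directory walk is enough. This is a generic sketch rather than a replacement for huggingface-cli scan-cache. It skips symlinks on the assumption that the Hugging Face cache links snapshot files to shared blobs, so each blob is counted once.

```python
import os
from pathlib import Path


def dir_size_bytes(root) -> int:
    """Sum the sizes of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # count each underlying blob only once
                total += os.path.getsize(path)
    return total


hf_cache = Path.home() / ".cache" / "huggingface" / "hub"
if hf_cache.exists():
    print(f"{hf_cache}: {dir_size_bytes(hf_cache) / 1e9:.1f} GB")
else:
    print(f"{hf_cache} not found")
```

Point `dir_size_bytes` at /usr/share/ollama as well if you installed Ollama as a service.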

FAQ

Can I run Qwen3-235B locally?

Not really. You’d need roughly 8× H100 to hold the full 235B weights in VRAM. For single-GPU local use, Qwen3-32B or Qwen3-30B-A3B is the practical ceiling.

What’s the difference between Qwen3-30B-A3B and Qwen3-32B?

Same neighborhood of capability, very different runtime. Qwen3-30B-A3B is a Mixture-of-Experts model: it activates roughly 3B parameters per token, which makes inference faster – but you still need the full 30B weights in memory. Qwen3-32B is dense: slower per token, simpler to reason about, and historically better-supported in niche inference frameworks. Throughput bottleneck? Pick the MoE. Compatibility concerns with a specific tool? Pick the dense one.

Does Qwen 3 really handle 1 million tokens?

Depends on the variant. Base Qwen3 is 32K native, extendable to 131K via YaRN. The 1M figure applies specifically to the Qwen3-2507 series (256K native, 1M with scaling). Don’t budget hardware around the 1M window as a routine use case.
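The 32K-to-131K step is simple arithmetic: the YaRN factor is the target window divided by the native one. The sketch below builds a rope_scaling fragment from that ratio. The key names follow the convention used in Qwen3’s published configs as best I recall, so check them against the model card before pasting into config.json. The function name is made up for this example.

```python
NATIVE_CONTEXT = 32768  # base Qwen3 native window, per the answer above


def yarn_rope_scaling(target_context: int, native_context: int = NATIVE_CONTEXT) -> dict:
    """Build a rope_scaling fragment whose factor stretches native to target context."""
    factor = max(1.0, target_context / native_context)  # never scale below 1x
    return {
        "rope_type": "yarn",
        "factor": factor,
        "original_max_position_embeddings": native_context,
    }


print(yarn_rope_scaling(131072))
```

Use the smallest factor that covers your real prompts rather than the largest one the model supports.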

Next step: match your GPU to the table above, run the corresponding vllm serve command, and test with the curl call. If you hit one of the three errors, the fix is already here.