
Deploy Qwen3 Locally: Chinese Open Source LLM Guide

Step-by-step deployment of Qwen3, the Chinese open source LLM from Alibaba. Real VRAM numbers, install commands, and the gotchas tutorials skip.

8 min read · Intermediate

Hot take: most people deploying a Chinese open source LLM reach for Qwen3-235B and immediately regret it. The 30B-A3B MoE variant gives you 90% of the quality at a fraction of the VRAM, and the 8B dense model runs on a laptop GPU. Pick the variant that matches your hardware, not the one with the biggest number on the model card.

This guide walks you through deploying Qwen3 locally – the actual commands, the actual VRAM numbers, and the install traps that nobody mentions until you hit them. We’ll skip the marketing tour of Alibaba’s family tree and get straight to running the thing.

Which Qwen variant should you actually deploy?

Before you pull anything, match the model to your GPU. The numbers below are real measurements at Q4_K_M quantization with an 8K context window – not theoretical floor specs.

| Variant | VRAM at Q4 | Realistic GPU | Use when |
|---|---|---|---|
| Qwen3-8B | ~4.6 GB | RTX 3060 12GB | Single-user chat, drafting |
| Qwen3-14B | ~8.3 GB | RTX 4070 / 4060 Ti 16GB | Coding, longer reasoning |
| Qwen3-30B-A3B (MoE) | ~16.8 GB | RTX 4090 / L40S | Best quality-per-VRAM ratio |
| Qwen3-235B-A22B | ~132 GB | 8x H100/H200 | Production, near-frontier quality |

Per community VRAM benchmarks (2026): Qwen3-8B needs ~4.6 GB at Q4, the 14B needs ~8.3 GB, the MoE 30B-A3B lands at ~16.8 GB, and the full 235B-A22B needs ~132 GB. One critical clarification on that last row: the “22B active” refers to the expert parameters activated per forward pass – not the total model size. The full 235B weights still need to live in VRAM. Plan hardware off the total size column, not the active figure.
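
If you want to sanity-check a variant that isn't in the table, a rough rule of thumb (a back-of-envelope estimate, not an official Qwen figure): Q4_K_M works out to roughly 4.5 bits per weight, so total parameters in billions × ~0.56 gives gigabytes of weights, plus a bit more for KV cache.

# Back-of-envelope check -- assumes ~4.5 bits/weight for Q4_K_M, ignores KV cache
echo "8.2 * 4.5 / 8" | bc -l    # ~4.6 GB, lines up with the Qwen3-8B row
echo "235 * 4.5 / 8" | bc -l    # ~132 GB, lines up with the 235B-A22B row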

That 30B-A3B number is what surprises most people. A single RTX 4090 running a 30B-class model – with quality close to the flagship – is not a setup most tutorials bother to highlight. The MoE architecture makes it possible: only a fraction of parameters fire per token, so compute is cheap even though storage isn’t. For most local deployments that aren’t serving dozens of concurrent users, this is the right call.

System requirements (the honest version)

  • OS: Linux (recommended), macOS with Apple Silicon, Windows via WSL2
  • CPU: Any modern x86_64 or ARM64 – CPU is not the bottleneck
  • RAM: 16 GB minimum for 8B models, 32 GB+ for 30B MoE with offloading
  • Disk: 10-250 GB depending on variant and quantization
  • GPU: NVIDIA with CUDA 12+ recommended; ROCm works on AMD via llama.cpp
  • Python: 3.10-3.12, with transformers>=4.51.0 required (as of the April 2025 release) for the Hugging Face path
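
Before downloading any weights, it's worth confirming the Python side meets that transformers requirement – a quick check using standard pip and Python introspection:

# Confirm the transformers version meets the Qwen3 requirement
pip install -U "transformers>=4.51.0"
python -c "import transformers; print(transformers.__version__)"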

If you’re going the vLLM route, one CUDA gotcha upfront: as of mid-2025, vLLM 0.17.x pre-built wheels require CUDA 12.9 (based on PyTorch 2.10.0). Older CUDA versions require building from source. For Qwen3.6 specifically, vllm>=0.19.0 is recommended per the Qwen3.6-27B model card.
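
To see what you're actually working with before deciding whether a source build is in your future, check both the CUDA version your PyTorch wheel was built against and the driver-side version – standard torch and nvidia-smi introspection, nothing Qwen-specific:

# CUDA version the installed PyTorch wheel was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Driver-side CUDA version (top-right corner of the output)
nvidia-smi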

Install via Ollama (the 60-second path)

Install Ollama, pull, run. Three commands.

# macOS
brew install ollama
ollama serve &

# Linux one-liner
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run
ollama pull qwen3:8b
ollama run qwen3:8b

That works. But here’s the part the official tutorials skip:

Critical setting: Override Ollama’s defaults before your first real prompt. Per the Qwen3 README, the default num_ctx=2048 with num_predict=-1 can cause erratic generation behavior – the model silently truncates context and starts looping with no error message. Inside the chat, run /set parameter num_ctx 32768 and /set parameter num_predict 4096 before sending anything real.

Skip that step and you’ll get truncations and infinite loops with nothing in the logs to explain why. It’s the single most common “Qwen3 is broken” report, and it’s always this.
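
If you'd rather bake those parameters in than retype them every session, one option is a derived model via a Modelfile – this is standard Ollama functionality, though the qwen3-32k name here is just an example:

# Persist the context settings in a derived model
cat > Modelfile <<'EOF'
FROM qwen3:8b
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
EOF
ollama create qwen3-32k -f Modelfile
ollama run qwen3-32k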

Install via vLLM (the production path)

Ollama wraps llama.cpp and is built for convenience. vLLM uses PagedAttention and is built for throughput. Different tools, different jobs – don’t treat them as interchangeable.

# Fresh venv recommended
python -m venv qwen-env && source qwen-env/bin/activate
pip install -U vllm

# Serve Qwen3-8B on a single 24GB card
vllm serve Qwen/Qwen3-8B \
 --port 8000 \
 --max-model-len 32768 \
 --reasoning-parser qwen3

# Test it
curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model":"Qwen/Qwen3-8B","messages":[{"role":"user","content":"ping"}]}'

For tool calling, add --enable-auto-tool-choice --tool-call-parser qwen3_coder. To disable thinking mode server-side rather than per-request, add --reasoning-parser qwen3 --default-chat-template-kwargs '{"enable_thinking": false}'.
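
With those flags set, tool calls ride on the standard OpenAI-style request shape. A minimal sketch – get_weather is a hypothetical function used for illustration, not something the server provides:

# Tool-call request sketch against the server started above
curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
   "model": "Qwen/Qwen3-8B",
   "messages": [{"role": "user", "content": "What is the weather in Hangzhou?"}],
   "tools": [{
     "type": "function",
     "function": {
       "name": "get_weather",
       "description": "Get the current weather for a city",
       "parameters": {
         "type": "object",
         "properties": {"city": {"type": "string"}},
         "required": ["city"]
       }
     }
   }]
 }'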

First-time configuration

Qwen3 runs in two modes that affect output quality and latency in opposite directions. Thinking mode has the model reason step by step before answering – slower, better for hard problems. Non-thinking mode skips that internal chain-of-thought and returns fast responses – good for simple Q&A or high-throughput pipelines where depth isn’t the priority. According to the Qwen3 release announcement, both modes are available within the same model weights; you switch via the enable_thinking parameter in your request body, or set it server-side as shown above.
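
Per-request, the toggle travels in the request body. With vLLM's OpenAI-compatible server this is typically passed via chat_template_kwargs – a sketch, assuming the server from the previous section:

# Ask for a fast, non-thinking answer on a per-request basis
curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
   "model": "Qwen/Qwen3-8B",
   "messages": [{"role": "user", "content": "Summarize this ticket in one sentence."}],
   "chat_template_kwargs": {"enable_thinking": false}
 }'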

For long-context work – the kind that actually justifies running these models locally – Qwen3-Thinking-2507 supports 256K context understanding, extendable up to 1 million tokens via YaRN scaling. You won’t get that out of the box; you need YaRN scaling enabled in config.json and a serving framework that respects it. How often you’ll actually hit that ceiling depends entirely on your workload. Most chat and coding use cases stay under 32K. The million-token figure is real, but it’s not a reason to over-provision hardware.
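
For reference, the YaRN block the Qwen model cards describe adding to config.json looks roughly like the fragment below – the factor and original_max_position_embeddings values are illustrative and should be taken from the model card for your specific variant and target length:

"rope_scaling": {
  "rope_type": "yarn",
  "factor": 4.0,
  "original_max_position_embeddings": 262144
}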

Verify it works

Three quick checks to confirm a healthy install:

  1. Version check: ollama --version or vllm --version
  2. GPU residency: Run nvidia-smi while a request is in flight (see the commands after this list). GPU utilization stuck below 50%? Layers have silently offloaded to CPU – you’re running 5-10x slower than you should be.
  3. Round-trip test: Send a 2,000-token prompt and ask for a summary. Completes cleanly with no truncation? Your context settings stuck.
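
For check 2, these two commands cover both backends – ollama ps reports how the loaded model is split between GPU and CPU, and nvidia-smi shows live utilization:

# Is the model actually resident on the GPU?
ollama ps                 # look for "100% GPU" in the PROCESSOR column
watch -n 1 nvidia-smi     # utilization and VRAM while a request is in flight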

Common errors and fixes

OOM on model load. Per the Spheron deployment guide: reduce --max-model-len first – it directly controls KV cache pre-allocation. Still OOM? Switch to a pre-quantized AWQ or GPTQ variant (search HuggingFace for Qwen3-8B-AWQ) and pass --quantization awq or --quantization gptq. Note: int4 is not a valid vLLM --quantization value; valid options are awq, gptq, fp8, and bitsandbytes.
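
Put together, a memory-constrained serve command might look like the sketch below – the AWQ repo name and the 0.90 utilization cap are typical values, not requirements:

# Tighter memory budget: pre-quantized weights, shorter KV cache, explicit VRAM cap
vllm serve Qwen/Qwen3-8B-AWQ \
 --quantization awq \
 --max-model-len 8192 \
 --gpu-memory-utilization 0.90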

FP8 flag does nothing on A100. The --quantization fp8 flag requires hardware FP8 Tensor Cores, which the A100 lacks – use an INT8 checkpoint instead. H100, H200, RTX 4090, and L40S all support FP8 in hardware. Passing the flag on an A100 won’t error out; it just silently underperforms.
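
Not sure what your card supports? Compute capability is a reasonable proxy – 8.9 (Ada) and 9.0 (Hopper) ship FP8 Tensor Cores, while 8.0 (A100) does not:

# Query compute capability; 8.9 or higher generally means hardware FP8
nvidia-smi --query-gpu=name,compute_cap --format=csv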

Gibberish output with Qwen3.6. Upgraded to the Qwen3.6 family and getting garbage tokens? Check your CUDA driver first. CUDA 13.2 has a known bug that produces gibberish outputs with Qwen3.6 – documented by the Buildfastwithai team in their 2026 review. Roll back to 12.x or move to a confirmed-working 13.x release.

Qwen3.6 won’t load in Ollama. This one bites a lot of users. Ollama currently does not support Qwen3.6 GGUFs – the issue is separate mmproj vision files that Ollama’s GGUF loader can’t handle. Switch to llama.cpp or Unsloth Studio. If staying on Ollama matters, stick with the Qwen3 (April 2025) or Qwen3-2507 line for now.

Upgrade and uninstall

Ollama upgrades: ollama pull qwen3:8b again – it diffs against your local copy. Remove a model: ollama rm qwen3:8b. Full uninstall on Linux: sudo systemctl stop ollama && sudo rm /usr/local/bin/ollama && sudo rm -r /usr/share/ollama.

vLLM is just a Python package. pip install -U vllm upgrades; pip uninstall vllm removes. Model weights live under ~/.cache/huggingface/hub/ – delete that directory to reclaim disk.

Alibaba ships Qwen updates on a rolling cadence, so pinning to an early-2025 build leaves quality on the table. The -2507 refresh (released July-August 2025 per the Qwen3 GitHub changelog) covers the 235B-A22B-Thinking, 30B-A3B-Instruct, and 4B Instruct/Thinking variants – same hardware budget, measurable quality improvement.

FAQ

Can I commercially use Qwen3 without paying Alibaba?

Yes. All dense Qwen3 models – 32B, 14B, 8B, 4B, 1.7B, 0.6B – and the MoE variants ship under Apache 2.0. Build products, sell services, no royalties.

How does Qwen3 compare to DeepSeek-R1 and Llama on real benchmarks?

The Qwen team reports the 235B-A22B trades blows with DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro on coding, math, and general tasks. Take that as the vendor’s claim – verify on your own workload before committing infrastructure. The full methodology is in the Qwen3 Technical Report (arXiv:2505.09388), which is worth reading if you’re deciding between this and another frontier-class open model.

Should I use Ollama or vLLM?

Ollama for laptops, prototyping, and one-user-at-a-time chat. vLLM when you need concurrent requests, batching, or an OpenAI-compatible API for an application. The short version: Ollama = convenience, vLLM = throughput.

Next step: open a terminal, run ollama pull qwen3:8b && ollama run qwen3:8b, set /set parameter num_ctx 32768 on first launch, and throw a real workload at it. You’ll know within 10 minutes whether the 8B fits your needs or you need to size up.