
Deploy SGLang v0.5.12: Structured Generation LLM Guide

Install SGLang v0.5.12 and serve structured-output LLMs with XGrammar-backed JSON schemas, regex, and EBNF constraints - with real install pitfalls fixed.


If you’ve ever asked an LLM to “return valid JSON” and watched it confidently drop a closing brace on row 47 of a batch job, you already know why structured generation matters. SGLang solves that at the token level – it doesn’t ask the model nicely, it constrains the sampler so the next token must keep the output schema-valid. Deploy it once and your json.loads() stops being a coin flip.

This guide gets SGLang v0.5.12 running on a single NVIDIA GPU, then wires up a real structured-generation request with a Pydantic schema. No marketing tour – straight to the install failure modes that eat an afternoon if nobody warned you.

Why SGLang for structured generation

Up to 6.4x higher throughput than the leading inference systems of the time. That’s the headline from the NeurIPS 2024 paper, which introduced RadixAttention for KV cache reuse and compressed finite state machines for faster constrained decoding. It’s a UC Berkeley / Stanford research project that escaped the lab and went wide.

The structured-generation piece is the part most tutorials gloss over. Three grammar backends ship with SGLang: XGrammar (the default – handles JSON schema, regex, and EBNF), Outlines (JSON schema and regex only), and Llguidance (JSON schema, regex, EBNF). XGrammar is the one you want for performance. Turns out the speedup is serious: the MLC team’s November 2024 benchmarks show up to 14x faster JSON-schema generation and up to 80x in CFG-guided generation vs prior engines.
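To make "constraining the sampler" concrete, here is a toy sketch of the idea. It is not how XGrammar is implemented: the vocabulary, the logits, and the TEMPLATE format are all made up for illustration, and a real backend tracks grammar state over the model's full tokenizer vocabulary.

```python
# Toy illustration only: real backends mask over the model's full tokenizer
# vocabulary and track grammar state far more cleverly than this.
TEMPLATE = "ddd-dd"  # hypothetical format: 'd' = any digit, other chars are literals


def allowed(position, token):
    """Return True if `token` may appear at `position` in the output."""
    if position >= len(TEMPLATE):
        return False
    slot = TEMPLATE[position]
    return token.isdigit() if slot == "d" else token == slot


def constrained_greedy(vocab, logits_per_step):
    """Greedy decoding that only considers tokens the template allows."""
    output = ""
    for logits in logits_per_step:
        # The "mask": drop every token that would break the format.
        masked = {
            token: score
            for token, score in zip(vocab, logits)
            if allowed(len(output), token)
        }
        output += max(masked, key=masked.get)
    return output


vocab = ["a", "7", "-", "3"]
# The fake "model" always prefers "a", which is never valid here.
logits_per_step = [[5.0, 1.0, 2.0, 0.5]] * len(TEMPLATE)

unconstrained = "".join(vocab[row.index(max(row))] for row in logits_per_step)
constrained = constrained_greedy(vocab, logits_per_step)
print(unconstrained)  # aaaaaa
print(constrained)    # 777-77
```

The model's preferences never change between the two runs. Only the set of tokens it is allowed to pick from does, which is why the constrained output cannot drift off-format.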

As of mid-2025, SGLang runs on over 400,000 GPUs in production – described on the project’s documentation homepage as the dominant open-source LLM inference engine by deployment scale. If you’re choosing between vLLM and SGLang and constrained decoding is the priority, SGLang is the safer bet.

System requirements

GPU first. FlashInfer – the default attention kernel – only supports sm75 and above. That means T4, A10, A100, L4, L40S, H100, H200, B200. Older cards won’t work. On any GPU where FlashInfer misbehaves, the fallback flag is --attention-backend triton --sampling-backend pytorch – keep it handy.
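A quick way to check the sm75 floor before installing anything is to ask the driver. The sketch below assumes your nvidia-smi is recent enough to support the compute_cap query field (older drivers may not), and the helper names are hypothetical.

```python
import subprocess

MIN_CAPABILITY = (7, 5)  # sm75, the FlashInfer floor mentioned above


def parse_compute_cap(text):
    """Turn a string like '8.0' into a (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def parse_gpu_report(csv_text):
    """Parse 'name, compute_cap' lines into (name, capability, supported) tuples."""
    report = []
    for line in csv_text.strip().splitlines():
        name, cap_text = line.rsplit(",", 1)
        cap = parse_compute_cap(cap_text)
        report.append((name.strip(), cap, cap >= MIN_CAPABILITY))
    return report


def query_gpu_report():
    """Ask nvidia-smi for every GPU's name and compute capability.

    Assumption: the installed driver supports the `compute_cap` query field.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_report(result.stdout)
```

Call query_gpu_report() on the target box; any entry whose last field is False is a card where you should expect to need the Triton fallback flags or different hardware.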

Component | Minimum | Recommended
GPU | NVIDIA sm75+ (T4, A10) | A100 / H100 / H200 / B200
VRAM (7B model, fp16) | 16 GB | 24 GB+
CUDA | 12.x (use -cu12 image) | 13.0 (default as of v0.5.12)
Python | 3.10 | 3.11
RAM | 32 GB | 64 GB+
Disk | 50 GB (for model weights) | 200 GB NVMe
OS | Ubuntu 22.04 | Ubuntu 22.04 / 24.04

The catch: as of v0.5.12, the default CUDA environment is 13.0 and PyTorch ships as 2.11 (up from 2.9). CUDA 12 users need images with the -cu12 or -cu129 suffix, otherwise the Docker image will silently mismatch your driver.

Install SGLang v0.5.12 (pip path)

The official install page recommends uv over plain pip – faster dependency resolution, fewer version conflicts. From a clean Python 3.11 virtualenv on Ubuntu 22.04:

# Step 1 - base tooling
pip install --upgrade pip
pip install uv

# Step 2 - install SGLang (pulls torch, flashinfer, xgrammar, sgl-kernel)
uv pip install sglang

# Step 3 - verify the import works
python3 -m sglang.check_env

That check_env step? Almost nobody runs it first. It prints your CUDA version, GPU compute capability, PyTorch build, FlashInfer version, and XGrammar version – exactly the info you’d otherwise hunt for after something breaks.

To pin to v0.5.12 specifically (recommended for production):

git clone -b v0.5.12 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python"

Or skip the dependency hell – use Docker

Honestly? Unless you have a specific reason to run on bare metal, just use the image.

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<your-hf-token>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 --port 30000

Two things people miss: --shm-size 32g is not optional (the scheduler uses shared memory between worker processes), and SGLang ships with CUDA 13 by default – CUDA 12 users need lmsysorg/sglang:latest-cu129 instead of :latest.
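If you script your deployments, the tag choice can be derived from the "CUDA Version" that nvidia-smi prints in its header. This is a rough sketch: the pick_image helper is hypothetical, it only knows the two tags named above, and it assumes the header format of current drivers.

```python
import re


def pick_image(nvidia_smi_output):
    """Choose an SGLang image tag from the driver's reported CUDA version.

    Only the two tags mentioned in this guide are considered.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("could not find 'CUDA Version' in nvidia-smi output")
    major = int(match.group(1))
    if major >= 13:
        return "lmsysorg/sglang:latest"
    if major == 12:
        return "lmsysorg/sglang:latest-cu129"
    raise ValueError(f"driver reports CUDA {major}.x; upgrade the driver first")
```

Feed it the text of a plain nvidia-smi call (for example via subprocess.run) and pass the result to docker run in place of the hard-coded tag.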

First request: actually using structured generation

Server’s up on port 30000. Now the interesting part. SGLang exposes an OpenAI-compatible /v1/chat/completions endpoint, which accepts schemas through response_format, but the most direct route is the native /generate endpoint, where you pass the constraint in sampling_params yourself.

import json, requests
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    amount_usd: float
    line_items: list[str]

# Pydantic emits a standard JSON Schema dict for the model
schema = Invoice.model_json_schema()

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Extract invoice fields:\n\nACME Corp billed us $1,240.50 for hosting and SSL.\n\nJSON:",
        "sampling_params": {
            "max_new_tokens": 256,
            "temperature": 0,
            # The schema is passed as a JSON string, not a nested object
            "json_schema": json.dumps(schema),
        },
    },
)
resp.raise_for_status()  # fail loudly on HTTP errors before parsing
print(json.loads(resp.json()["text"]))

That json_schema field is the whole point. XGrammar compiles it into a token mask and applies it at every decoding step, so the output is guaranteed to parse as long as generation finishes before max_new_tokens runs out. Swap in "regex": r"\d{3}-\d{2}-\d{4}" for an SSN-style format, or "ebnf": "..." for a full context-free grammar.
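One caveat worth coding for: a grammar can only guarantee a valid document if the model gets to finish it. If generation stops because it hit max_new_tokens, you can be left with a valid prefix that json.loads() rejects. The sketch below assumes the /generate response carries a meta_info.finish_reason object with a "type" of "length" on truncation; treat that field layout as an assumption and confirm it against a real response from your server.

```python
import json


class TruncatedOutput(RuntimeError):
    """Raised when generation stopped on the token limit, not on the grammar."""


def parse_constrained(response_json):
    """Parse the generated text, refusing output that was cut off mid-document.

    Assumption: the response looks like
    {"text": "...", "meta_info": {"finish_reason": {"type": "stop" | "length"}}}.
    """
    finish = (response_json.get("meta_info") or {}).get("finish_reason") or {}
    if isinstance(finish, dict) and finish.get("type") == "length":
        raise TruncatedOutput("hit max_new_tokens before the JSON was complete")
    return json.loads(response_json["text"])


complete = {
    "text": '{"vendor": "ACME Corp", "amount_usd": 1240.5, "line_items": ["hosting", "SSL"]}',
    "meta_info": {"finish_reason": {"type": "stop"}},
}
invoice = parse_constrained(complete)
print(invoice["vendor"])  # ACME Corp
```

In a batch job, catching TruncatedOutput and retrying with a larger max_new_tokens is usually cheaper than debugging a JSONDecodeError three stages downstream.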

Watch out: Only one constraint parameter – json_schema, regex, or ebnf – can be set per request (per the official structured-outputs docs). Setting two doesn’t throw a clear error; it picks one and silently ignores the other. If your output looks unconstrained, check this first.
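Since the server won't complain loudly, it is cheap to enforce the one-constraint rule on the client. A minimal sketch, with build_sampling_params as a hypothetical helper name; the regex is the SSN-style pattern from above and the EBNF grammar is a made-up yes/no example.

```python
import json


def build_sampling_params(max_new_tokens=256, temperature=0.0,
                          json_schema=None, regex=None, ebnf=None):
    """Build sampling_params for /generate, allowing at most one constraint."""
    constraints = {"json_schema": json_schema, "regex": regex, "ebnf": ebnf}
    active = {key: value for key, value in constraints.items() if value is not None}
    if len(active) > 1:
        raise ValueError(
            f"set only one of json_schema, regex, ebnf; got {sorted(active)}"
        )

    params = {"max_new_tokens": max_new_tokens, "temperature": temperature}
    for key, value in active.items():
        # json_schema travels as a JSON string; regex and ebnf are plain strings.
        if key == "json_schema" and not isinstance(value, str):
            value = json.dumps(value)
        params[key] = value
    return params


ssn_params = build_sampling_params(regex=r"\d{3}-\d{2}-\d{4}")
yes_no_params = build_sampling_params(ebnf='root ::= "yes" | "no"')
```

Passing both regex= and json_schema= now raises a ValueError before the request ever leaves your process.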

Verify the install works

Three checks. If any fails, stop and fix it – the failure cascades get ugly fast.

  1. Health endpoint: curl http://localhost:30000/health should return 200.
  2. Version sanity: python3 -c "import sglang; print(sglang.__version__)" should print the version you installed.
  3. Smoke generation: POST to /generate with {"text": "Once upon a time", "sampling_params": {"max_new_tokens": 16, "temperature": 0}} and check you get coherent text back.
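The health check is the one worth automating, because the server takes a while to load weights and a too-early curl looks like a failure. Here is a minimal polling sketch using only the standard library; the function names, attempt count, and delay are arbitrary choices, not anything SGLang prescribes.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(base_url="http://localhost:30000", timeout=5.0):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(probe=http_health_probe, attempts=30, delay=2.0,
                       sleep=time.sleep):
    """Poll `probe` until it succeeds.

    Returns the 1-based attempt number that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Run wait_until_healthy() right after starting the container and only move on to the smoke generation once it returns a number instead of None.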

Step 3 returns tokens but step 1 doesn’t return 200? Routing issue, not an SGLang issue – check that --host 0.0.0.0 was passed and nothing else is squatting on port 30000.

Install errors most tutorials skip

Error: OSError: CUDA_HOME environment variable is not set

Hits during sgl-kernel build. Fix: export CUDA_HOME=/usr/local/cuda-13.0 (adjust for your version). If you only have the CUDA runtime – not the full toolkit – install FlashInfer separately first, then SGLang on top.
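If you aren't sure what to export, the toolkit root can usually be inferred from where nvcc lives. A small sketch follows; the lookup order is my own choice, and /usr/local/cuda is the conventional symlink, not something guaranteed on every distro.

```python
import os
import shutil


def find_cuda_home(env=None, which=shutil.which, is_dir=os.path.isdir):
    """Best-effort guess at the CUDA toolkit root; returns None if not found."""
    env = os.environ if env is None else env

    # 1. Respect an explicit setting.
    if env.get("CUDA_HOME"):
        return env["CUDA_HOME"]

    # 2. nvcc lives at <CUDA_HOME>/bin/nvcc, so walk two levels up.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(os.path.realpath(nvcc)))

    # 3. Fall back to the conventional symlink.
    if is_dir("/usr/local/cuda"):
        return "/usr/local/cuda"
    return None
```

If find_cuda_home() returns None, you most likely have the runtime without the toolkit, which is the second case described above.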

Error: fatal error: cuda_fp8.h: No such file or directory

Nasty one. FlashInfer JIT-compiles kernels at first launch and needs cuda_fp8.h – even when you installed from a pre-compiled wheel, even when you pass --disable-cuda-graph. The flag doesn’t save you (GitHub issue #5389). Fix: install the full CUDA Development Toolkit on the host, not just the runtime. Or run the Docker image, which already includes the headers.
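You can find out whether you're about to hit this before the server's first launch does it for you. The check below simply looks for the header under the toolkit's include directory; it assumes the standard <CUDA_HOME>/include layout, and missing_cuda_headers is a hypothetical helper.

```python
from pathlib import Path


def missing_cuda_headers(cuda_home, headers=("cuda_fp8.h",)):
    """Return the headers that are absent from <cuda_home>/include."""
    include_dir = Path(cuda_home) / "include"
    return [name for name in headers if not (include_dir / name).is_file()]
```

An empty list from missing_cuda_headers("/usr/local/cuda") means the JIT step has what it needs; a non-empty one means install the development toolkit or switch to Docker.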

Server hangs at startup with no error

Multi-tenant GPU box with NVIDIA MPS running? That’s likely the bug. SGLang hangs indefinitely under the NVIDIA MPS daemon – no error produced, process never completes startup (GitHub issue #22192). Both vLLM and TEI work fine under the same MPS setup. Only SGLang trips here. Workaround: stop the MPS daemon before launching, or accept driver time-slicing instead.
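Because the hang produces no error, it helps to check for the daemon up front. The sketch below scans the process list for nvidia-cuda-mps-control, which is the MPS control daemon's binary name; the helper names are hypothetical and the ps invocation assumes Linux.

```python
import subprocess

MPS_DAEMON = "nvidia-cuda-mps-control"


def mps_daemon_in(ps_output):
    """Return True if any process line mentions the MPS control daemon."""
    return any(MPS_DAEMON in line for line in ps_output.splitlines())


def mps_daemon_running():
    """Check the live process table (Linux `ps` syntax assumed)."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return mps_daemon_in(result.stdout)
```

If mps_daemon_running() is True on a shared box, coordinate with whoever owns the daemon before stopping it, since other tenants' jobs may depend on it.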

Blackwell B300/GB300: ptxas fatal: Value 'sm_103a' is not defined

Triton’s bundled ptxas doesn’t know about SM103 yet. One-line fix from the official install docs:

export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

Points Triton at the system CUDA’s newer ptxas binary. Done.

FlashInfer cache corruption after upgrade

Kernels behaving weirdly after an upgrade? Nuke the JIT cache:

pip3 install --upgrade flashinfer-python --force-reinstall --no-deps
rm -rf ~/.cache/flashinfer

Upgrade and uninstall

Upgrading is normally uv pip install -U sglang. But there’s a release-cadence detail worth knowing. After a December 2025 incident – a release shipped with an RC dependency that broke certain models, prompting a v0.5.6.post2 revert (GitHub issue #14964) – the project moved to a ~3-week release cadence. New model support rolls through nightly Docker images first before being promoted to an official release.

Practical upshot: if your shiny new model from this week isn’t supported on the stable release, you need the nightly image (lmsysorg/sglang:nightly-dev-cu13-<date>). The stable pip install sglang may be 2-3 weeks behind.

# Full uninstall
pip uninstall sglang sgl-kernel flashinfer-python xgrammar
rm -rf ~/.cache/flashinfer ~/.cache/sglang
# Docker: docker rmi lmsysorg/sglang:latest

FAQ

Does SGLang need a GPU to use the structured generation features?

No. pip install sglang[openai] lets you build structured-generation programs that target the OpenAI API from a CPU-only laptop. GPU required only when serving a local model.

Can I combine json_schema with regex in the same request?

You can’t – only one of json_schema, regex, or ebnf may be set per request. If your schema needs both – say, a JSON object where one field must match a phone-number pattern – encode the constraint inside the JSON schema itself using the "pattern" keyword. JSON Schema’s pattern field expresses regex constraints within the schema structure, so you stay within a single json_schema call.
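As a rough illustration, here is what that looks like with a hand-written schema. The phone format is a made-up example, and how a given grammar backend treats the ^ and $ anchors is worth testing on your own install before relying on it.

```python
import json
import re

# Hypothetical format for illustration: 555-867-5309
PHONE_PATTERN = r"^\d{3}-\d{3}-\d{4}$"

contact_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # The regex constraint rides inside the schema via "pattern".
        "phone": {"type": "string", "pattern": PHONE_PATTERN},
    },
    "required": ["name", "phone"],
}

# Only json_schema is set, so the one-constraint rule is respected.
sampling_params = {
    "max_new_tokens": 128,
    "temperature": 0,
    "json_schema": json.dumps(contact_schema),
}


def phone_ok(candidate):
    """Client-side double check of the same pattern."""
    return re.fullmatch(PHONE_PATTERN, candidate) is not None
```

Keeping phone_ok() around as a post-hoc check costs nothing and catches the case where a backend quietly ignores a schema keyword it doesn't support.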

How does SGLang compare to vLLM if I only care about structured outputs?

Both ship XGrammar as a backend now, so raw constrained-decoding speed is similar. Where SGLang pulls ahead is the combination with RadixAttention – if your structured-output workload reuses prompt prefixes (system prompts, few-shot examples, RAG passages), the prefix-cache hit reduces time-to-first-token meaningfully. Single-shot one-off requests with no shared prefix? The gap closes. Benchmark on your actual prompt distribution before committing either way.
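Before benchmarking, a back-of-envelope check can tell you whether your workload has any prefix to reuse at all. The sketch below measures the shared leading text across a batch of prompts. It works on characters, not tokens, and only sees a prefix common to every prompt, so treat the number as a crude proxy; shared_prefix_ratio is a hypothetical helper, not part of SGLang.

```python
import os


def shared_prefix_ratio(prompts):
    """Fraction of all prompt characters covered by the prefix common to every prompt."""
    if not prompts:
        return 0.0
    total = sum(len(prompt) for prompt in prompts)
    if total == 0:
        return 0.0
    # os.path.commonprefix compares character by character, despite its name.
    prefix_len = len(os.path.commonprefix(prompts))
    return prefix_len * len(prompts) / total


prompts = [
    "SYSTEM: extract.\nDoc A",
    "SYSTEM: extract.\nDoc B",
]
print(round(shared_prefix_ratio(prompts), 2))  # 0.95
```

A ratio near zero suggests prefix caching won't buy you much; a high one is a hint to keep the shared system prompt and few-shot examples at the very start of every request so the cache can actually match them.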

Run this against your install: python3 -m sglang.bench_serving --backend sglang --model Qwen/Qwen2.5-7B-Instruct --port 30000 --dataset-name random --random-input 512 --random-output 256 --max-concurrency 16. Your hardware, your prompts – that number means more than any benchmark in a blog post.