Here’s a detail most Vicuna tutorials skip: the project hasn’t shipped a new model in over two years. Vicuna v1.5, based on Llama 2 with 4K and 16K context lengths, was released back in August 2023 – and it’s still the latest. The repo is alive, but the model line froze. That changes how you should think about deploying it.
Vicuna is still useful, though. It’s a clean, well-documented open source chatbot that runs locally, exposes an OpenAI-compatible API, and gives you a predictable baseline for evaluating newer models against. Below is how to actually get it running in 2026 – without copy-pasting the deprecated v0 delta-weight commands that still litter Google’s top results.
What you’re actually deploying
Vicuna is not a standalone binary. You’re deploying FastChat, the serving framework from LMSYS, and pointing it at the Vicuna weights on Hugging Face. Per the FastChat README, the infrastructure powers Chatbot Arena at lmarena.ai and has served over 10 million chat requests across 70+ LLMs – so the serving layer is battle-tested even if the Vicuna model itself isn’t being updated.
The model itself: Vicuna was created by fine-tuning a Llama base model on approximately 125K user-shared conversations from ShareGPT.com. For v1.5, the base is Llama 2. The often-quoted quality figure comes from the original LMSYS evaluation of the first Vicuna-13B release, whose total score reached 92% of ChatGPT’s across 80 questions judged by GPT-4. That is a 2023 benchmark of an earlier Vicuna against a 2023 ChatGPT, so don’t read it as a current comparison.
System requirements (the real ones)
The official memory numbers are a starting point, not a promise. Per the FastChat README, standard FP16 inference requires around 14 GB of GPU memory for Vicuna-7B and 28 GB for Vicuna-13B. With 8-bit quantization, roughly half that. With 4-bit (GPTQ or AWQ), roughly a quarter.
| Setup | Model | GPU VRAM | RAM | Disk |
|---|---|---|---|---|
| Minimum (4-bit) | Vicuna-7B AWQ/GPTQ | ~6 GB | 16 GB | 10 GB |
| Recommended | Vicuna-7B FP16 | 14 GB | 32 GB | 15 GB |
| Full quality | Vicuna-13B FP16 | 28 GB | 32 GB | 26 GB |
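The FP16, 8-bit, and 4-bit figures above follow mostly from arithmetic on the weights. Here is a rough sketch of that calculation, assuming nominal parameter counts of 7 and 13 billion (the real counts differ slightly) and counting weights only. Activations and the KV cache come on top, which is why the README's 28 GB for 13B and the table's ~6 GB for 4-bit 7B sit above what this prints.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    # Nominal sizes are an assumption; they are not exact parameter counts.
    for name, params in [("Vicuna-7B", 7), ("Vicuna-13B", 13)]:
        for bits in (16, 8, 4):
            print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

Treat the result as a lower bound: if a card's VRAM is close to the weights-only number, expect trouble once generation starts.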
OS-wise: Linux is the path of least resistance. The FastChat README notes that bitsandbytes, the package the --cpu-offloading option depends on, is only available on Linux. macOS and Windows users can still run FP16, but that offloading path won’t be there.
Install FastChat and pull Vicuna v1.5
Two paths: pip (fast) or source (for training or editing serving code). Pip is enough for inference.
# Create a clean Python 3.10 environment first
conda create -n vicuna python=3.10 -y
conda activate vicuna
# Install FastChat with model worker dependencies
pip install "fschat[model_worker,webui]"
# Optional: bitsandbytes for --cpu-offloading (Linux only)
pip install bitsandbytes accelerate
For the source install (needed if you want to fine-tune or edit serving code):
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install --upgrade pip
pip install -e ".[model_worker,webui]"
You don’t need to manually download weights. Pass lmsys/vicuna-7b-v1.5 as the --model-path argument and FastChat will fetch the weights from Hugging Face on first run (~13 GB download). Skip any tutorial that tells you to run fastchat.model.apply_delta – that was the v0/v1.1 LLaMA-1 workflow and it does not apply to v1.5, which ships as full weights.
First run and verification
The simplest sanity check is the CLI chat:
python3 -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.5
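VRAM-constrained? Add --load-8bit to turn on 8-bit compression, and --cpu-offloading to move weights that don’t fit on the GPU into system RAM. Per the FastChat README, --cpu-offloading requires 8-bit compression to be enabled and bitsandbytes to be installed, which makes it Linux-only.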
python3 -m fastchat.serve.cli \
  --model-path lmsys/vicuna-7b-v1.5 \
  --load-8bit \
  --cpu-offloading
For production, you want the three-component server: controller, model worker, and OpenAI-compatible API server. Open three terminals:
# Terminal 1 - controller
python3 -m fastchat.serve.controller
# Terminal 2 - model worker
python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
# Terminal 3 - OpenAI-compatible API
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
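If juggling three terminals gets tedious, the same three commands can be started from one script. This is a minimal sketch, not part of FastChat: `build_commands`, `launch_stack`, and the five-second pause between components are all hypothetical choices made for illustration, and the pause is a guess rather than a documented startup time.

```python
import subprocess
import sys
import time


def build_commands(model_path: str, host: str = "localhost", port: int = 8000) -> list[list[str]]:
    """Return the controller, worker, and API server commands in start order."""
    py = sys.executable  # the interpreter that has FastChat installed
    return [
        [py, "-m", "fastchat.serve.controller"],
        [py, "-m", "fastchat.serve.model_worker", "--model-path", model_path],
        [py, "-m", "fastchat.serve.openai_api_server", "--host", host, "--port", str(port)],
    ]


def launch_stack(model_path: str, startup_delay_s: float = 5.0) -> list[subprocess.Popen]:
    """Start each component in order, pausing between them (the delay is an assumption)."""
    procs = []
    for cmd in build_commands(model_path):
        procs.append(subprocess.Popen(cmd))
        time.sleep(startup_delay_s)
    return procs
```

Calling `launch_stack("lmsys/vicuna-7b-v1.5")` returns the three process handles, so the caller decides when to `terminate()` them. Separate terminals remain easier for reading each component's logs.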
Test the API works:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
A JSON response with a choices[0].message.content field means you’re live. Any OpenAI-compatible client – LangChain, LlamaIndex, the OpenAI Python SDK with a custom base_url – can now point at localhost:8000 and speak to Vicuna directly.
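For a client with no extra dependencies, the same request can be built with Python's standard library. The sketch below assumes the server is running on localhost:8000 as started above; `build_chat_request` and `chat` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # host and port from the openai_api_server command


def build_chat_request(prompt: str, model: str = "vicuna-7b-v1.5") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout_s: float = 120.0) -> str:
    """Send the request and return the text in choices[0].message.content."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the stack up, `print(chat("Say hello in one sentence."))` should print the model's reply. Keeping request construction separate from sending makes the payload easy to inspect when something goes wrong.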
Common errors and the actual fixes
The single most reported issue is CUDA OOM on cards the docs claim should work. In GitHub Issue #1761, a user on a 16 GB GPU reports the model loads but runs out of memory during inference – even with --load-8bit --cpu-offloading and --max-gpu-memory 14GiB set. The cause is memory fragmentation, not raw capacity.
Fix: Before launching, export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:2000. This workaround, discussed in FastChat Issue #657, reduces fragmentation by stopping PyTorch’s caching allocator from splitting cached blocks larger than the given size (in MB), so large free blocks stay available for large requests. It often turns a crash into a working session without touching any FastChat flags.
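If you would rather not change your shell environment, the variable can be set for a single FastChat process. This is a sketch under that assumption: `env_with_alloc_conf` and `run_cli` are hypothetical helpers, and 2000 is simply the value from the workaround above, not a tuned recommendation.

```python
import os
import subprocess
import sys


def env_with_alloc_conf(max_split_size_mb: int = 2000, base_env=None) -> dict:
    """Copy the environment and add the allocator setting, leaving the parent shell untouched."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:{max_split_size_mb}"
    return env


def run_cli(model_path: str = "lmsys/vicuna-7b-v1.5") -> int:
    """Run the FastChat CLI chat with the allocator setting applied to that process only."""
    cmd = [sys.executable, "-m", "fastchat.serve.cli", "--model-path", model_path]
    return subprocess.run(cmd, env=env_with_alloc_conf()).returncode
```

The variable has to be in the environment before the process initializes CUDA, which is why it is passed at launch rather than set later from inside Python.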
Three more failure modes:
- “bitsandbytes not compiled with GPU support” – almost always a CUDA mismatch. Reinstall PyTorch matching your driver’s CUDA version, then reinstall bitsandbytes.
- Tokenizer errors on load – if you migrated from a v1.1 cache, blow it away. The v1.5 tokenizer is slightly different and stale cache files cause silent failures.
- Model worker registers then disappears – the worker needs to reach the controller. On multi-machine setups, pass --host 0.0.0.0 and the controller’s reachable URL via --controller-address.
Should you even deploy Vicuna in 2026?
Honest question worth asking before you spend a weekend on this. Vicuna v1.5 is a Llama 2 fine-tune from August 2023. Llama 3, Llama 3.1, Mistral, Qwen 2.5, and DeepSeek have all shipped since. For raw chatbot quality, you should expect a current instruction-tuned model of similar size to outperform Vicuna-7B on benchmarks such as MT-Bench.
So why bother? Three reasons remain valid: you need a stable, reproducible baseline for research; you’re studying the ShareGPT-style instruction-tuning recipe; or you’re running FastChat itself (which is still maintained and supports dozens of newer models – Vicuna is just one of them). If none of those apply, swap --model-path lmsys/vicuna-7b-v1.5 for --model-path meta-llama/Llama-3.1-8B-Instruct and the rest of this guide still works, with two caveats: that repo is gated, so you need approved Hugging Face access before the download will start, and the memory figures scale with the larger parameter count.
Upgrade and uninstall
Upgrading FastChat itself: pip install --upgrade fschat. There is no model upgrade to chase – v1.5 is terminal. If you want better behavior, switch base models, not Vicuna versions.
To remove cleanly:
pip uninstall fschat
rm -rf ~/.cache/huggingface/hub/models--lmsys--vicuna-7b-v1.5
conda deactivate && conda env remove -n vicuna
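The Hugging Face cache is the part most people forget. Each model you tried stays on disk until you delete it (about 13 GB for Vicuna-7B alone), so after a few experiments it can quietly eat 30+ GB.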
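To see what the cache is actually holding before deleting anything, a short script can total each model folder. This is a sketch that assumes the default cache location used in the rm command above (it moves if you have set HF_HOME or a similar variable); `hub_cache_sizes` is a hypothetical helper name.

```python
from pathlib import Path


def hub_cache_sizes(cache_dir: Path) -> dict[str, float]:
    """Map each cached model folder (models--org--name) to its size in GB."""
    sizes = {}
    for model_dir in sorted(cache_dir.glob("models--*")):
        total_bytes = sum(
            f.lstat().st_size
            for f in model_dir.rglob("*")
            # Skip symlinks so files that are linked from elsewhere in the folder count once.
            if f.is_file() and not f.is_symlink()
        )
        sizes[model_dir.name] = total_bytes / 1e9
    return sizes


if __name__ == "__main__":
    default_cache = Path.home() / ".cache" / "huggingface" / "hub"  # assumed default path
    for name, gb in hub_cache_sizes(default_cache).items():
        print(f"{name}: {gb:.1f} GB")
```

Any folder it lists can be removed with the same rm -rf pattern shown above.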
FAQ
Is Vicuna free for commercial use?
No – and get legal review before you try. The Llama 2 base license allows commercial use under conditions, but the ShareGPT-derived training data introduces additional restrictions that LMSYS has flagged as research-only. That combination is murky enough that shipping it in a paid product without a lawyer’s sign-off is a real risk.
Can I run Vicuna-13B on a single 16 GB GPU?
With 4-bit AWQ or GPTQ quantization, technically yes. But expect quality degradation. A more practical call: run Vicuna-7B at FP16 on the 16 GB card. If you hit OOM mid-generation despite the model loading fine, that’s the fragmentation issue from the errors section above – set PYTORCH_CUDA_ALLOC_CONF before retrying.
What’s the difference between FastChat and Vicuna?
FastChat is the serving framework. Vicuna is one model that runs on it. The FastChat README lists support for Llama 2, Alpaca, Baize, ChatGLM, Falcon, and many others – Vicuna happens to be the flagship example, not the only option. If Vicuna’s age is a problem, swap the model path and keep everything else.
Next step: spin up the controller + worker + API server stack from the “First run and verification” section, then point curl at localhost:8000/v1/models and confirm vicuna-7b-v1.5 appears in the response. That’s your green light to wire it into whatever client you actually care about.
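That final check can be scripted too. The sketch below assumes the /v1/models response follows the OpenAI list shape, meaning a top-level "data" array whose entries each carry an "id". That is what an OpenAI-compatible server is expected to return, but verify it against your own output. `served_model_ids` and `is_model_live` are hypothetical helper names.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body (assumed shape)."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def is_model_live(name: str, base_url: str = "http://localhost:8000/v1") -> bool:
    """Fetch the models list and report whether `name` is among the served ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return name in served_model_ids(json.load(resp))
```

With the stack running, `is_model_live("vicuna-7b-v1.5")` returning True is the same green light as spotting the name in the curl output.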