Deploy LLaVA Open Source Vision Model: 2026 Install Guide

Step-by-step LLaVA-1.5 install on Ubuntu + CUDA, with VRAM specs, real GitHub URLs, and fixes for the flash-attn errors most guides skip.

Jordan West · 2026-06-06 · 7 min read · Intermediate

By the end of this guide you’ll have LLaVA-1.5 7B running on an Ubuntu box with a single NVIDIA GPU, answering questions about images you feed it from the command line. Not a hosted demo. Not a Colab notebook. Local inference, your weights, your hardware.

The deployment path for the LLaVA open source vision model is well-trodden but full of small traps – wrong fork, broken flash-attn, silent Windows quantization failures. We’ll walk through each one in order.

Pick the right LLaVA repo first (this trips up most guides)

There isn’t one “LLaVA” anymore. As of mid-2026, three actively-referenced repos exist, and following install steps from the wrong one wastes hours:

Repo	What it is	Use when
haotian-liu/LLaVA	The original. LLaVA-1.5 codebase, NeurIPS 2023 paper. Last major update 2023.	You want the canonical, well-documented install
LLaVA-VL/LLaVA-NeXT	LLaVA-1.6, LLaVA-OneVision, LLaVA-Video. The current active fork.	You need better OCR, higher-res images, or video
EvolvingLMMs-Lab/LLaVA-OneVision-2	Next-gen 8B multimodal model (released April 2026) unifying image, long-form video, and spatial reasoning.	You’re training, not just inferring, and have a multi-GPU node

It’s a bit like the npm ecosystem circa 2016: one package name, three forks, all with slightly incompatible install paths. The original repo is stable and well-documented. The active development has moved. Knowing which you’re cloning before you type git clone saves a debugging session.

This tutorial uses haotian-liu/LLaVA for LLaVA-1.5 – most stable for a first deployment. Want LLaVA-1.6 instead? The same conda environment works; only the model name changes at load time.

System requirements (with the real VRAM numbers)

Skip this section and you’ll find out the hard way that your 8GB card can’t load the 13B model.

OS: Ubuntu 20.04 or compatible Linux (as of 2026). Windows is not officially tested – WSL2 is your alternative. macOS support exists but without GPU acceleration.
GPU VRAM: 8 GB minimum for the 7B model with 4-bit quantization; 16 GB for the 13B model. With 4-bit quantization, the 13B can squeeze onto 12 GB – but expect slower responses.
CUDA: Driver supporting CUDA 11.8 minimum. Check with nvidia-smi.
Python: 3.10 is what the maintainers test against. 3.8+ works.
Disk: 30+ GB free where Hugging Face caches models. The LLaVA-1.6 13B weights alone are 25 GB.
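Git LFS: Needed only if you fetch weights by cloning a Hugging Face model repo with git; the automatic download on first run doesn’t use it. If you do clone, run git lfs install after installing Git LFS.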

One Windows note pulled directly from the docs: 4-bit and 8-bit quantization are NOT supported on Windows (as of the current README). If your GPU has less than 24 GB of VRAM, you’ll be relying on quantization, so switch to WSL2 now – native Windows won’t get you there.
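To check a machine against the VRAM figures above before installing anything, a rough sketch like the following can help. The function name vram_verdict and its output labels are made up for this guide, and the thresholds are loose approximations of the 8 / 12 / 16 GB figures, since cards often report slightly less than their nominal size.

```shell
# Map total VRAM (in MiB) to the largest setup the figures above allow.
# Hypothetical helper; thresholds are approximate (a nominal 8 GB card
# reports roughly 8192 MiB).
vram_verdict() {
  if [ "$1" -ge 16000 ]; then
    echo "13b"
  elif [ "$1" -ge 12000 ]; then
    echo "13b-4bit"
  elif [ "$1" -ge 8000 ]; then
    echo "7b-4bit"
  else
    echo "too-small"
  fi
}

# Query the first GPU if nvidia-smi is available; otherwise say so.
if command -v nvidia-smi >/dev/null 2>&1; then
  total_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits 2>/dev/null | head -n 1)"
  case "$total_mib" in
    ''|*[!0-9]*) echo "could not read VRAM from nvidia-smi" ;;
    *) echo "GPU 0: ${total_mib} MiB -> $(vram_verdict "$total_mib")" ;;
  esac
else
  echo "nvidia-smi not found: install the NVIDIA driver first"
fi
```

Treat the verdict as a starting point, not a guarantee: image resolution and context length also affect peak memory.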

Install LLaVA-1.5 step by step

# 1. Clone the repo
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# 2. Create the conda environment (Python 3.10)
conda create -n llava python=3.10 -y
conda activate llava

# 3. Editable install
pip install --upgrade pip # enable PEP 660 support
pip install -e .

# 4. Training/inference extras (optional but recommended)
pip install -e ".[train]"
pip install ninja # lets the flash-attn build use all CPU cores
pip install flash-attn --no-build-isolation

That last line is where most installs break. Flash-attn compiles from source – and without ninja installed first, that compilation runs on a single CPU core. Two hours. With ninja it’s 3-5 minutes on a 64-core machine (per the flash-attn PyPI page). Make sure ninja is installed before it: pip install ninja. Under 96 GB RAM? Prefix the flash-attn install with MAX_JOBS=4, or ninja can launch enough parallel compile jobs to exhaust memory mid-build.
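The RAM rule above can be turned into a small dry run. This sketch only prints the commands so you can review them before running anything; flash_attn_cmd is a hypothetical helper name, and the /proc/meminfo lookup assumes Linux.

```shell
# Print the flash-attn install command suited to a machine with $1 GB of RAM.
# Below 96 GB, cap parallel compile jobs at 4 via MAX_JOBS.
flash_attn_cmd() {
  if [ "$1" -lt 96 ]; then
    echo "MAX_JOBS=4 pip install flash-attn --no-build-isolation"
  else
    echo "pip install flash-attn --no-build-isolation"
  fi
}

# Read total RAM in GB (Linux only); default to 0 so we take the cautious path.
ram_gb=0
if [ -r /proc/meminfo ]; then
  ram_gb="$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)"
fi

# Nothing is installed here: these are the commands to run, in order.
echo "pip install ninja"
flash_attn_cmd "${ram_gb:-0}"
```

Copy the two printed lines into your activated llava environment once they look right.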

Use conda, not bare pip or uv. The flash-attn build needs nvcc in PATH. A conda environment can supply it (through the cuda-nvcc package). A plain pip or uv virtual environment can’t, so the build depends on whatever system CUDA toolkit happens to be installed – that’s the root cause behind flash-attention issue #1736. If you hit CUDA_HOME environment variable is not set during the build, this is why.

Sort out your cache location before the 25 GB download starts

LLaVA pulls weights from Hugging Face on first use. On a personal machine this is fine. On a shared HPC or university cluster – where your home directory might have a 20 GB quota – it is not fine.

# Redirect the HF cache before running anything
export HF_HOME="$(pwd)/huggingface_cache"
mkdir -p "$HF_HOME"

# Persist across sessions
echo "export HF_HOME=\"$HF_HOME\"" >> ~/.bashrc

The download won’t warn you about space. It will die halfway through with a disk-full error, leaving partial weight files that confuse the loader on the next attempt. Set HF_HOME first, every time, on any system that isn’t your personal workstation.
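Since the download gives no warning, it’s worth checking free space yourself first. The sketch below is one way to do that with df; has_free_gb is a hypothetical helper, and the 30 GB figure is simply the disk requirement quoted earlier in this guide.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
# Hypothetical helper built on POSIX df output (column 4 = available KB).
has_free_gb() {
  avail_kb="$(df -Pk "$1" | awk 'NR == 2 {print $4}')"
  [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Where the weights will land: HF_HOME if set, else the Hugging Face default.
target="${HF_HOME:-$HOME/.cache/huggingface}"

# The cache directory may not exist yet, so probe its nearest existing parent.
probe="$target"
while [ ! -d "$probe" ]; do
  probe="$(dirname "$probe")"
done

if has_free_gb "$probe" 30; then
  echo "OK: at least 30 GB free for $target"
else
  echo "Less than 30 GB free for $target: point HF_HOME somewhere else"
fi
```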

Verify it works

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit

First run downloads the weights – 10 to 30 minutes depending on your connection. Type a question at the prompt. Response in a few seconds? GPU memory sitting at ~6-7 GB on nvidia-smi? You’re done. That’s a healthy install.

For a Gradio web UI instead of CLI, the repo includes a controller + worker + UI setup under llava/serve/. The README covers the three commands needed.

Common errors and the fixes that actually work

Error: ImportError: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEE... when importing flash_attn

Flash-attn was compiled against a different torch version than what’s currently installed. The fix (from flash-attention issue #1348):

pip uninstall flash-attn -y
pip install --no-build-isolation flash-attn==2.5.6 -U --force-reinstall

The catch: pinning flash-attn==2.5.6 can silently pull in a different torch version and re-trigger the exact same mismatch. Pin both. Install torch with the exact CUDA suffix you need first, then flash-attn with --no-build-isolation.
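Here is roughly what “pin both” looks like in practice. This is a sketch that prints the commands and runs none of them. The torch version is a placeholder (use whatever your LLaVA checkout pins in pyproject.toml), cuda_tag is a hypothetical helper, and the -U --force-reinstall flags are left off deliberately so pip has less reason to touch torch.

```shell
# Turn a CUDA version such as "11.8" into the PyTorch wheel suffix "cu118".
cuda_tag() {
  echo "cu$(echo "$1" | tr -d '.')"
}

# Placeholder versions: replace with the ones your setup actually needs.
TORCH_VERSION="2.1.2"
CUDA_VERSION="11.8"

# Print the installs in order: torch first, then flash-attn built against it.
echo "pip install torch==${TORCH_VERSION} --index-url https://download.pytorch.org/whl/$(cuda_tag "$CUDA_VERSION")"
echo "pip install flash-attn==2.5.6 --no-build-isolation"

# Afterwards, confirm torch was not swapped out underneath you.
echo "python -c 'import torch; print(torch.__version__, torch.version.cuda)'"
```

If the version printed by the last command differs from the one you pinned, the mismatch is back and flash-attn needs rebuilding.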

Error: OSError: CUDA_HOME environment variable is not set

Quick check: which nvcc. If that returns nothing, you’re either not in a conda environment or nvcc isn’t installed. Fix: conda install -c nvidia cuda-nvcc, or point CUDA_HOME at your system CUDA install manually.
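Pointing CUDA_HOME at an existing install can be sketched as below. It assumes the usual layout where nvcc lives in a bin/ directory under the toolkit root, and it falls back to /usr/local/cuda only because that is the conventional system location; cuda_home_from_nvcc is a hypothetical helper.

```shell
# Derive CUDA_HOME from an nvcc path: <root>/bin/nvcc -> <root>.
cuda_home_from_nvcc() {
  dirname "$(dirname "$1")"
}

# Prefer nvcc on PATH; otherwise try the conventional system location.
nvcc_path="$(command -v nvcc 2>/dev/null || true)"
if [ -z "$nvcc_path" ] && [ -x /usr/local/cuda/bin/nvcc ]; then
  nvcc_path=/usr/local/cuda/bin/nvcc
fi

if [ -n "$nvcc_path" ]; then
  CUDA_HOME="$(cuda_home_from_nvcc "$nvcc_path")"
  export CUDA_HOME
  export PATH="$CUDA_HOME/bin:$PATH"
  echo "CUDA_HOME=$CUDA_HOME"
else
  echo "nvcc not found: try 'conda install -c nvidia cuda-nvcc'"
fi
```

Run it in the same shell session as the flash-attn install, since the exports don’t survive a new terminal.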

Import errors after git pull

Re-run pip install -e . after any pull. New code, new dependencies – the editable install doesn’t auto-update. If flash-attn errors return, add --no-cache-dir to force a clean rebuild.

Upgrading and removing LLaVA

# Upgrade
cd LLaVA && git pull && pip install -e .

# Remove
conda deactivate
conda env remove -n llava
rm -rf LLaVA/
rm -rf "${HF_HOME:-$HOME/.cache/huggingface}" # only if you don't need weights for other HF models

That last rm -rf is the one people skip. The conda env removal leaves 25+ GB of weights sitting in your cache. If you’re keeping other HF models, go into $HF_HOME/hub/ and delete only the LLaVA-specific folders there.
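For the selective cleanup, it helps to list the LLaVA folders and their sizes before deleting anything. The sketch below relies on the hub cache naming folders models--<org>--<repo>; list_llava_cache is a hypothetical helper, and it only matches checkpoints published under the liuhaotian account.

```shell
# List cached LLaVA checkpoints under a Hugging Face hub directory ($1).
# Hub folders are named models--<org>--<repo>.
list_llava_cache() {
  find "$1" -maxdepth 1 -type d -name 'models--liuhaotian--llava-*' 2>/dev/null | sort
}

hub_dir="${HF_HOME:-$HOME/.cache/huggingface}/hub"

# Show each match with its size; nothing is deleted here.
list_llava_cache "$hub_dir" | while IFS= read -r dir; do
  du -sh "$dir"
done
```

Once the list looks right, remove each folder with rm -rf. LLaVA-1.5 also caches its CLIP vision tower, most likely under models--openai--clip-vit-large-patch14-336; check that nothing else uses it before removing that one.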

A broader question worth sitting with before you go further: is self-hosting actually the right call here? Serving LLaVA 7B requires a machine that’s always on, VRAM you can’t use for other workloads, and maintenance time every time a dependency changes. For low-volume use cases, a hosted API might be cheaper in practice. Local deployment wins when data privacy matters, inference volume is high, or you need to fine-tune on proprietary images.

FAQ

Can I run LLaVA on a CPU only?

Yes, via llama.cpp, which supports LLaVA with quantized weights. Slow – users report anywhere from several seconds to over a minute per response depending on hardware – but workable on a decent laptop with no GPU.

Which version should I actually deploy in 2026 – LLaVA-1.5, 1.6, or OneVision?

For most people doing inference on photographs or screenshots: LLaVA-1.6 (same haotian-liu repo, swap the model name). The OCR and document-reading quality is noticeably better. LLaVA-1.5 if you specifically need the steps in this guide to match the codebase exactly and stability matters more than capability. OneVision-2 is a different install path entirely – only worth it if you’re working with video data or need to run training, not just inference.

Is LLaVA actually free for commercial use?

The code is Apache 2.0 – yes, commercial use allowed. But the weights are a separate question. They inherit the license of the base LLM (Vicuna/LLaMA) and the terms attached to the training data, which includes GPT-4-generated content covered by OpenAI’s terms. It’s a common mistake to assume that “open source model” means the weights are also freely licensable. Often they’re not. Check both the code license AND the license of the specific checkpoint you’re loading before shipping anything to production.

Next step: Once your CLI test passes, swap liuhaotian/llava-v1.5-7b for liuhaotian/llava-v1.6-mistral-7b and run the same query on a document image with small text. That’s where the 1.6 OCR improvements actually show up.