
Fine Tune LLM Open Source: Install Axolotl v0.14.0 Guide

Step-by-step Axolotl v0.14.0 install guide for fine-tuning open-source LLMs. Real commands, real errors, and the flash-attn build trap.

7 min read · Intermediate

Most people trying to fine-tune open-source LLMs with Axolotl in late 2025 are following a tutorial from early 2024, and the first command fails. The repo moved. The install switched from pip to uv. Transformers jumped from v4 to v5. If your error message mentions OpenAccess-AI-Collective, your guide predates the rename.

This walkthrough installs Axolotl v0.14.0 the way the maintainers currently recommend (as of mid-2025), then covers the three install failures that actually show up in the wild – not in the docs.

What changed in v0.14.0

The big one: Transformers v4 is gone. v0.14.0 pins Transformers v5 – a breaking change if you’ve been pinning older versions in your environment. Beyond that: scattermoe integration (cuts VRAM for Mixture-of-Experts models), EAFT support, and selectable MoE kernels (batched_mm, grouped_mm).

That Transformers jump is why half the Medium posts from 2024 silently produce dependency conflicts. The package resolves, installs, then breaks at runtime. Not on import – usually 20 minutes into your first training run.
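
Because the failure shows up late, it can help to check the installed Transformers major version before launching anything. A minimal sketch, assuming python on your PATH is the one from the environment you installed into; is_transformers_v5 is a hypothetical helper name, not part of Axolotl:

```shell
# Hypothetical helper: succeed only if a version string has major version 5 or higher.
is_transformers_v5() {
  local major="${1%%.*}"              # "5.0.1" -> "5"
  case "$major" in
    ''|*[!0-9]*) return 2 ;;          # empty or non-numeric input
  esac
  [ "$major" -ge 5 ]
}

# Read the installed version from the active environment, if there is one.
installed="$(python -c 'import importlib.metadata as m; print(m.version("transformers"))' 2>/dev/null || true)"
if [ -n "$installed" ]; then
  if is_transformers_v5 "$installed"; then
    echo "transformers ${installed}: v5 line, as v0.14.0 expects"
  else
    echo "transformers ${installed}: older than v5, expect conflicts with v0.14.0"
  fi
fi
```

If nothing is printed, Transformers is not installed in that environment yet.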

Worth knowing: Axolotl isn’t just a hobby tool. A 2025 arXiv paper on trapped-ion quantum compilers used it specifically for its Hugging Face compatibility and its support for full fine-tuning, LoRA, and QLoRA, the same features you’re about to use.

System requirements – where most installs fail before they start

Per the official installation docs (as of mid-2025):

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | NVIDIA Ampere (A10, A100, RTX 3090) or AMD with ROCm 6.2+ | A100 80GB, H100, or B200 |
| Python | 3.10 | 3.11 |
| PyTorch | 2.3.1 | 2.8 or 2.9 |
| CUDA | 12.8 | 12.8 (Hopper/Ampere) / 13.0 (Blackwell) |

Two traps. First: flash-attention requires Ampere or newer for bf16 – Pascal cards (P40, P100) and Turing (T4, RTX 2080) fail with the default install. Second: Blackwell cards (B100, B200, RTX 50-series) need PyTorch 2.9.1 and CUDA 13.0. CUDA 12.8 cannot compile for sm_103a (B300). This isn’t a workaround – it’s a hard requirement per the official docs.

Think of the GPU requirement like a phone OS version. An app compiled for Android 14 won’t run on Android 10 no matter how you configure it – and similarly, flash-attn compiled for Ampere’s sm_80 architecture simply doesn’t have the code paths for Pascal’s sm_61. You can’t patch around it at the YAML level.
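
The architecture rule can be checked before you install anything. A minimal sketch, assuming your driver's nvidia-smi supports the compute_cap query field (recent drivers do); backend_for_cap is a hypothetical helper name, and the cu128/cu130 split simply mirrors the requirements listed above:

```shell
# Hypothetical helper: map a CUDA compute capability (e.g. "8.6") to the
# UV_TORCH_BACKEND value this guide uses, or fail for pre-Ampere cards.
backend_for_cap() {
  local cap="$1"
  local major="${cap%%.*}"            # "8.6" -> "8"
  case "$major" in
    ''|*[!0-9]*) echo "unknown"; return 2 ;;
  esac
  if [ "$major" -ge 10 ]; then
    echo "cu130"                      # Blackwell (sm_100 and up)
  elif [ "$major" -ge 8 ]; then
    echo "cu128"                      # Ampere, Ada Lovelace, Hopper (sm_80 to sm_90)
  else
    echo "unsupported"                # Pascal (sm_6x), Turing (sm_75): no flash-attn
    return 1
  fi
}

# Query the first GPU if nvidia-smi is available; otherwise do nothing.
if command -v nvidia-smi >/dev/null 2>&1; then
  cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1 || true)"
  echo "compute capability ${cap:-?}: $(backend_for_cap "$cap")"
fi
```

A result of unsupported means the flash-attn path is closed for that card, whatever the config says.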

Install: uv method

v0.14.0 is uv-first. The install command from the official docs:

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# cu128 for Hopper/Ampere, cu130 for Blackwell
export UV_TORCH_BACKEND=cu128

uv venv
source .venv/bin/activate
# Quote the extras so shells like zsh don't treat [] as a glob pattern
uv pip install --no-build-isolation "axolotl[deepspeed]"

# Pull example configs
axolotl fetch examples
axolotl fetch deepspeed_configs

The --no-build-isolation flag isn’t optional. Skip it, and uv builds flash-attn against a fresh torch wheel that doesn’t match your venv – which gives you the undefined symbol error at runtime, not at install time. By then you’ve already walked away from the terminal.
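
One way to catch that mismatch while you are still at the terminal is to import flash_attn once, right after the install finishes. A minimal sketch, assuming python resolves to the venv's interpreter; check_flash_attn is a hypothetical helper name:

```shell
# Hypothetical helper: import flash_attn once, right after install, so an ABI
# mismatch shows up now instead of at the start of a training run.
check_flash_attn() {
  if python -c 'import torch, flash_attn; print(torch.__version__, flash_attn.__version__)'; then
    echo "flash-attn imports cleanly"
  else
    echo "flash-attn import failed: rebuild it against the torch in this venv" >&2
    return 1
  fi
}
```

Run check_flash_attn inside the activated venv. The Python traceback is left visible on purpose, since an undefined symbol message is the detail you need.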

Docker: skip the dependency math entirely

# Standard image (Hopper/Ampere)
docker run --gpus '"all"' --rm -it --ipc=host axolotlai/axolotl-uv:main-latest

# Blackwell-specific (B100/B200/RTX 50-series)
docker run --gpus '"all"' --rm -it --ipc=host \
  axolotlai/axolotl-uv:main-py3.11-cu130-2.9.1

On a fresh cloud GPU, Docker is the right call. The CUDA version is pre-matched. No build step. You’re training in under five minutes from first pull.

Pip: still works, more fragile

The PyPI page documents this path – it requires version-pinned build tools before anything else:

pip3 install -U packaging==26.0 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation "axolotl[flash-attn,deepspeed]"

The pinned versions matter. Newer setuptools breaks the flash-attn build silently – the install completes but flash-attn imports fail at training start. This is what every older tutorial gets wrong: they skip the pins, users get inconsistent failures, and nobody knows why.

Verifying the install

Before touching your own dataset, confirm the install with the smallest bundled example – a 1B parameter LoRA run that fits on most GPUs:

axolotl train examples/llama-3/lora-1b.yml

The config targets NousResearch/Llama-3.2-1B with adapter: lora. If training starts and loss decreases, the install is correct. If it errors immediately, you have an environment mismatch – check CUDA version alignment first.
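
One way to check that alignment is to compare the CUDA version PyTorch was built with against the backend you exported. A minimal sketch: torch.version.cuda is the standard PyTorch attribute for the build's CUDA version, while cuda_to_backend is a hypothetical helper name:

```shell
# Hypothetical helper: turn a CUDA version such as "12.8" into a backend tag ("cu128").
cuda_to_backend() {
  printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Compare the CUDA version torch was built against with UV_TORCH_BACKEND.
# Skipped quietly when python or torch is not available in this shell.
if torch_cuda="$(python -c 'import torch; print(torch.version.cuda)' 2>/dev/null)"; then
  built="$(cuda_to_backend "$torch_cuda")"
  if [ "$built" = "${UV_TORCH_BACKEND:-unset}" ]; then
    echo "torch build ${built} matches UV_TORCH_BACKEND"
  else
    echo "mismatch: torch is ${built}, UV_TORCH_BACKEND is ${UV_TORCH_BACKEND:-unset}"
  fi
fi
```

A CPU-only torch build reports None here, which also counts as a mismatch.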

Before any long run, use axolotl preprocess your_config.yml to tokenize the dataset separately. Sequence-length mismatches and template errors surface here in 30 seconds – not six hours in.
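
The two steps can be chained so training never starts on a config that fails tokenization. A minimal sketch using only the axolotl preprocess and axolotl train commands already shown; preflight_then_train is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: tokenize first, and start training only if that succeeds.
preflight_then_train() {
  local config="$1"
  if [ ! -f "$config" ]; then
    echo "config not found: $config" >&2
    return 2
  fi
  axolotl preprocess "$config" || {
    echo "preprocess failed; not starting training" >&2
    return 1
  }
  axolotl train "$config"
}
```

Call it as preflight_then_train your_config.yml. A non-zero exit from the preprocess step stops the run before any GPU time is spent.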

Something 2024 guides could not cover: as of v0.14.0, Axolotl ships documentation built specifically for AI coding agents. Run axolotl agent-docs sft or axolotl agent-docs grpo (or axolotl agent-docs --list to see all topics). Feed the output into the context of Claude Code, Cursor, or Copilot, and the agent can work with your YAML schema without you explaining it.

Honestly, whether that agent-docs feature becomes the main interface for writing Axolotl configs is an open question – but for anyone already using AI-assisted coding, it’s the part worth experimenting with first.

Three errors that actually happen

flash-attn build hangs for over an hour

Reported on Ubuntu 22.04, CUDA 12.6, Python 3.11 (GitHub issue #2374): install stuck at “Building wheel for flash-attn” indefinitely. Two fixes: pin the build deps with the exact versions above, and export MAX_JOBS=4 before install (community workaround – limits parallel compilation so the build doesn’t exhaust available memory). Results vary by machine; this is not documented officially.
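
If a fixed MAX_JOBS=4 doesn't suit your machine, the value can be derived from available memory instead. A rough sketch: pick_max_jobs is a hypothetical helper, and the default of 4 GB per compile job is an assumed budget for illustration, not a figure from the Axolotl or flash-attn docs:

```shell
# Hypothetical helper: choose MAX_JOBS from total RAM (GB) and CPU cores.
# The third argument is an ASSUMED memory budget per compile job, default 4 GB.
pick_max_jobs() {
  local mem_gb="$1" cores="$2" gb_per_job="${3:-4}"
  local jobs="$(( mem_gb / gb_per_job ))"
  if [ "$jobs" -gt "$cores" ]; then jobs="$cores"; fi   # never exceed the core count
  if [ "$jobs" -lt 1 ]; then jobs=1; fi                 # always allow one job
  echo "$jobs"
}

# On Linux, read total RAM and core count, then export the result.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
  mem_gb="$(awk '/^MemTotal:/ {print int($2 / 1048576)}' /proc/meminfo)"
  export MAX_JOBS="$(pick_max_jobs "$mem_gb" "$(nproc)")"
  echo "MAX_JOBS=${MAX_JOBS}"
fi
```

Lower the result by hand if the build still stalls; the right budget depends on the machine.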

ImportError: flash_attn_2_cuda undefined symbol

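GitHub issue #3142: torch 2.6.0+cu124 with flash-attn 2.8.2. The wheel was compiled against a different torch ABI than the one in the environment. Fix: uninstall both, reinstall torch first (the flash-attn build imports it), then reinstall flash-attn with --no-build-isolation and --no-binary flash-attn so it builds locally against your exact torch version. Scope the flag to flash-attn; --no-binary :all: also forces source builds of every other dependency.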

RuntimeError on Pascal or Turing GPUs

P40, P100, T4, RTX 2080: flash attention won’t load. Per GitHub issue #1359, there’s no YAML flag that bypasses this cleanly – the architecture simply doesn’t support it. Options: install without the flash-attn extra, try setting flash_attention: false in your YAML (check whether your version recognizes this key), and accept the throughput penalty. Or rent an Ampere card by the hour.

Upgrading and removing

Upgrading from an older release:

uv pip install --upgrade --no-build-isolation "axolotl[deepspeed]"

The Transformers v5 jump in v0.14.0 will break configs that used deprecated v4 arguments – audit your YAML for renamed keys before the first run after upgrade.

Clean uninstall:

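uv pip uninstall axolotl flash-attn deepspeed
# Destructive: removes the venv and every model Hugging Face has cached, including other projects' downloads
rm -rf .venv ~/.cache/huggingface/hub
# Destructive: removes training outputs along with the fetched examples and DeepSpeed configs
rm -rf ./outputs ./examples ./deepspeed_configs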

The Hugging Face cache is the silent disk hog. Model weights from test runs live in ~/.cache/huggingface/hub until you delete them. A handful of QLoRA experiments on 7B models can consume over 100 GB – delete before you run out of space, not after.
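
To see which models are taking the space before deleting anything, list the cache entries by size. A minimal sketch, assuming the default cache location (or HF_HOME if you have set it); largest_cache_entries is a hypothetical helper name:

```shell
# Hypothetical helper: list the largest entries in a cache directory, biggest first.
# Sizes are in kilobytes, as reported by `du -sk`.
largest_cache_entries() {
  local dir="$1" count="${2:-10}"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Show the five largest entries in the Hugging Face hub cache, if it exists.
cache="${HF_HOME:-$HOME/.cache/huggingface}/hub"
if [ -d "$cache" ]; then
  largest_cache_entries "$cache" 5
fi
```

Each cached model sits in its own models--org--name directory, so individual entries can be removed without wiping the whole cache.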

FAQ

Should I use Axolotl or write my own Transformers Trainer script?

More than two experiments? Use Axolotl. Swapping models, datasets, or training methods is a YAML edit – not a refactor.

I have an RTX 4090 and want to fine-tune a 7B model. Will this work?

The 4090 is Ada Lovelace – post-Ampere – so flash-attn compiles cleanly with CUDA 12.8 and PyTorch 2.8 or 2.9. A 7B model at 4-bit QLoRA is a common community configuration for 24GB VRAM, though exact VRAM fit depends on your sequence length and batch size. Start with micro_batch_size: 1, sequence_len: 2048, and adapter: qlora. Adjust from there based on what your first run actually uses.

Is the old OpenAccess-AI-Collective GitHub URL still valid?

It redirects, but scripts that pin specific commit hashes from that org will fail. Update any references to axolotl-ai-cloud/axolotl.
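
Finding the stale references is a quick grep. A minimal sketch; find_stale_refs and update_ref are hypothetical helper names, and the rewrite only changes the org name in URLs, so pinned commit hashes still need checking by hand:

```shell
# Hypothetical helper: list files under a directory that still mention the old org.
find_stale_refs() {
  grep -rl --exclude-dir=.git "OpenAccess-AI-Collective/axolotl" "${1:-.}"
}

# Hypothetical helper: rewrite one file to the current org name.
# Goes through a temp file instead of `sed -i`, whose syntax differs between GNU and BSD sed.
update_ref() {
  local file="$1" tmp status=0
  tmp="$(mktemp)" || return 1
  sed 's#OpenAccess-AI-Collective/axolotl#axolotl-ai-cloud/axolotl#g' "$file" > "$tmp" \
    && cat "$tmp" > "$file" || status=$?
  rm -f "$tmp"
  return "$status"
}
```

Review the output of find_stale_refs . first, then run update_ref on each file you actually want changed.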

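Next step: run axolotl fetch examples, open examples/llama-3/lora-1b.yml, change datasets.path to your own JSONL file, and launch a 50-step test run. If it completes on a small config, the install, config, and data format are working; a full-size job can still run out of memory, so scale up sequence length and batch size from there.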