
Open Source Voice Cloning: Install XTTS v2 (2026 Guide)

Deploy XTTS v2 for open source voice cloning using the maintained idiap fork. Install steps, the 250-char trap, and fixes for common errors.


Two paths lead to running XTTS in 2026, and only one of them works. Path A is what 90% of tutorials still tell you: pip install TTS. That’s the original coqui-ai/TTS repo – abandoned, capped at Python 3.11, and it throws a RuntimeError the moment you try Python 3.12. Path B is the idiap fork of that repo, published as coqui-tts on PyPI – actively maintained, with prebuilt wheels for macOS, Windows, and Linux, and a new package name to match.

Path B is the only one worth your time for open source voice cloning right now. This guide walks through deploying XTTS v2 via the maintained fork, plus the three install-time traps that eat hours if you don’t know they exist.

What you’re actually installing

XTTS v2 is the open source voice cloning model itself: 17 languages, cloning from a 6-second audio clip, and streaming at under 200ms latency according to the idiap README. The Python package coqui-tts is the inference and training toolkit around it. Latest version on PyPI is 0.27.5 as of early 2026.

The model itself is gated by a license click-through. It’s under the Coqui Public Model License (CPML): the first time you load XTTS v2 in any environment, the loader prompts you to type “y” to accept the terms. Automated deployments need to handle this prompt or the loader hangs forever.

System requirements (the real ones, as of early 2026)

Forget the vague “any modern machine” advice. Here’s what’s actually tested:

Component   Minimum                     Recommended
OS          Linux/macOS/Windows         Ubuntu 24.04
Python      3.10                        3.11 or 3.12
PyTorch     2.2+                        2.4+ with CUDA 12.x
RAM         8 GB (CPU only, painful)    16 GB
VRAM        4 GB (slow inference)       8 GB+ (real-time streaming)
Disk        ~5 GB free                  10 GB (model + cache)

The package is officially tested on Ubuntu 24.04 with Python ≥3.10, <3.15, and PyTorch 2.2+ – and should also work on Mac and Windows per the PyPI page. CPU inference works, but how slow is “slow”? Think batch-overnight slow, not interactive slow. The streaming flag assumes real-time-capable hardware; on CPU it’s more of a suggestion.

Here’s a question worth sitting with before you go further: do you actually need real-time streaming, or will batch generation do? Most open source voice cloning use cases – dubbing, audiobooks, content pipelines – don’t need the <200ms latency at all. That changes which hardware trade-offs matter.

Install coqui-tts (the right way)

I’ll show the uv-based install because it handles the PyTorch backend selection automatically. If you prefer plain pip in a venv, just drop uv from the commands.

# 1. Create an isolated env
python3.11 -m venv xtts-env
source xtts-env/bin/activate  # on Windows: xtts-env\Scripts\activate

# 2. Install uv (faster, smarter resolver)
pip install uv

# 3. Install PyTorch FIRST - this matters from 0.27.4 onward
uv pip install torch torchaudio torchcodec --torch-backend=auto

# 4. Install coqui-tts
uv pip install coqui-tts

Step 3 is the one nobody mentions. From coqui-tts 0.27.4, PyTorch is not included by default – you need to install it yourself first. Older guides skip this and you end up with a half-broken install that imports fine but crashes on first inference.
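Worth a ten-second sanity check before step 4 – if the backend landed wrong, better to find out now than at first inference:

# Prints the torch version and whether CUDA is actually usable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"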

If you’d rather not deal with Python at all, there’s a Docker image. Run docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/idiap/coqui-tts-cpu and you drop into a container with everything preinstalled – start the bundled server from there and port 5002 is already mapped to your host. The GPU image is also published – see the idiap repo for the CUDA tag.

First run: the license prompt and the 2GB download

Create a 6-10 second reference WAV. Format matters more than people admit – per the SillyTavern XTTS docs, the file should be PCM, mono, 22050Hz, 16-bit. Convert via Audacity if your source is anything else. Skip this and the cloned voice sounds like it’s coming through a tin can.
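If you’d rather script the conversion than open Audacity, ffmpeg handles the same resample (source.mp3 stands in for whatever you recorded):

# Any input -> mono, 22050Hz, 16-bit PCM WAV
ffmpeg -i source.mp3 -ac 1 -ar 22050 -c:a pcm_s16le reference.wav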

Then a minimal Python script to test:

import torch
from TTS.api import TTS

# Use the GPU when available; CPU works but is far slower
device = "cuda" if torch.cuda.is_available() else "cpu"

# First load triggers the CPML prompt and the ~2 GB weight download
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Clone the voice in reference.wav and speak the text into output.wav
tts.tts_to_file(
    text="This is a test of open source voice cloning.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav"
)

First run does two things people aren’t ready for. It prompts you to accept the Coqui Public Model License (type y). Then it downloads roughly 2 GB of model weights to your local cache. On a slow connection that’s coffee-and-a-walk territory.

Heads up for container deployments: pre-bake the model into the image so the license prompt never blocks startup. The environment variable COQUI_TOS_AGREED=1 is widely used for automated acceptance – check the current idiap README to confirm it’s still supported in 0.27.5, as env variable behavior has changed between releases.
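A minimal pre-bake sketch for a Dockerfile RUN step – assuming COQUI_TOS_AGREED is still honored in your version, which is exactly what the README check above is for:

import os

# ASSUMPTION: COQUI_TOS_AGREED=1 auto-accepts the CPML prompt -
# verify against the idiap README for your installed version
os.environ["COQUI_TOS_AGREED"] = "1"

from TTS.api import TTS

# Loading once at image-build time caches the ~2 GB of weights,
# so container startup never hits the download or the prompt
TTS("tts_models/multilingual/multi-dataset/xtts_v2")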

Verify it actually works

Three quick checks:

  1. Version: python -c "import TTS; print(TTS.__version__)" – should print 0.27.5 or whatever you installed
  2. Model list: tts --list_models | grep xtts_v2 – confirms the model is registered
  3. Inference: run the script above. Output WAV should be a few hundred KB and play back the cloned voice

If output.wav is suspiciously short or cuts off mid-sentence, you’ve hit the silent character limit – covered next.
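A quick way to quantify “suspiciously short”, using only the standard library:

import wave

# Print the clip length in seconds; if it's far shorter than the
# input text warrants, the character cap truncated the audio
with wave.open("output.wav", "rb") as w:
    print(w.getnframes() / w.getframerate(), "seconds")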

The three errors that will eat your evening

1. Numpy 2.x breaking everything

Symptom: ImportError: numpy.core._multiarray_umath failed to import or a wall of pip dependency conflicts. Turns out numpy 2.0.0 collides with the whole dependency stack at once – tts 0.21.1, gruut, numba, and gradio all break simultaneously, per Discussion #3793 on the original repo. Fix:

pip install "numpy<2.0"

2. Model re-downloads every run

You watch 2 GB pull down, generate audio, restart the script – and it pulls 2 GB again. This is tracked as Issue 4723 on GitHub. Usually a cache path mismatch or permissions issue. The cache should live at ~/.local/share/tts/ on Linux or %LOCALAPPDATA%\tts on Windows – check that your user owns it and the path hasn’t changed between runs.
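On Linux, two commands rule out the permissions case (adjust the path if your distro differs):

# Confirm the cache exists and your user owns it
ls -ld ~/.local/share/tts
# Root-owned cache (e.g. after a sudo pip install)? Reclaim it:
sudo chown -R "$USER" ~/.local/share/tts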

3. Audio truncation with no error

The 250-char cap catches everyone eventually. Long input text produces audio that just stops – XTTS logs “The text length exceeds the character limit of 250 for language […], this might cause truncated audio” as a warning only, so it’s easy to miss in output. The cap is per-language and applies per generation call. Workaround: split your input into sentence chunks under 250 characters and concatenate the WAV outputs, as in the sketch below. Passing split_sentences=False to the generation call disables the built-in sentence splitting, so your manual chunking is the only chunking.
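A minimal chunk-and-stitch sketch, reusing the tts object from the test script earlier – the naive regex split and the long_text variable are illustrative assumptions, not library API:

import re
import wave

def chunk_text(text, limit=250):
    # Naive split on terminal punctuation; a single sentence longer
    # than the limit will still truncate, so split those by hand
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= limit:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

# One WAV per chunk, then stitch - every chunk shares the same
# format, so the stdlib wave module can concatenate frames directly
paths = []
for i, chunk in enumerate(chunk_text(long_text)):
    path = f"chunk_{i}.wav"
    tts.tts_to_file(text=chunk, speaker_wav="reference.wav",
                    language="en", file_path=path)
    paths.append(path)

with wave.open("full_output.wav", "wb") as out:
    for i, path in enumerate(paths):
        with wave.open(path, "rb") as part:
            if i == 0:
                out.setparams(part.getparams())
            out.writeframes(part.readframes(part.getnframes()))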

Upgrade and uninstall

Upgrade is the boring kind:

uv pip install --upgrade coqui-tts

If you’re coming from the old TTS package, uninstall it first or you’ll get import shadowing weirdness:

pip uninstall TTS
uv pip install coqui-tts

Full removal – package, model cache, the lot:

pip uninstall coqui-tts
rm -rf ~/.local/share/tts # Linux
# Windows: rmdir /s %LOCALAPPDATA%\tts

The cache directory is what holds the 2 GB of weights. Skip the rm step and you’ve removed the package but left the heavy stuff sitting on disk.

FAQ

Is XTTS v2 free for commercial use?

No. The Coqui Public Model License restricts commercial use – read the full terms at coqui.ai/cpml.txt before shipping anything paid. The code (MPL 2.0) is fine; the model weights are the catch.

Can I run XTTS on CPU only?

Yes, but expect it to be significantly slower than a modern GPU – slow enough that interactive use becomes impractical. Fine for batch generation that runs overnight. If you’re CPU-only, skip the streaming flags entirely; they assume real-time-capable hardware that CPU can’t provide.

Why does my cloned voice sound robotic?

Almost always a reference audio problem. The 6-second minimum from the docs is the floor, not the target – give it 8-12 seconds of clean speech with no music, no background noise, and no compression artifacts. Re-encode the reference to mono 22050Hz 16-bit PCM, retry, and the difference is usually night and day. Quality of the reference file matters more than its length beyond the minimum.

Next step: grab a 10-second WAV of your own voice, run the verification script above, and listen. If it sounds 80% right out of the gate, fine-tuning on a larger dataset will close most of the remaining gap – that’s a separate guide.