
Stable Diffusion LoRA Training: A Beginner’s Guide That Skips the Fluff

A practical Stable Diffusion LoRA training guide for beginners - Kohya vs cloud, real VRAM costs, and the settings that actually move the needle.


Two paths exist for training a Stable Diffusion LoRA: install Kohya_ss locally on a decent NVIDIA GPU, or rent a cloud GPU and run a notebook. For most beginners, local Kohya_ss wins – but only if you actually have 12GB+ VRAM. Below 12GB, the cloud route saves you days of fighting with out-of-memory errors. Everything else in this Kohya_ss LoRA training guide assumes you’ve picked the path that matches your hardware.

The takeaway, upfront

Use Kohya_ss locally if you own an RTX 3060 12GB or better. Use a cloud notebook (RunPod, MassedCompute, or a Colab fork) if you’re on 8GB or sharing a laptop GPU – cloud instances run a few dollars an hour and skip the dependency headache entirely. As of early 2025, SDXL LoRA training needs at least ~10GB VRAM with aggressive optimization turned on; comfortable training really starts at 16-24GB (per the PropelRC Kohya settings breakdown). Anything less and you’re either renting compute or waiting two days for one epoch.

What LoRA actually changes – and where it quietly breaks

LoRA injects pairs of small low-rank matrices alongside existing layers of the base model, typically the attention layers. The base weights stay frozen; only the small matrices are trained and saved. Output files land between 2 and 200 MB, which is why you can hoard fifty of them without filling a drive.
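To make the "frozen base plus small low-rank matrices" idea concrete, here is a minimal NumPy sketch of a single adapted layer. The layer shape, the rank, and the variable names are illustrative assumptions, not values taken from SDXL or from Kohya's code, and the alpha scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shape and rank, kept small so the example runs instantly.
d_out, d_in, rank = 64, 128, 4

base_weight = rng.standard_normal((d_out, d_in))  # frozen: never updated
lora_down = rng.standard_normal((rank, d_in))     # trainable factor
lora_up = np.zeros((d_out, rank))                 # trainable factor, zero-initialised


def adapted_weight(base, up, down):
    """Effective weight at inference: frozen base plus the low-rank update."""
    return base + up @ down


# Before training the update is all zeros, so the layer behaves like the base.
untrained = adapted_weight(base_weight, lora_up, lora_down)

# Stand-in for training: whatever values the factors take, the update they
# produce can never exceed the chosen rank.
trained_up = rng.standard_normal((d_out, rank))
delta = trained_up @ lora_down

full_params = d_out * d_in           # values in the frozen layer
lora_params = rank * (d_in + d_out)  # values the LoRA actually stores
print(f"LoRA stores {lora_params} values instead of {full_params}")
```

Only the two factors are saved, which is where the small file sizes come from.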

The catch nobody mentions in beginner guides: a LoRA is not equivalent to fine-tuning, even when outputs look similar. Turns out the weight matrices LoRA produces contain what MIT CSAIL researchers call “intruder dimensions” – new high-ranking singular vectors that full fine-tuning never generates (Shuttleworth et al., arXiv:2410.21228). Practically, this means a LoRA can nail your target concept while subtly distorting unrelated knowledge in the base model. It’s why your face LoRA sometimes wrecks the way the model draws hands.

Kohya_ss locally vs. cloud notebook

Both paths use the same underlying training scripts (kohya-ss/sd-scripts, which the Kohya_ss GUI wraps). The real difference is friction and total cost.

| Factor | Kohya_ss local | Cloud notebook |
| --- | --- | --- |
| Upfront cost | $0 if you own the GPU | A few dollars/hr for A6000 or A100 |
| VRAM floor (SDXL) | 10GB with tricks, 16GB sane | Whatever you rent (24GB+ easy) |
| Setup time | 30-90 min (Python, CUDA, deps) | 5 min, preconfigured templates |
| Iteration speed | Instant – your GPU sits there | Slowed by upload + spin-up |
| Best for | Repeat trainers, tuning loops | One-off projects, weak GPUs |

The deciding factor isn’t price – it’s how often you’ll train. One LoRA? Rent. Ten LoRAs over a year? Buy or borrow a GPU.

There’s also a question nobody phrases directly: how much does rank actually matter for your use case? A rank-32 LoRA trained on 20 clean images will almost always outperform a rank-128 LoRA trained on 60 blurry ones. The number is less important than the dataset. That said, rank and alpha interact in ways that aren’t obvious upfront – which is why the settings below explain the reasoning, not just the value.
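One piece of the rank/alpha interaction can be shown with arithmetic. In the kohya-ss/sd-scripts LoRA implementation, the learned update is multiplied by alpha divided by rank, so changing either number changes how strongly the LoRA is applied. The sketch below only computes that ratio; the function name is made up for illustration, and the rank/alpha pairs other than 32/16 are arbitrary examples.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Multiplier applied to the low-rank update (alpha divided by rank)."""
    return alpha / rank


# Example pairs; only rank 32 / alpha 16 comes from this guide.
for rank, alpha in [(32, 16), (32, 32), (128, 16)]:
    print(f"rank={rank:<4} alpha={alpha:<3} scale={lora_scale(rank, alpha)}")
```

Raising rank while leaving alpha untouched shrinks the multiplier, which is one reason a setting copied from a different-rank tutorial can behave differently than expected.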

The Kohya_ss walkthrough that actually works

Skipping the screenshots. Here’s the sequence that gets a working SDXL LoRA on the first try.

1. Dataset folder structure

Make a folder named 3_yourtoken. The number is the repeat count per epoch. Kohya reads the repeat rate (and optional class token) directly from the folder name – you don’t set it anywhere in the UI. Drop 15-25 images inside, each paired with a same-named .txt caption file.
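A quick pre-flight check can catch a misnamed folder or a missing caption before a run starts. The sketch below is a hypothetical helper, not part of Kohya_ss: it assumes the folder name has the form repeats_token described above, and the list of image extensions is an assumption you may need to adjust.

```python
from pathlib import Path

# Assumed set of image extensions; extend it if your dataset uses others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def inspect_dataset(folder: Path) -> dict:
    """Read repeats and token from the folder name and list uncaptioned images."""
    repeats_text, _, token = folder.name.partition("_")
    if not repeats_text.isdigit() or not token:
        raise ValueError(f"expected a name like 3_yourtoken, got {folder.name!r}")

    images = sorted(
        p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    # Each image needs a caption file with the same stem and a .txt suffix.
    missing = [p.name for p in images if not p.with_suffix(".txt").exists()]

    repeats = int(repeats_text)
    return {
        "repeats": repeats,
        "token": token,
        "images": len(images),
        "missing_captions": missing,
        "samples_per_epoch": repeats * len(images),
    }


# Usage (path is hypothetical):
# print(inspect_dataset(Path("dataset/3_yourtoken")))
```

With 20 images in 3_yourtoken, the report would show 60 samples per epoch, since each image is repeated three times.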

2. Captions – the trigger-word trick

Pick a rare token (skw or ohwx are popular choices) that the base model doesn’t already know. Then watch where you place it. According to the ViewComfy training guide, position is meaning: “photo of skw man wearing a suit” binds the token to the man, while “photo of a man wearing an skw suit” binds it to the suit. Most tutorials show captions but don’t explain why this matters.
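Since placement decides what the token attaches to, it can help to lint captions for consistency. The helper below is a rough heuristic written for this guide, not a Kohya feature: it simply reports the word that follows the trigger token, on the assumption that this is the word the token ends up describing.

```python
from typing import Optional


def word_after_trigger(caption: str, trigger: str) -> Optional[str]:
    """Return the word right after the trigger token, or None if there is none."""
    words = caption.lower().replace(",", " ").split()
    if trigger not in words:
        return None
    index = words.index(trigger)
    return words[index + 1] if index + 1 < len(words) else None


captions = [
    "photo of skw man wearing a suit",
    "photo of a man wearing an skw suit",
]
for caption in captions:
    print(f"{caption!r} -> token sits next to {word_after_trigger(caption, 'skw')!r}")
```

If a dataset meant to teach a person returns anything other than the same subject word for every caption, the stray captions are worth a second look.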

3. The settings that actually move quality

Base model: sd_xl_base_1.0_0.9vae.safetensors
Network rank (dim): 32
Network alpha: 16 # half of rank - smoother style transfer
Optimizer: AdamW8bit
Learning rate: 1e-4 (SDXL standard) or 3e-5 (conservative)
LR scheduler: cosine
Max resolution: 1024,1024
Mixed precision: bf16
Gradient checkpointing: ON
Cache latents to disk: true
Save every epoch: true

As of early 2025: setting alpha to half the rank (e.g., rank 32, alpha 16) is the preferred default for smoother style transfer without blowing out fine detail. AdamW8bit cuts VRAM 25-30% with negligible quality loss – use it on any GPU running 16GB or less (per the PropelRC Kohya guide). For each run, save a checkpoint every epoch and test epochs midway through instead of assuming the last one is best. The “goldilocks” version is usually 2-3 epochs before the model locks onto your subject and starts ignoring new prompts.
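For anyone running kohya-ss/sd-scripts directly rather than through the GUI, the settings above can be sketched as a command line. The snippet only assembles and prints the command; it does not launch training. The three paths are hypothetical placeholders, the epoch count is left out because this guide does not specify one, and flag names should be checked against the sd-scripts version you have installed.

```python
import shlex

# Hypothetical paths: replace with your own locations.
BASE_MODEL = "models/sd_xl_base_1.0_0.9vae.safetensors"
TRAIN_DATA_DIR = "dataset"  # parent folder that contains 3_yourtoken
OUTPUT_DIR = "output"

args = [
    "accelerate", "launch", "sdxl_train_network.py",
    f"--pretrained_model_name_or_path={BASE_MODEL}",
    f"--train_data_dir={TRAIN_DATA_DIR}",
    f"--output_dir={OUTPUT_DIR}",
    "--network_module=networks.lora",
    "--network_dim=32",            # network rank
    "--network_alpha=16",          # half of rank
    "--optimizer_type=AdamW8bit",
    "--learning_rate=1e-4",
    "--lr_scheduler=cosine",
    "--resolution=1024,1024",
    "--mixed_precision=bf16",
    "--gradient_checkpointing",
    "--cache_latents",
    "--cache_latents_to_disk",
    "--save_every_n_epochs=1",     # keep every epoch for later comparison
    "--save_model_as=safetensors",
]

# Print a copy-pasteable command instead of running it.
print(shlex.join(args))
```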

4. The 10GB VRAM trick

If you have 10-12GB and SDXL keeps running out of memory, enable the fused backward pass. As of Kohya_ss 0.9.0+, this integrates the optimizer’s backward and step operations into a single pass – dropping SDXL training memory from roughly 24GB to about 10GB at bf16 precision. The catch: it only works with the Adafactor optimizer and PyTorch 2.1+. AdamW8bit users get nothing from enabling it, which is the single most-skipped detail in beginner tutorials covering this setting.
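The two prerequisites are easy to forget, so here is a small sanity check that encodes them. It is a hypothetical helper written for this guide: the function names are made up, and the rules it enforces are just the two stated above (Adafactor only, PyTorch 2.1 or newer).

```python
from typing import List, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '2.1.0+cu121' into (2, 1)."""
    core = version.split("+")[0]
    major, minor = core.split(".")[:2]
    return int(major), int(minor)


def fused_backward_pass_problems(optimizer: str, torch_version: str) -> List[str]:
    """List the reasons the fused backward pass would not help this setup."""
    problems = []
    if optimizer.lower() != "adafactor":
        problems.append(f"optimizer is {optimizer}, but Adafactor is required")
    if parse_major_minor(torch_version) < (2, 1):
        problems.append(f"PyTorch {torch_version} is older than 2.1")
    return problems


print(fused_backward_pass_problems("AdamW8bit", "2.0.1"))
print(fused_backward_pass_problems("Adafactor", "2.3.1+cu121"))
```

An empty list means both conditions are met; in a real environment you could pass `torch.__version__` as the second argument.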

Edge cases tutorials don’t tell you about

Concept bleed from the wrong base model

Train on top of Pony Diffusion or Juggernaut instead of vanilla SDXL 1.0, and your LoRA inherits their biases. The vife.ai 2025 guide flags this specifically: training on highly stylized custom checkpoints causes concept bleeding – the LoRA picks up the base’s style as though it were part of your subject. Always train against the official base unless you specifically want that style baked in.

The 16GB-isn’t-actually-enough trap

Even 16GB VRAM can’t train a rank-16 SDXL LoRA at batch size 2 without gradient checkpointing enabled. Disable it and memory spills into RAM, pushing total occupied memory past 20GB (reported in kohya_ss GitHub discussion #2594). The “minimum VRAM” numbers in most guides assume every optimization is already on. If you fork a tutorial that says “works on 16GB,” check whether gradient checkpointing is in their config.

Flux LoRA training

Different problem space entirely. Flux LoRA training ideally wants 24GB VRAM; it uses a T5xxl text encoder, and training that encoder is heavy enough that most users freeze T5 and train only the transformer (vife.ai 2025 guide). The Kohya settings above don’t translate cleanly to Flux – use FluxGym instead.

What the research actually says about LoRA quality

Databricks researchers found – and this is worth sitting with – that LoRA substantially underperforms full fine-tuning when used at commonly recommended low-rank settings. In continued pretraining, the gap doesn’t close even at higher ranks (Biderman et al., “LoRA Learns Less and Forgets Less,” arXiv:2405.09673). For image diffusion the stakes are lower than benchmark scores, but it explains something concrete: why your LoRA sometimes forgets how to draw objects that weren’t in your training set. That’s not a settings problem. It’s a structural property of low-rank adaptation.

How to know your LoRA is done

Load every saved epoch in AUTOMATIC1111 or ComfyUI. Run the same prompt and seed across all of them. The keeper is the epoch where likeness is high but the model still follows new prompts: it changes the background, swaps clothing, and adjusts pose when asked. If clothing won’t change no matter what you prompt, you’ve overcooked it. Pick an earlier checkpoint.
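The comparison is less tedious if the prompts are generated rather than typed. The sketch below builds one AUTOMATIC1111-style prompt per checkpoint using the `<lora:name:weight>` syntax. The checkpoint filenames, the 0.8 weight, and the test prompt are assumptions for illustration; use whatever names your run actually produced, and keep the seed fixed in the UI.

```python
from pathlib import Path
from typing import Iterable, List


def comparison_prompts(
    checkpoint_names: Iterable[str], prompt: str, weight: float = 0.8
) -> List[str]:
    """Build one prompt per checkpoint so only the LoRA file changes between runs."""
    return [
        f"{prompt} <lora:{Path(name).stem}:{weight}>"
        for name in sorted(checkpoint_names)
    ]


# Hypothetical per-epoch files from a run with output name "yourtoken".
checkpoints = [
    "yourtoken-000002.safetensors",
    "yourtoken-000001.safetensors",
    "yourtoken.safetensors",
]

# A prompt that asks for something absent from the training set,
# to test whether the model still follows instructions.
for line in comparison_prompts(checkpoints, "photo of skw man on a beach, red jacket"):
    print(line)
```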

FAQ

Can I train a Stable Diffusion LoRA without a GPU?

Yes – rent one. Colab Pro, RunPod, and MassedCompute all expose A6000 or A100 instances by the hour. The free Colab tier sometimes works for SD 1.5 but rarely holds up for SDXL.

How many training images do I really need?

For a face: 15-25 well-varied shots beat 100 similar ones. The bottleneck is angle and lighting diversity, not count. For an art style it’s different – you need the style applied to many subjects so the model generalizes it rather than memorizing specific compositions. Around 30-60 images is the community sweet spot for styles. More images aren’t better if they all look the same; the model just learns the duplicates.

Why does my LoRA distort faces in the background of generated images?

That’s the intruder-dimension effect described earlier. Your LoRA shifted the model’s attention layers toward your trigger token even where it shouldn’t apply. Try lowering the LoRA weight at inference, or retrain with a lower alpha and more varied backgrounds in your dataset – both reduce how aggressively the LoRA overrides the base model’s behavior.

Your next step

Pick 20 images right now. Caption them with a rare trigger word in the position that matches what you want to bind. Put them in a folder called 3_yourtoken. That’s the entire prep – the settings above handle the rest.