The end state you’re after: a folder with 200 images on disk, each one a clean variation on a single concept, generated overnight while you slept, with metadata you can re-trace. No babysitting. No clicking Generate 200 times. That’s what batch generation in Stable Diffusion is actually for, and the difference between doing it well and doing it badly is a few specific settings.
This guide walks backward from that finished folder through three pipelines – A1111’s prompts from file script, ComfyUI’s batch nodes, and the diffusers Python API – and stops to flag the gotchas that ruin overnight runs. Every number below comes from official docs or reproducible community reports; benchmarks are dated where possible since GPU driver updates and model loading changes can shift these figures.
## The two knobs everyone gets wrong: batch size vs batch count
Batch size = parallel. Batch count = sequential. That’s the whole story, but the consequences of mixing them up are expensive. Batch size generates multiple images simultaneously using more VRAM; batch count runs each iteration one after another and frees GPU memory in between. Set batch size to 8 on a 6GB card at 768×768 and you’ll hit CUDA OOM before the first denoising step finishes.
With batch count, memory is freed and reused after each iteration, so your VRAM usage doesn’t grow with the count. With batch size, every image lives in VRAM simultaneously – so your ceiling is hardware, not patience.
| Setting | Parallel? | VRAM scales with it? | Best for |
|---|---|---|---|
| Batch size | Yes | Yes | Speed when you have VRAM headroom |
| Batch count | No | No | Big overnight runs on small GPUs |
Speed gap: real, but conditional. On a 3090 (as of community testing in 2023), generating 64×64 images at batch count 8 took 45.4 seconds; batch size 8 took 5.9 seconds – 7.6× faster, per A1111 discussion #14784. But that assumes the GPU compute unit isn’t already maxed out. A 2080Ti user in discussion #8930 reported doubling batch size from 8 to 16 produced a straightforward 2× time increase with zero parallelization gain – GPU memory had room to spare; the compute unit didn’t.
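The scaling claim is easy to sanity-check from tensor shapes. A back-of-the-envelope sketch in Python, assuming Stable Diffusion's standard 4-channel latent at 1/8 spatial resolution and fp16 storage; treat the result as a floor, not a prediction, since UNet activations dominate real VRAM usage:

```python
def latent_mb(batch_size: int, width: int, height: int,
              channels: int = 4, bytes_per_el: int = 2) -> float:
    """Size of the latent tensor alone, in MB.

    SD's VAE downsamples by 8x in each spatial dimension;
    fp16 uses 2 bytes per element. UNet activations add far
    more on top of this, so it is a lower bound only.
    """
    elems = batch_size * channels * (width // 8) * (height // 8)
    return elems * bytes_per_el / 1024**2

# Batch size scales the latent linearly; batch count does not,
# because each sequential iteration reuses the same memory.
print(latent_mb(1, 768, 768))  # one image in flight
print(latent_mb(8, 768, 768))  # eight in flight: 8x the latent
```

The linear factor is the whole point of the table above: batch size multiplies every live tensor, batch count multiplies only your wait.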
## Pipeline 1: A1111 with prompts from file (the real batch tool)
Most tutorials show you the batch size slider and stop. That’s not real batching – it runs one prompt N times. Real batching means different prompts per image, with different settings per line.
The built-in Prompts from file or textbox script does exactly that. Drop down the Script menu at the bottom of txt2img, pick it, and paste lines like this:
--prompt "a red fox in a snowy forest, cinematic" --steps 30 --width 768 --height 512 --seed 1
--prompt "a blue heron at dawn, oil painting" --steps 25 --width 768 --height 512 --seed 2
--prompt "a barn owl on a fencepost, photoreal" --steps 35 --cfg_scale 8 --seed 3
Each line is one generation with its own settings. The A1111 features wiki lists the supported per-line flags: `--prompt`, `--negative_prompt`, `--steps`, `--cfg_scale`, `--width`, `--height`, `--batch_size`, `--n_iter`, `--seed`, `--sampler_name`, and others. That’s enough to drive a queue of hundreds of distinct generations from one text file.
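Writing hundreds of those lines by hand doesn’t scale. A few lines of Python can emit the whole file from subject and style lists; a sketch, with flag names taken from the wiki list above (double-check them against your A1111 version):

```python
subjects = ["a red fox in a snowy forest", "a blue heron at dawn",
            "a barn owl on a fencepost"]
styles = ["cinematic", "oil painting", "photoreal"]

lines = []
seed = 1
for subject in subjects:
    for style in styles:
        # One generation per line; a fixed --seed keeps each re-traceable.
        lines.append(f'--prompt "{subject}, {style}" '
                     f'--steps 30 --width 768 --height 512 --seed {seed}')
        seed += 1

with open("batch_prompts.txt", "w") as f:
    f.write("\n".join(lines))
```

Paste the resulting file into the script’s textbox, or point the script at it directly.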
Pro tip: Want variation around one base prompt without writing every line by hand? sd-dynamic-prompts adds wildcard syntax, so `{red|blue|green} fox` expands into three prompts automatically – useful when you want 50 outputs that differ in one slot.
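The combinatorial idea behind that expansion fits in a few lines. A minimal approximation in Python – not the extension’s actual code, just the core mechanism:

```python
import itertools
import re

def expand(template: str) -> list[str]:
    """Expand every {a|b|c} group into all combinations,
    mimicking the variant syntax of sd-dynamic-prompts."""
    parts = re.split(r"\{([^{}]*)\}", template)
    # re.split with a capture group: odd indices hold the {...} contents.
    choices = [p.split("|") if i % 2 else [p]
               for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]

print(expand("{red|blue|green} fox"))
# → ['red fox', 'blue fox', 'green fox']
```

Two groups multiply: `{a|b} {x|y}` yields four prompts, which is how one template line becomes a 50-output sweep.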
## Pipeline 2: ComfyUI
Batching in ComfyUI lives inside the graph itself. The Empty Latent Image node has a batch_size field – set it to 10, click Run, get 10 images. That’s the simple case, per the ComfyUI docs.
ControlNet or high-definition restoration workflows don’t start from an empty latent. The Latent From Batch node handles that – it splits a batched latent so the rest of your graph processes each item independently. For programmatic batches, ComfyUI also exposes an HTTP API: save your workflow in API format (Settings → Enable Dev mode options), then POST it in a loop with mutated seed values. That’s how cloud render farms drive thousands of generations per hour.
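The POST-in-a-loop pattern is mostly dictionary surgery: load the API-format JSON once, swap the sampler’s seed, and submit to ComfyUI’s `/prompt` endpoint. A standard-library sketch; the node id `"3"` and the host are assumptions that depend on your exported workflow:

```python
import copy
import json
import urllib.request

def with_seed(workflow: dict, node_id: str, seed: int) -> dict:
    """Return a copy of an API-format workflow with one node's seed changed."""
    wf = copy.deepcopy(workflow)
    wf[node_id]["inputs"]["seed"] = seed
    return wf

def submit(workflow: dict, host: str = "http://127.0.0.1:8188") -> None:
    # ComfyUI queues prompts POSTed to /prompt as {"prompt": <workflow>}.
    data = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(f"{host}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Usage (hypothetical node id "3" = your KSampler):
# workflow = json.load(open("workflow_api.json"))
# for seed in range(1000, 1100):
#     submit(with_seed(workflow, "3", seed))
```

Note that `with_seed` copies rather than mutates, so the loaded workflow stays pristine between iterations.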
## Pipeline 3: the diffusers API
Building a dataset or running A/B prompt tests? Neither GUI is the right tool.
```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # verify current checkpoint on HF Hub
    torch_dtype=torch.float16,
).to("cuda")
pipe.enable_attention_slicing()

prompts = ["a red fox", "a blue heron", "a barn owl"] * 10
images = pipe(prompts, num_inference_steps=25, guidance_scale=7.5).images
for i, img in enumerate(images):
    img.save(f"out_{i:03d}.png")
```
Numbers from the Hugging Face diffusers docs (measured on a T4, as of v0.17.1 – current performance may differ): fp16 plus attention slicing gets a batch of 8 images to roughly 3.5 seconds per image. The guidance_scale default of 7.5 lands in the 7-8.5 range the docs recommend as a starting point. Stable Diffusion can process batches efficiently because, as the Hugging Face SD blog explains, diffusion runs in a compressed latent space rather than pixel space – which is why VRAM grows with batch size but not catastrophically fast.
## Three things that quietly kill overnight runs
All three are documented. Almost none of the tutorials mention them.
- The img2img batch-size-4 VRAM bug. A confirmed A1111 bug (issue #11174, still present as of mid-2023): img2img with batch size ≥4 consumes all available VRAM, even on 24GB cards. Workaround is to keep batch size below 4 and use batch count instead. If your overnight run mixes txt2img and img2img and you set batch size = 4 across the board, the img2img portion will OOM. Split your queue.
- LoRA only applies to the first prompt in a batch. The official wiki is explicit: a batch with multiple different prompts uses only the LoRA from the first prompt. This fails silently – no warning, no error. If you’re driving 100 prompts through Prompts from file and each line has its own `<lora:xxx:1>`, only the first one actually fires. Fix: keep one LoRA per batch run, or switch to ComfyUI, where each node has its own Load LoRA.
- You can’t cleanly reproduce a single image from a batch. Discussion #9476 confirms it: regenerating with the same seed but batch size 1 does not match the output from batch size 4 at that position. The image is close, not identical. If image #3 from your run is the keeper, save the PNG with embedded metadata – re-running the whole batch at the same batch size is the only way to get it back exactly.
Worth sitting with for a moment: the promise of overnight batch generation is automation, but these three bugs mean you can lose a night of compute to a silent failure. Fifteen minutes of test runs – 10 prompts, check VRAM, check LoRA output, spot-check a seed – is a better investment than discovering at 7am that 200 images rendered without the style you wanted.
## Which pipeline fits your actual use case?
50-500 varied prompts with mild setting differences? A1111’s prompts-from-file is the fastest path – no setup overhead, runs tonight. ControlNet-driven batches or multi-LoRA work where the same graph applies with one variable changing? ComfyUI earns its learning curve there. Dataset generation, automated pipelines, or anything beyond a few thousand images? The diffusers API plus a for-loop beats both UIs – you get checkpointing, error handling, and no GUI overhead.
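“Checkpointing and error handling” in a for-loop reduces to two habits: skip outputs that already exist, and don’t let one bad prompt end the night. A minimal sketch of that loop shape; `render` is a hypothetical stand-in for your real `pipe(...)` call:

```python
import os

def render(prompt: str, seed: int) -> bytes:
    # Stand-in for pipe(prompt, generator=...).images[0] -- replace
    # with your actual diffusers call and PNG encoding.
    return f"{prompt}:{seed}".encode()

def run_batch(prompts: list[str], out_dir: str = "out") -> int:
    """Render each prompt to a numbered file, skipping existing
    outputs (free resume) and surviving per-image failures."""
    os.makedirs(out_dir, exist_ok=True)
    done = 0
    for i, prompt in enumerate(prompts):
        path = os.path.join(out_dir, f"{i:04d}.png")
        if os.path.exists(path):   # checkpointing: rerun = resume
            continue
        try:
            data = render(prompt, seed=i)
        except Exception as e:     # one failure shouldn't kill the run
            print(f"skip {i}: {e}")
            continue
        with open(path, "wb") as f:
            f.write(data)
        done += 1
    return done
```

Kill the process at 3am, restart it, and it picks up where it stopped – the property neither GUI gives you for free.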
The GUI tools are built for exploration. The API is built for production. When you find yourself fighting either GUI to do something in bulk, that’s the signal.
## FAQ
### What’s the optimal batch size for an RTX 3060 with 12GB?
At 512×512, batch size 4 works. Push to 768×768 and drop to 2. Latent diffusion keeps VRAM manageable because it operates in compressed latent space, but the scaling is still real – when in doubt, go lower on batch size and higher on batch count.
### Can I queue different checkpoints in a single batch?
Not with A1111’s prompts-from-file script – checkpoint isn’t one of the per-line flags. The practical workaround most people miss: run two separate batch files back-to-back using a shell script with &&, so the second run starts automatically when the first finishes. No babysitting required. ComfyUI handles checkpoint-switching natively via multiple Load Checkpoint nodes routed through a switch; the diffusers API just means swapping pipe in your loop.
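The same back-to-back chaining works from Python if you’d rather not write shell; `check=True` stops the chain on a non-zero exit, mirroring `&&`. The `generate.py` script names are placeholders:

```python
import subprocess
import sys

def run_chain(runs: list[list[str]]) -> int:
    """Run batch commands sequentially, stopping at the first
    failure (like `run1 && run2`). Returns how many completed."""
    for i, cmd in enumerate(runs):
        try:
            subprocess.run(cmd, check=True)
        except subprocess.CalledProcessError:
            return i
    return len(runs)

# Hypothetical overnight queue, one checkpoint per batch file:
# run_chain([
#     [sys.executable, "generate.py", "--prompts", "batch_a.txt"],
#     [sys.executable, "generate.py", "--prompts", "batch_b.txt"],
# ])
```

Stopping on failure is deliberate: if the first checkpoint’s run OOMs, you want to know at 7am, not render the second run on top of a broken night.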
### Why does my batch produce nearly identical images even with different seeds?
CFG too high is usually the culprit – values above 12 collapse the model toward one interpretation of the prompt, and seed variation barely matters at that point. Drop to 7.5 (the default) and run a quick 4-image test. If images still converge, the prompt itself is over-constrained. Remove adjectives one at a time until variance comes back. Seeds are not magic randomness; they’re starting noise, and a domineering prompt overrides them.
Next: open A1111, drop the Script selector to Prompts from file or textbox, paste 10 test lines with different --seed values, and time the run. That’s your baseline – every optimization in this guide builds on top of it.