Forge vs A1111: Why Most Benchmarks Miss the Real Story

Everyone touts Forge's speed gains, but after 6 months of testing both WebUIs, the differences that actually matter aren't what the charts show.

7 min read · Advanced

Nobody tells you this: Forge’s famous speed advantage disappears the moment you touch that GPU Weight slider wrong. I’ve watched users with RTX 4090s get slower generation times than budget cards. One setting – buried in the UI – was configured backward.

Six months testing both platforms. The real differences? Not in the benchmarks. They’re in the failure modes nobody documents.

The GPU Weight Trap (Or: How to Make a 4090 Perform Like a 1060)

Forge’s memory management uses a slider called GPU Weight. Balances VRAM between model storage and computation. Set it to 100%? You’ve allocated everything to storage – zero for calculations.

Result: “Low GPU VRAM Warning. Your current GPU free memory is 926.58 MB. This number is lower than the safe value of 1536.00 MB. If you continue, the speed may be extremely slow (about 10x slower).”

This warning shows up even on 24GB cards. Users panic, add --lowvram flags, and things get worse. The fix? Lower GPU Weight to 30-50%. That’s it. But the UI gives zero hints about the optimal range. Reducing the slider counterintuitively *increases* available VRAM for generation.

8GB VRAM: start at 30% GPU Weight. 12GB: try 50%. The slider is inverse to what you’d expect – lower values mean faster generation. You’re freeing VRAM for computation, not hoarding it for model caching.
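The tradeoff can be sketched with back-of-envelope arithmetic. This is an illustrative model only, not Forge's actual allocator – the function name and the simple linear split are assumptions:

```python
def compute_budget_mb(total_vram_mb: int, gpu_weight_pct: float) -> float:
    """Illustrative only: treat the GPU Weight slider as reserving a
    fraction of VRAM for model storage; the remainder is what's left
    for the actual sampling computation."""
    reserved_for_weights = total_vram_mb * gpu_weight_pct / 100
    return total_vram_mb - reserved_for_weights

# Lower slider values leave more headroom for generation:
print(compute_budget_mb(8192, 100))  # 0.0 MB free -> the "Low GPU VRAM" warning
print(compute_budget_mb(8192, 30))   # ~5734 MB free for computation
```

Crude as it is, the sketch captures why the slider feels backward: maxing it out starves the sampler, not the model.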

What you already know: Stable Diffusion WebUI Forge is a fork of AUTOMATIC1111 built for speed. Based on SD-WebUI 1.10.1 (as of 2025), promises 30-75% faster inference depending on VRAM. Memory optimization drops peak usage 700MB-1.3GB. Run SDXL on 4GB cards, SD1.5 on 2GB – no flags required.

Every tutorial repeats these numbers. What they skip: when the numbers lie.

Extension Compatibility: The 80% Rule Nobody Explains

| Extension Type | A1111 Status | Forge Status | Actual Issue |
| --- | --- | --- | --- |
| ControlNet (basic) | Full support | Integrated | Batch functions broken in Forge |
| Regional Prompter | Works (ugly UI) | Forge-Couple alternative | Different workflow required |
| Style-Align | Functional | Bugged with ControlNet | Incompatible when combined |
| Dynamic Prompts | Stable | Stable | Works in both |

Forge supports about 80% of A1111’s extensions (as of 2025). The missing 20%? Breaks silently. ControlNet’s newer features – batch operations – don’t exist in Forge’s integrated version. Style-Align crashes when you layer it with ControlNet. Not documented anywhere official.

Your workflow depends on stacking multiple ControlNet units with advanced preprocessors? A1111 still wins. Forge’s integrated ControlNet: faster but feature-incomplete compared to the standalone extension A1111 uses.

Speed Tests: Where the Benchmarks Actually Hold Up

RTX 3060 (12GB VRAM), SDXL 1024×1024, 20 steps:

  • A1111: 35 seconds, 10.2GB VRAM peak
  • Forge: 24 seconds, 8.9GB VRAM peak
  • Forge (wrong GPU Weight): 3 minutes 40 seconds, constant VRAM thrashing

The 45% speed gain is real *if configured correctly*. Low VRAM cards (6GB)? The difference becomes more dramatic – 75% isn’t hype. But community reports from April 2024 show A1111 catching up in raw speed for standard generation. The performance gap has narrowed as A1111 absorbed some of Forge’s optimizations.

Think of it like this: Forge is a sports car with a manual transmission. Configured right, it’s faster. Set one gear wrong, and you’re slower than the automatic (A1111). Most speed tests assume perfect configuration – real usage doesn’t.

Where Forge dominates: model switching, batch size scaling, avoiding out-of-memory crashes. A1111 leaks memory when you swap checkpoints mid-session. Forge handles it cleanly through better VRAM offloading.

The Post-July 2024 Mess

Forge development took a sharp turn after commit a9e0c38 (July 22, 2024). Backend shifted to experimental Gradio 4 architecture. Cloud GPU compatibility broke. Generation speed regressed for some setups – users reported Forge becoming *slower* than A1111.

The original developer (lllyasviel) explicitly stated Forge is for “experimentation.” Stability took a backseat to testing new features for Flux, GGUF quantization, and Unet Patcher v2. For production workflows? Chaos.

Two major forks emerged. reForge (Panchovix) focuses on stability with older hardware. Forge Neo (Haoming02) continues the Gradio 4 path with expanded model support (Flux, Qwen, WAN). Both are actively maintained as of March 2025. Original Forge updates sporadically.

Installing Forge today means choosing between three branches. Original: moving target. The forks: more predictable behavior.

When A1111 Actually Wins

Extension dependency. You rely on Regional Prompter’s full feature set? Multiple ControlNet units with advanced settings? Extensions that haven’t been ported to Forge? Stick with A1111.

Reproducibility matters. Forge and A1111 generate *different images* from identical seeds and prompts. Backend differences. If you need deterministic output across platforms, pick one and stay there.

High VRAM setup. 24GB VRAM? Forge’s advantage drops to ~5% (as of 2025). A1111’s mature ecosystem and extension compatibility may outweigh a 5% speed gain.

AMD GPUs. Forge’s AMD support (via a DirectML fork) requires manual fixes as of 2025: an RNG CPU/GPU toggle, file edits, and settings that break with updates. A1111’s DirectML fork is more stable, but expect manual intervention either way.

A1111: slower but predictable. Forge: faster but fragile in specific configurations.

Common Pitfalls (That Waste Hours)

Mixing environments. Don’t share the venv folder between A1111 and Forge – different Python dependencies. Share models, LoRAs, and embeddings via symlinks or command-line args (--ckpt-dir, --lora-dir). Keep virtual environments isolated.
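One way to share a model library without copying it is a symlink. A minimal sketch – the paths and function name are illustrative, and on Windows creating symlinks needs Developer Mode or an elevated shell:

```python
import os

def link_models(a1111_dir: str, forge_dir: str) -> None:
    """Replace Forge's (empty) model folder with a symlink into the
    A1111 model library, so both UIs read the same checkpoints."""
    if os.path.islink(forge_dir):
        return  # already linked
    if os.path.isdir(forge_dir) and not os.listdir(forge_dir):
        os.rmdir(forge_dir)  # only remove the folder if it's empty
    # raises FileExistsError if forge_dir is non-empty - move models out first
    os.symlink(a1111_dir, forge_dir, target_is_directory=True)

# Example (hypothetical paths):
# link_models("C:/stable-diffusion-webui/models/Stable-diffusion",
#             "C:/forge/models/Stable-diffusion")
```

The command-line-args route (shown in the installation section's webui-user.bat example) achieves the same sharing without touching the filesystem.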

“Never OOM” ≠ infinite VRAM. Forge’s fallback (tiled VAE) prevents crashes but slows generation. Hitting tiled VAE constantly? You’re pushing limits – reduce resolution or batch size.

Commit hashes matter. Forge updates can break working setups. Something worked yesterday, fails today? Roll back: git checkout a9e0c38 (last stable pre-Gradio 4 commit). Track which commit you’re running.
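A minimal way to track it is to record the hash at launch – a sketch using the standard `git rev-parse`; the function name and logging pattern are mine:

```python
import subprocess

def current_commit(repo_dir: str) -> str:
    """Return the full commit hash the checkout in repo_dir is running,
    so a working configuration can be pinned and rolled back to later."""
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    ).strip()

# e.g. log it once per session:
# print("webui commit:", current_commit("C:/forge"))
```

Keeping that hash next to your generation logs turns "it worked last week" into a concrete `git checkout` target.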

Performance Reality Check

A6000, SDXL: ComfyUI hits 5.35 it/s. Forge 4.9 it/s. A1111 3.63 it/s. ComfyUI is still fastest for pure throughput. Forge sits in the middle – faster than A1111, more accessible than ComfyUI’s node-based workflow.
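Iterations-per-second figures translate into wall-clock time per image as steps divided by it/s – a trivial conversion that ignores VAE decode and other overhead:

```python
def seconds_per_image(it_per_s: float, steps: int = 20) -> float:
    """Convert sampler throughput (iterations/second) into rough
    wall-clock seconds for one image; VAE decode not included."""
    return steps / it_per_s

# The A6000 SDXL numbers above, at 20 steps:
for name, speed in [("ComfyUI", 5.35), ("Forge", 4.9), ("A1111", 3.63)]:
    print(f"{name}: {seconds_per_image(speed):.1f}s per image")
```

Per image, the spread is under two seconds – which is why workflow friction, not raw throughput, usually decides the question.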

Speed tests ignore workflow friction. Forge’s model swapping: smooth. A1111’s extension ecosystem: richer. ComfyUI’s learning curve: steep. The “fastest” tool depends on your bottleneck – generation time, setup complexity, or feature availability.

Installation Notes (The Parts Tutorials Skip)

One-click installers (CUDA 12.1 + PyTorch 2.3.1) exist for both. Forge’s is about a 10GB initial download. For shared model directories, edit webui-user.bat:

set A1111_HOME=C:/path/to/stable-diffusion-webui
set COMMANDLINE_ARGS=--ckpt-dir %A1111_HOME%/models/Stable-diffusion --lora-dir %A1111_HOME%/models/Lora

Don’t set VENV_DIR – let Forge create its own. Mixing package versions between platforms causes silent failures.

What the Marketing Doesn’t Cover

Forge’s Unet Patcher is new – implements advanced techniques (Self-Attention Guidance, Kohya High Res Fix) in ~100 lines of code vs. A1111’s complex monkey-patching. For developers, this matters. For users, features like IP Adapter masking and Forge-Couple work smoother.

The developer explicitly says Forge is experimental. Not a drop-in A1111 replacement. Testbed that sometimes sacrifices stability for new capabilities.

When NOT to Switch

Current setup works? Don’t fix it. Forge’s benefits are small unless you’re VRAM-constrained (under 10GB) or constantly hitting OOM errors. The switch costs time – learning new quirks, reconfiguring extensions, troubleshooting compatibility.

Need reproducibility across platforms? Stay on one. Forge and A1111 aren’t deterministic twins – same seed produces different images.

AMD or Mac? Test thoroughly before committing. The experience is rougher than NVIDIA CUDA setups.

So which one should you actually use? Depends on what breaks your workflow more often – slow generation or compatibility issues. Speed benchmarks won’t tell you that.

FAQ

Can I run both Forge and A1111 on the same machine?

Yes. Separate directories. Let each create its own venv. Share models via command-line args or symlinks – never copy the entire /models folder. Keep environments isolated.

Why does Forge generate different images than A1111 with the same seed?

Backend differences in how attention mechanisms, VAE decoding, and memory management work. Optimizations change computation order slightly. This breaks determinism across platforms. Expected behavior, not a bug. I found this out the hard way when trying to reproduce a client’s A1111 output in Forge – spent 2 hours debugging before realizing it’s by design. Same seed, same prompt, different platform = different image. Use the same platform for reproducible results.

What’s the actual difference between Forge, reForge, and Forge Neo?

Original Forge (lllyasviel): experimental, updates sporadically, prioritizes new features over stability. reForge (Panchovix): stable operation with older hardware and lower VRAM. Forge Neo (Haoming02): continues Gradio 4 path with expanded model support (Flux, Qwen, WAN 2.2). Pick based on priority – latest (original), stability (reForge), specific model support (Neo). All three actively maintained as of March 2025.