You’ve written the perfect prompt. Stable Diffusion gives you a beautiful image – but the character’s pose is wrong, the composition doesn’t match your sketch, and the background depth feels flat. Prompts alone can’t solve spatial problems.
That’s the gap ControlNet fills. It lets you guide Stable Diffusion with reference images – poses, edges, depth maps, sketches – so the model generates what you actually need, not what it guesses from text.
What You Actually Need to Run ControlNet
ControlNet isn’t a standalone tool. It’s an add-on that attaches to Stable Diffusion through an interface like AUTOMATIC1111 WebUI or ComfyUI. Here’s what you need before starting.
Base requirements:
- Stable Diffusion already installed (A1111 WebUI, ComfyUI, or Forge)
- A GPU with at least 6GB VRAM (8GB+ recommended for SD 1.5 + ControlNet)
- The ControlNet extension installed in your interface
- ControlNet model files downloaded and placed in the correct folder
For A1111 users: Go to Extensions → Install from URL → paste https://github.com/Mikubill/sd-webui-controlnet → Install → Restart UI. ComfyUI has native support – no extension needed.
The Model Version Trap
Here’s where most tutorials stop, and where most users get stuck. You’ve installed ControlNet. You’ve downloaded models. You upload a reference image, click Generate, and… nothing. The output ignores your control image entirely.
The problem? Model version mismatch. ControlNet models are trained for specific Stable Diffusion versions – SD 1.5, SD 2.1, SDXL. If you’re running SD 1.5 but loaded an SDXL ControlNet model, it won’t work. And it won’t tell you. No error message. Just silent failure.
Check this before blaming your settings: Open your A1111 interface → top left dropdown → note which SD checkpoint you’re using (look for “sd15” or “sdxl” in the filename). Then check your ControlNet model names. SD 1.5 models usually have “sd15” in the filename. SDXL models say “sdxl.” They’re not interchangeable.
According to the official model wiki, SD 1.5 and SD 2.0 ControlNet models ARE compatible with each other – but nothing else crosses over. If you’re on SDXL, you need SDXL ControlNet models. Period.
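The filename check above can be sketched as a small heuristic. This is a guess based on naming conventions only – the family tokens ("sd15", "sdxl", "v1-5") are common but not guaranteed, so treat "unknown" as "go read the model card":

```python
def model_family(filename: str) -> str:
    """Guess the Stable Diffusion family from a model filename (heuristic only)."""
    name = filename.lower()
    if "sdxl" in name or "xl" in name.split("_"):
        return "sdxl"
    if "sd21" in name or "sd2" in name:
        return "sd2"
    if "sd15" in name or "v1-5" in name:
        return "sd15"
    return "unknown"

def compatible(checkpoint: str, controlnet: str) -> bool:
    """True if the two files look compatible; False if mismatched or unguessable."""
    ckpt, cn = model_family(checkpoint), model_family(controlnet)
    if "unknown" in (ckpt, cn):
        return False  # can't tell from the name -- check the model card instead
    if "sdxl" in (ckpt, cn):
        return ckpt == cn  # SDXL only pairs with SDXL
    return True  # per the model wiki, SD 1.5 and SD 2.x ControlNets cross over
```

For example, `compatible("v1-5-pruned-emaonly.safetensors", "control_v11p_sd15_canny.pth")` passes, while pairing that same checkpoint with an "sdxl" ControlNet fails.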
Where to Actually Download Models That Work
ControlNet models come in three sizes, per the official documentation: LARGE (1.45 GB original), MEDIUM (723 MB fp16), and SMALL (136 MB LoRA). For most users, MEDIUM is the sweet spot – half the size, same quality.
Download locations:
- HuggingFace: lllyasviel/ControlNet-v1-1 (SD 1.5 models, v1.1)
- Civitai: ControlNet 1.1 safetensors (pruned, smaller files)
- For SDXL: Search Civitai for “controlnet sdxl” – community models only, no official release yet as of early 2025
After downloading, place the .safetensors or .pth files in:
- stable-diffusion-webui/extensions/sd-webui-controlnet/models (A1111)
- ComfyUI/models/controlnet (ComfyUI)
Some models require a .yaml config file with the same name. Download it if available and place it next to the model file.
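A quick way to audit the folder is a short script like this sketch. The function name is illustrative, and a missing .yaml is only a problem for models that ship one – hence the soft wording in the report:

```python
from pathlib import Path

def check_models_folder(folder: str) -> dict:
    """List model files in a ControlNet models folder and flag any that lack
    a same-named .yaml config sitting next to them."""
    report = {}
    for model in sorted(Path(folder).glob("*")):
        if model.suffix not in (".safetensors", ".pth"):
            continue  # skip configs, readmes, etc.
        has_yaml = model.with_suffix(".yaml").exists()
        report[model.name] = "ok" if has_yaml else "no .yaml (fine for many models)"
    return report
```

Point it at `stable-diffusion-webui/extensions/sd-webui-controlnet/models` (A1111) or `ComfyUI/models/controlnet` (ComfyUI) to see what's actually installed.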
Using ControlNet: The Workflow That Matters
Open your Stable Diffusion interface. Scroll to the ControlNet section (usually below the prompt box). You’ll see a collapsed panel – expand it. Here’s the process that actually works.
Step 1: Upload your reference image. Click the image canvas and select your control image – the pose reference, the sketch, the depth map, whatever you’re using to guide the generation. This is NOT the image you want to generate. It’s the spatial guide.
Step 2: Check the “Enable” box. ControlNet won’t activate unless this is checked. Obvious, but easy to miss.
Step 3: Choose preprocessor and model. This is where the magic – and the confusion – happens. The preprocessor extracts the control map from your reference image (edges, pose skeleton, depth, etc.). The model uses that map to guide Stable Diffusion.
Pro tip: Since ControlNet 1.1, preprocessor and model names match automatically. If you select “canny” as the preprocessor, choose a model with “canny” in the name. If you select “openpose,” pick a model with “openpose” in it. This was confirmed by the ControlNet author himself – no more memorization needed.
Step 4: Set your prompt. Yes, you still write a text prompt. ControlNet controls spatial layout. Your prompt controls style, content, and details.
Step 5: Generate. Click the Generate button. If everything’s set correctly, your output will follow the spatial structure of your reference image while matching your text prompt.
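The five steps boil down to a short checklist, and the silent-failure modes all live in steps 2 and 3. This hypothetical validator mirrors that checklist – the function and parameter names are illustrative, not A1111's actual API:

```python
def validate_controlnet_unit(enabled: bool, model: str, weight: float,
                             preprocessor: str) -> list:
    """Return the reasons a ControlNet unit would silently do nothing."""
    problems = []
    if not enabled:
        problems.append("Enable checkbox is unchecked")
    if model in ("", "None"):
        problems.append("no ControlNet model selected")
    if weight == 0:
        problems.append("Control Weight is 0")
    # Since ControlNet 1.1, preprocessor and model names should match (canny/canny)
    if preprocessor != "none" and preprocessor.split("_")[0] not in model:
        problems.append(f"preprocessor '{preprocessor}' doesn't match model '{model}'")
    return problems
```

A correctly configured unit – enabled, canny preprocessor, canny model, weight 1.0 – returns an empty list; everything else tells you what to fix.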
Settings That Break Things (and How to Fix Them)
Most ControlNet tutorials list every setting. I’m only covering the ones that cause actual problems.
Pixel Perfect vs Manual Resolution
The “Pixel Perfect” checkbox auto-calculates preprocessor resolution to match your generation size. Sounds great. Problem: if you ALSO manually set the “Preprocessor Resolution” slider, they conflict. The result? Distorted control maps that don’t align with your output.
Fix: Pick one. Either enable Pixel Perfect and ignore the resolution slider, or disable Pixel Perfect and set resolution manually. Don’t use both. As of early 2025, the A1111 interface doesn’t warn you about this.
Control Weight vs Control Mode
Control Weight (slider from 0 to 2, default 1.0) controls how strongly the control map influences the output. Higher weight = output follows the control map more closely. That part’s simple.
Control Mode is where it gets weird. Three options: “Balanced,” “My prompt is more important,” and “ControlNet is more important.” These sound like they just shift emphasis. They don’t. According to community testing and GitHub discussions, “ControlNet is more important” multiplies ControlNet influence by your CFG scale.
Translation: If your CFG scale is 7 and you select “ControlNet is more important,” ControlNet becomes 7 times stronger – on top of whatever you set the Control Weight to. This is NOT the same as just increasing Control Weight. If your control map is overpowering your prompt, drop Control Mode to Balanced before touching the weight slider.
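The arithmetic is worth making explicit. This is a simplification for intuition, built on the community-reported behavior above – not ControlNet's actual internal math:

```python
def effective_strength(control_weight: float, cfg_scale: float, mode: str) -> float:
    """Rough effective ControlNet influence, per the community-reported behavior:
    'ControlNet is more important' reportedly scales influence by the CFG value."""
    if mode == "controlnet_is_more_important":
        return control_weight * cfg_scale
    return control_weight  # 'balanced' (and roughly 'my prompt is more important')
```

So a modest-looking weight of 0.5 at CFG 7 still lands at an effective 3.5 in "ControlNet is more important" mode – stronger than maxing the weight slider at 2.0 in Balanced.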
Starting and Ending Control Step
These sliders range from 0 to 1, representing the percentage of generation steps where ControlNet applies. Default: 0 to 1 (full generation). Here’s the trap: these aren’t fade-in/fade-out ranges. They’re hard on/off switches.
If you set Starting Control Step to 0.5, the first 50% of generation happens WITHOUT ControlNet. Then it kicks in at the halfway mark, at full strength. No gradual transition. Community reports confirm this causes jarring composition shifts mid-generation if you’re not expecting it.
Use case: Some users lower the Starting Step to let Stable Diffusion establish the overall composition first, then apply spatial control later. But 0.8 or 0.9 is more common for this than 0.5. Starting at 0 (full control from the beginning) is the safest default.
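The hard on/off behavior is easy to express as a sketch – a single comparison per step, with no fade term anywhere:

```python
def controlnet_active(progress: float, start: float = 0.0, end: float = 1.0) -> bool:
    """Whether ControlNet applies at a given point in generation (0.0 to 1.0).

    Hard on/off, no fade: with start=0.5, the first half of the steps run
    entirely without ControlNet, then it switches on at full strength.
    """
    return start <= progress <= end
```

With 20 sampling steps and `start=0.5`, steps 1-9 (progress below 0.5) get zero spatial control and step 10 onward gets full control – which is exactly the jarring mid-generation shift described above.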
Which ControlNet Model for What
ControlNet isn’t one model. It’s a collection. Each one extracts different spatial information. Here are the ones that matter most, and when to use them.
Canny (Edge Detection)
What it does: Extracts hard edges from your reference image – outlines of objects, boundaries between light and dark areas. Great for retaining composition without copying style or color.
When to use it: You have a photo or sketch and want to keep the layout but change everything else. Example: photo of a person in a specific pose → anime character in the same pose.
Preprocessor: Select “canny” in the preprocessor dropdown. Two threshold sliders appear: Low Threshold and High Threshold. Lower values detect more edges (noisier). Higher values detect fewer edges (cleaner). Default (100/200) works for most images.
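Real Canny adds gradient direction and hysteresis, but the threshold intuition can be shown with a toy NumPy sketch: a lower threshold keeps more gradient pixels, including noise, while a higher one keeps only the strong edge:

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = 200.0                      # one strong vertical edge
img += rng.normal(0, 5, img.shape)       # weak noise everywhere

grad = np.abs(np.diff(img, axis=1))      # horizontal gradient magnitude
edges_low = (grad > 20).sum()            # low threshold: strong edge + noise hits
edges_high = (grad > 100).sum()          # high threshold: strong edge only
```

Here `edges_high` is exactly the 64 pixels of the real edge, while `edges_low` picks up extra noise pixels on top – the "noisier" result the sliders trade off against.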
Depth (Spatial Layout)
What it does: Estimates depth – how far each part of the image is from the camera. Generates a grayscale depth map (white = close, black = far). Preserves 3D spatial relationships without locking down exact edges.
When to use it: You want the same spatial depth as a reference image but don’t care about exact outlines. Example: landscape photo with mountains in back, lake in middle, rocks up front → fantasy scene with the same depth layers.
Preprocessor: Multiple options – MiDaS, DPT, LeReS. MiDaS is the most common. According to the official ControlNet repo, these preprocessors work at 512×512 resolution (SD 1.5), much higher than Stability AI’s 64×64 depth model from SD 2.0. That means more detail preserved.
Catch: Depth estimators struggle with top-down views (looking straight down) and abstract art. If your reference image has no clear foreground/background, depth maps won’t help.
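The grayscale convention (white = close, black = far) is just a normalization and flip. This sketch builds such a map from raw camera distances – real preprocessors like MiDaS estimate depth from a photo, but the output convention is the same:

```python
import numpy as np

def depth_to_map(z: np.ndarray) -> np.ndarray:
    """Convert camera distances to a ControlNet-style depth map:
    white (255) = closest point, black (0) = farthest point."""
    z = z.astype(float)
    norm = (z - z.min()) / (z.max() - z.min())   # 0 = closest, 1 = farthest
    return np.round(255 * (1.0 - norm)).astype(np.uint8)
```

A scene with rocks at 1m, a lake at 5m, and mountains at 10m becomes bright foreground, mid-gray middle, and black background – the depth layers the model then preserves.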
OpenPose (Human Pose)
What it does: Detects human keypoints – head, shoulders, elbows, wrists, hips, knees, ankles. Generates a stick-figure skeleton. Best for copying human poses without copying anything else (clothing, face, background).
When to use it: You found a reference photo with the perfect pose but want a completely different character. Example: ballet dancer photo → robot in the same pose.
Preprocessor: “openpose” or “dw_openpose_full” (DWPose is newer, better at detecting hands and fingers). The preprocessor outputs a skeleton image. If you want to manually edit the pose, enable “Allow Preview,” run the preprocessor, download the skeleton, edit it in an image editor, then re-upload it with the preprocessor set to “none.”
There’s also “openpose_face” and “openpose_hand” for face-only or hand-only control. Useful for fixing faces or hands in an existing generation.
Scribble (Rough Sketch to Image)
What it does: Turns your rough sketch into a detailed image. You draw stick figures or basic outlines, ControlNet fills in the rest.
When to use it: You can’t find a reference image that matches your idea, so you draw it yourself (badly). ControlNet interprets your scribble and generates a real image.
Preprocessor: “scribble_hed” (extracts soft edges from an existing image) or “none” (if you’re uploading your own hand-drawn scribble).
The ControlNet Research Paper Context
ControlNet wasn’t a company product. It started as academic research. In February 2023, researchers Lvmin Zhang and Maneesh Agrawala published “Adding Conditional Control to Text-to-Image Diffusion Models” on arXiv. The core innovation: zero convolution layers.
Standard fine-tuning risks destroying a pre-trained model’s knowledge. ControlNet solves this by copying Stable Diffusion’s weights into a “locked” copy (never changed) and a “trainable” copy (learns spatial control). They’re connected via 1×1 convolution layers initialized to zero. At the start of training, these output zeros – so ControlNet adds nothing. As training progresses, the weights grow from zero, gently learning control without disrupting the base model.
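A toy NumPy sketch makes the zero-initialization point concrete. Here the 1×1 convolution is modeled as a channel-mixing matrix (the real blocks are SD U-Net layers, so this is only the shape of the idea):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))       # 8 channels, 16 flattened spatial positions

locked_out = np.tanh(x)            # stand-in for a frozen Stable Diffusion block
trainable_out = np.tanh(x)         # trainable copy starts as an exact clone
W_zero = np.zeros((8, 8))          # 1x1 conv = per-position channel mix, init to 0

output = locked_out + W_zero @ trainable_out   # zero conv adds nothing at init
```

At initialization, `output` equals `locked_out` exactly – the base model is untouched. As training pushes `W_zero` away from zero, the control signal blends in gradually instead of disrupting the pre-trained weights.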
This matters because it means ControlNet models can be swapped without retraining Stable Diffusion. You can use Canny today, OpenPose tomorrow, same base checkpoint. The locked/trainable architecture makes that possible.
What Nobody Tells You About Multiple ControlNets
A1111 supports stacking multiple ControlNet units – OpenPose for pose + Canny for composition + Depth for spatial layout, all at once. Sounds powerful. It is. But it’s also where things fall apart.
Each ControlNet unit you add increases GPU memory usage. SD 1.5 base model: ~4GB VRAM. One ControlNet: roughly +700MB. Two ControlNets: another +700MB. Three? You’re over 6GB. If your GPU can’t handle it, generations crash or slow to a crawl.
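The budget math above can be sketched directly. The figures are the article's ballpark numbers, not measurements – real usage varies with resolution, model size, and attention optimizations:

```python
def estimated_vram_gb(num_controlnets: int, base_gb: float = 4.0,
                      per_controlnet_gb: float = 0.7) -> float:
    """Rough VRAM estimate: SD 1.5 base plus ~700MB per stacked ControlNet."""
    return base_gb + num_controlnets * per_controlnet_gb
```

One unit lands around 4.7GB, two around 5.4GB, and three at roughly 6.1GB – already past a 6GB card before the rest of the pipeline takes its share.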
To enable multiple ControlNets in A1111: Settings → ControlNet → “Multi ControlNet: Max models amount” slider → set to 2 or 3 → Apply and restart. More than 3 is overkill for most use cases.
When you stack ControlNets, their weights interact. If both are set to 1.0, one will dominate depending on the model type. OpenPose tends to overpower Canny. Depth is subtle. Start with lower weights (0.5-0.7) and test.
ComfyUI vs A1111: Does It Matter?
ComfyUI has native ControlNet support – no extension install needed. Models go in ComfyUI/models/controlnet, and you add ControlNet via nodes (Load ControlNet Model → Apply ControlNet). The workflow is more visual but requires understanding node connections.
A1111 is simpler for beginners. ComfyUI is faster for complex workflows (multiple ControlNets, chaining preprocessors). Both use the same model files. Pick based on your interface preference, not ControlNet capability.
When ControlNet Fails (and Alternatives)
ControlNet isn’t perfect. Depth estimators fail on flat, abstract images. OpenPose can’t detect non-human poses (animals, robots – though there’s an Animal OpenPose model now). Canny over-extracts edges from noisy photos.
If ControlNet isn’t giving you what you need, consider:
- T2I-Adapter: Lighter alternative to ControlNet (smaller models, faster inference). Same concept – spatial conditioning – but simpler architecture. Works in A1111 via the same extension.
- IP-Adapter: Style transfer without spatial control. Great for “make this image look like that image” without locking down composition.
- Img2Img with low denoising: If you just need minor adjustments, standard img2img (no ControlNet) at 0.3-0.5 denoising can be faster and simpler.
Start Here, Not with Canny
Most tutorials start with Canny edge detection. I think that’s backwards. Canny is precise – which means it’s unforgiving. Depth is more forgiving. Scribble is the most forgiving. If you’re testing ControlNet for the first time, start with Depth. It’ll give you a feel for how ControlNet influences output without requiring perfect reference images.
Download a depth ControlNet model and select the MiDaS depth preprocessor. Upload any photo with clear foreground/background. Set Control Weight to 0.7. Write a prompt that describes what you want (not what’s in the photo). Generate. You’ll see how depth maps guide spatial layout while letting your prompt control the rest.
Once you understand that, move to OpenPose (if you work with characters) or Canny (if you need precise composition). But start with the model that’s most forgiving.
FAQ
Why doesn’t my ControlNet affect the output at all?
Three common causes: (1) Model version mismatch – your SD checkpoint is 1.5 but your ControlNet model is for SDXL. Check the filenames. (2) Enable checkbox is unchecked. (3) Control Weight is set to 0. Also, make sure you actually selected a model in the Model dropdown – “None” is an option and it does nothing.
Can I use ControlNet with SDXL models?
Yes, but you need SDXL-specific ControlNet models. SD 1.5 ControlNet models don’t work with SDXL checkpoints. As of early 2025, SDXL ControlNet models are mostly community-made (available on Civitai). Stability AI released official SD 3.5 Large ControlNets in January 2025 (Canny, Depth, Blur), free for commercial use under $1M revenue per the Community License.
What’s the difference between Control Weight and Control Mode?
Control Weight (0 to 2) is how strongly the control map influences the image – higher = more control. Control Mode changes HOW that influence applies. “Balanced” applies ControlNet normally. “My prompt is more important” weakens ControlNet slightly in favor of text. “ControlNet is more important” multiplies ControlNet influence by your CFG scale – so if CFG is 7, ControlNet becomes 7x stronger on top of the weight you set. This is non-linear and can overpower your prompt fast. Stick to Balanced unless you’re intentionally experimenting, then adjust weight first before touching Control Mode.