If you’re generating AI characters and want consistent character posing without training a LoRA, the Wan VACE workflow in ComfyUI is the cleanest solution available right now. A community member on r/comfyui built it using Wan 2.1’s video conditioning model as the backbone — and the results beat Flux Kontext for this specific use case. This guide covers everything: model downloads, workflow setup, configuration, and the fixes for every error you’ll likely hit along the way. Tested across GPU sizes from 8 GB to 16 GB VRAM, it runs on hardware most ComfyUI users already have.
🔍 What is the Wan VACE Posing Workflow?
Wan 2.1 VACE (All-in-one Video Creation and Editing) is a video generation model built for frame-consistent output. The insight behind this workflow is that Wan VACE’s conditioning mechanism — which keeps characters consistent across video frames — is exactly what you need for consistent character posing across different poses.
The workflow takes two inputs: a reference character image and a set of OpenPose skeleton images. The VACE model treats the reference as a conditioning frame and generates your character in each pose, using the same frame-consistency mechanism it uses for video. The video output is then sampled at set intervals to extract clean, static pose images.
The result is a batch of 3 images matching your reference character across multiple poses in a single generation run. On an NVIDIA A4000 with 16 GB VRAM, that takes 40–50 seconds using the optimized LightX2V distilled model. The method requires no LoRA training, no fine-tuning, and no IP-Adapter — just a reference image and some OpenPose skeletons.
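The frame-sampling idea above can be sketched in a few lines. This is an illustrative sketch, not the workflow’s actual code: it assumes each pose occupies an equal run of frames in the internal video and that the middle frame of each run — where the pose has settled between transitions — is the one worth keeping.

```python
# Illustrative only: how a short internal video reduces to one still
# per pose. Assumes equal-length pose segments; the real workflow's
# sampling interval is set inside its nodes.

def pose_frame_indices(total_frames: int, n_poses: int) -> list[int]:
    """Return the middle frame index of each equal-length pose segment."""
    segment = total_frames // n_poses
    return [i * segment + segment // 2 for i in range(n_poses)]

# e.g. a 24-frame internal video with 3 pose slots
print(pose_frame_indices(24, 3))  # -> [4, 12, 20]
```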
⚡ Why Use This Workflow?
- ✅ No LoRA training required — works directly from a single reference image
- ✅ Faster than Flux Kontext for consistent character posing
- ✅ Runs on 8 GB VRAM with a quantized model
- ✅ Works with photorealistic, anime, stylized, and 3D characters
- ✅ Supports front, back, and side views
- ✅ Chains cleanly into img2img for final quality refinement
- ✅ Accepts OpenPose, depth, or canny maps as control inputs
✅ Step 1 – Download the Required Models
Before loading the workflow, you need three files. Download each one and place it in the correct ComfyUI folder.
📋 System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8 GB | 16 GB+ |
| RAM | 16 GB | 32 GB |
| Storage | 30 GB | 50 GB |
| OS | Windows 10 | Windows 11 / Linux |
🧠 Wan 2.1 VACE GGUF Model
The fastest option is the LightX2V distilled model from QuantStack on HuggingFace. Download the quantization that fits your VRAM:
- 8–12 GB VRAM: `Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE-Q4_K_M.gguf`
- 12–16 GB VRAM: `Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE-Q5_K_M.gguf`
- 16 GB+ VRAM: `Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE-Q8_0.gguf`
Source: QuantStack/Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE-GGUF
Place the file in: `ComfyUI/models/diffusion_models/`
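The VRAM-to-quantization guidance above can be captured in a small helper. The filenames are the QuantStack releases listed in this guide; the thresholds are this article’s recommendations, not hard limits enforced by the model.

```python
# Mirror of the download table above: pick a GGUF quant by available VRAM.
# Thresholds follow this guide's recommendations, not hard model limits.

def pick_quant(vram_gb: float) -> str:
    base = "Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE"
    if vram_gb >= 16:
        return f"{base}-Q8_0.gguf"
    if vram_gb >= 12:
        return f"{base}-Q5_K_M.gguf"
    if vram_gb >= 8:
        return f"{base}-Q4_K_M.gguf"
    raise ValueError("Below 8 GB VRAM is not covered by this workflow's guidance")

print(pick_quant(8))  # -> "Wan2.1_T2V_14B_LightX2V_StepCfgDistill_VACE-Q4_K_M.gguf"
```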
🎨 Wan 2.1 VAE
Download the VAE from the Comfy-Org repackaged repository.
Source: Comfy-Org/Wan_2.1_ComfyUI_repackaged — split_files/vae
Place it in: `ComfyUI/models/vae/`
📝 Text Encoder (CLIP)
This is the most common failure point in this workflow. Download the _scaled_ version — not the _enc_ version. The filenames look almost identical and using the wrong one causes a cryptic matrix multiplication error.
- Correct file: `umt5_xxl_fp8_e4m3fn_scaled.safetensors`
- Direct download: Comfy-Org/Wan_2.1_ComfyUI_repackaged — text_encoders
Place it in: `ComfyUI/models/clip/`
⚠️ The wrong encoder (`umt5-xxl-enc-fp8_e4m3fn.safetensors`) looks nearly identical in name. If you download the wrong one and get a `mat1 and mat2 shapes cannot be multiplied` error, this is the cause. See the troubleshooting section below.
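Since the two filenames are so easy to confuse, a trivial check can catch the mistake before you ever hit the matrix-multiplication error. This helper is hypothetical (not part of the workflow) and only inspects the filename, not the tensor contents.

```python
# Hypothetical sanity check for the text-encoder pitfall above: the
# correct file ends in "_scaled.safetensors" and has no "-enc-" segment.

def is_correct_text_encoder(filename: str) -> bool:
    return filename.endswith("_scaled.safetensors") and "-enc-" not in filename

assert is_correct_text_encoder("umt5_xxl_fp8_e4m3fn_scaled.safetensors")
assert not is_correct_text_encoder("umt5-xxl-enc-fp8_e4m3fn.safetensors")
print("text encoder filename OK")
```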
✅ Step 2 – Install Required Custom Nodes
Open ComfyUI Manager and install the following node packs. The easiest method is to load the workflow first, then use “Install Missing Custom Nodes” — it detects most dependencies automatically.
Required nodes:
- ComfyUI-GGUF — enables loading `.gguf` quantized model files
- ComfyUI-VideoHelperSuite (VHS) — handles video loading and frame extraction
- ComfyUI-WanVideoWrapper — provides the `WanVaceToVideo` and `torchcompileModelwanVideoV2` nodes
After installing, restart ComfyUI completely before loading the workflow.
💡 Tip: If you hit triton/torch/CUDA errors after the first run, right-click the `torchcompileModelwanVideoV2` node and select Bypass, then click “Update All” in ComfyUI Manager, then restart ComfyUI. This resolves the issue on most Windows setups without needing to install triton manually.
✅ Step 3 – Load the Workflow
Download the workflow from Civitai:
Consistent Character Posing Workflow
A backup JSON copy is available on Pastebin at pastebin.com/4QCLFRwp in case Civitai takes it down.
Open ComfyUI and drag the JSON file onto the canvas, or use Load from the top menu. After the workflow loads, open ComfyUI Manager and click Install Missing Custom Nodes to resolve any remaining dependencies. Restart ComfyUI when prompted.
✅ Step 4 – Set Up Your Reference Image and Poses
The workflow has a “To Configure” group containing the inputs you’ll change for each run.
Reference character image: Load your character image here. Full-body references give the best results. If your reference is cropped at the waist, your pose skeletons need to match that framing — the model can’t infer body parts that aren’t visible in the reference.
Pose images: The workflow ships with 3 pose slots. You can use existing OpenPose skeletons from the OpenPoses Collection on Civitai, or extract a skeleton from any reference photo using the online OpenPose Editor. Upload your image to the editor and click Generate to download the skeleton.
💡 Tip: In the “Pose to Video” group, set the image resize method to `pad` instead of `fill/crop`. The `pad` mode preserves the full skeleton by adding empty space to match dimensions. `fill/crop` cuts off limbs when the skeleton doesn’t match the exact input aspect ratio, which causes the character to lose body parts in the output.
All pose images must share the same pixel dimensions. The resize node handles this automatically once the method is set correctly.
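The difference between `pad` and `fill/crop` can be demonstrated outside ComfyUI with Pillow’s rough equivalents, `ImageOps.pad` and `ImageOps.fit`. This mimics the resize node’s behavior for illustration; it is not the node’s actual code.

```python
# Pillow analogues of the resize node's two modes. "pad" letterboxes the
# image (nothing lost); "fit" scales to fill and crops the overflow,
# which is how limbs get amputated on mismatched aspect ratios.
from PIL import Image, ImageOps

# A tall 512x1024 stand-in for a full-body OpenPose skeleton
skeleton = Image.new("RGB", (512, 1024), "black")

padded = ImageOps.pad(skeleton, (768, 768), color="black")  # scales to 384x768, letterboxes
cropped = ImageOps.fit(skeleton, (768, 768))                # scales to 768x1536, crops top/bottom

print(padded.size, cropped.size)  # both (768, 768), but fit discarded half the height
```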
✅ Step 5 – Configure the Generation Settings
The default settings are tuned for the LightX2V distilled model. Here’s what each key control does:
Steps: Defaults to 4. This is correct for the distilled LightX2V model. If you’re using a standard Wan 2.1 VACE model (not the distilled one), increase to 20–30. Pushing to 6–8 with the distilled model also improves output quality at moderate speed cost.
CFG: Defaults to 1. Correct for the distilled model — don’t change this unless you switch to a standard VACE model, which needs 6–7.
WanVaceToVideo strength: Defaults around 1.02. A value between 1.10 and 1.25 gives stronger pose adherence and keeps the character closer to the reference. Going below 1.0 gives the model more creative freedom but reduces how precisely the character follows the skeleton shape.
Image resize value: Controls output resolution. The default of around 512 is fast but soft. Increasing to 700–750 gives noticeably sharper detail — especially faces. Above 900 starts consuming significant extra VRAM without proportional quality gain.
Image repeat: Controls how many video frames are generated per pose internally. Higher values give the VACE model more room to transition between very different poses. If you’re jumping from a standing pose to an extreme pose like a floor pose, increasing image repeat helps the transition look correct.
✅ Step 6 – Generate and Post-Process
Click Queue Prompt. The workflow generates a short internal video, then samples the nth frame for each pose slot, outputting 3 still images.
The raw output is typically small (around 512 px). For final quality, chain the output images into a standard img2img workflow at 0.3–0.5 denoise using whichever checkpoint you used to generate the original character. This sharpens facial detail and upscales to your target resolution without changing the pose or character identity.
💡 Tip: Before the img2img step, connect your output images to an image resize node and scale up to `1000×1000`. Then pipe that into VAE encode → img2img KSampler → VAE decode. The resize before encoding gives the img2img model more spatial detail to work with and produces significantly sharper final images, especially for faces.
In practice, the middle image from the 3-image batch is often the strongest — it benefits from both neighboring frames during internal video generation. It’s worth tracking which pose slot consistently gives you the best result with your reference.
🛠️ Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `mat1 and mat2 shapes cannot be multiplied (77x768 and 4096x5120)` | Wrong text encoder — using the _enc_ variant instead of _scaled_ | Download `umt5_xxl_fp8_e4m3fn_scaled.safetensors` (no _enc_ in filename) |
| Triton/torch/CUDA errors at KSampler | Triton not installed or node is outdated | Bypass `torchcompileModelwanVideoV2`, run “Update All” in Manager, restart ComfyUI |
| `'float' object cannot be interpreted as an integer` | Wrong text encoder | Switch to the _scaled_ text encoder variant |
| Output recolors or redraws the skeleton image | VACE conditioning too weak | Increase WanVaceToVideo strength to 1.10–1.25; increase steps to 6–8 |
| Character becomes slimmer or taller than reference | Model extends limb proportions to match skeleton bone lengths | Increase WanVaceToVideo strength, or run img2img post-process with original reference |
| Poses getting cropped at edges | Resize method set to fill/crop | Change to pad in the Pose to Video group |
| VHS node suite fails to load | Incompatible or outdated version | Reinstall via ComfyUI Manager with latest version |
| Output is blurry despite correct settings | Image resize value too low | Increase resize to 700–750 in the “To Configure” group |
💡 Tips & Best Practices
💡 Tip: Place your most dramatically different pose last in the sequence. The VACE model transitions between frames internally, so jumping from standing to a floor pose in one step is harder than building up with intermediate poses that gradually get closer to the final one.
💡 Tip: Use the OpenPose Editor to extract skeletons from photos of the actual pose you want, rather than relying only on pre-made collections. Upload a photo of someone doing the pose, extract the skeleton, then use that directly in the workflow.
💡 Tip: For bone-length mismatches — when the skeleton’s proportions don’t match your character’s body — run an img2img ControlNet pass on your character first using the target pose at high denoise (`0.75`). Extract the pose from that result. The extracted skeleton will better match your character’s actual proportions.
💡 Tip: The `torchcompileModelwanVideoV2` node speeds up generation but is genuinely optional. Bypassing it costs some speed but eliminates a whole category of environment-specific compilation errors. If you’re on Windows without a custom Python environment, bypass it by default.
💡 Tip: If you need more than 3 poses per run, you can duplicate the pose input and batch selector nodes in the workflow. The internal video approach scales — more poses just means more frames sampled. Keep all pose images at the same dimensions or the frame extraction will produce inconsistent crops.
💡 Tip: For 8 GB VRAM systems, the Q4_K_M quant runs at roughly 128 seconds for 3 images (tested on an RTX 2060 Super 8 GB with default settings). It’s usable for posing sessions where you’re generating 10–20 variations. For faster batch work, a cloud GPU cuts that to under a minute.
✅ Final Thoughts
The Wan VACE posing workflow is one of the more useful things to come out of the ComfyUI community in a while. It solves a real problem — consistent character posing without any training overhead — and it works reliably from 8 GB VRAM upward. The setup is straightforward once you have the right text encoder (the _scaled_ variant, not _enc_), and the node errors are all well-documented with working fixes.
The key call-out: this workflow is built on Wan 2.1’s video infrastructure, which means it thinks in frames. That’s why the output consistency is so strong. As of March 2026, this is the fastest no-LoRA approach for consistent character posing in ComfyUI, and the community around it is active enough that most edge cases have already been diagnosed.
Happy generating!
❓ FAQ
❓ Q: Does the Wan VACE posing workflow work with 8 GB VRAM?
Yes. Use the Q4_K_M GGUF quant from the QuantStack HuggingFace repo. Generation takes roughly 128 seconds for 3 images on an RTX 2060 Super 8 GB — slower than 16 GB setups but fully functional on default 4-step settings.
❓ Q: Why does my character come out slimmer than the reference image?
Wan VACE extends limb proportions to match the skeleton’s bone lengths. If your reference character is stockier or shorter than the pose skeleton assumes, the output will look elongated. Increasing WanVaceToVideo strength to 1.10–1.25 pulls the result closer to the reference. For persistent issues, try the ControlNet bone-proportion matching approach described in the Tips section above.
❓ Q: Can I use Wan 2.2 instead of Wan 2.1 VACE?
Wan 2.2 is a separate model series. This workflow was built and tested specifically with Wan 2.1 VACE. You can experiment with Wan 2.2 VACE models, but node connections and sampler settings may need adjustment to match the newer architecture.
❓ Q: Do all pose images need to be the same size?
Yes. The workflow’s resize node handles this automatically. Set the resize method to pad to preserve the full skeleton without cropping. Using fill/crop can cut off limbs when the skeleton aspect ratio doesn’t match exactly.
❓ Q: Can this generate back-view or side-view poses?
Yes. Back-view and side-view OpenPose skeletons work the same way as front-facing ones. Load the back or side-view skeleton into the pose slot and the VACE model handles the reorientation from your reference image.
❓ Q: Does this workflow work with anime or stylized characters, not just photorealistic ones?
Yes. The VACE conditioning is style-agnostic — it preserves the visual style of the reference image. Anime characters, stylized art, and semi-realistic renders all transfer well. The main variable is how clearly defined the reference character’s silhouette is.
📚 Additional Resources
- Consistent Character Posing Workflow — Civitai
- Wan2.1 LightX2V VACE GGUF models — HuggingFace (QuantStack)
- Wan 2.1 VAE — HuggingFace (Comfy-Org)
- Text encoder umt5_xxl_fp8_e4m3fn_scaled — HuggingFace
- OpenPoses Collection — Civitai
- Online OpenPose Editor
- Workflow backup — Pastebin