If you're building local AI video workflows and are confused by the rapid release cycle of the WAN foundation models, this is the guide you've been looking for. I've tested everything from the original heavy checkpoints to the latest API drops, and the differences in hardware requirements and output quality are massive. This guide breaks down every major version of WAN, from the foundational 2.1 to the MoE-powered 2.2 and the commercial 2.6 release.
## 🔍 What is the WAN Video Model Family?
Developed by Alibaba Tongyi Lab, the WAN series (from Tongyi Wanxiang, not the networking acronym) is a family of advanced text-to-video (T2V) and image-to-video (I2V) generation models. Initially celebrated for bringing cinematic, high-fidelity video generation to the open-source community, the series has rapidly evolved from heavy monolithic architectures to hyper-efficient Mixture-of-Experts (MoE) designs, and most recently, to a robust commercial API ecosystem.
## ⚡ Why Does the Version Matter?
Choosing the right WAN version dictates not just your generation quality, but your entire workflow pipeline.
- **Hardware Constraints:** Later open-source versions introduced smaller, more efficient parameter counts (like the 5B model) that fit comfortably on consumer GPUs.
- **Open Source vs. Paid:** While versions 2.1 and 2.2 are free with open weights, versions 2.5 and 2.6 are locked behind commercial APIs.
- **New Modalities:** Later versions introduced native audio generation, multi-shot pacing, and strict character reference locking (Reference-to-Video).
- **Prompt Nuance:** The way you prompt the model shifts drastically from 2.1 (which favored rigid parameter strings) to 2.6 (which prefers natural, flowing cinematic descriptions).
## 📊 Quick Comparison Table
| Feature | WAN 2.1 | WAN 2.2 | WAN 2.5 | WAN 2.6 |
|---|---|---|---|---|
| Availability | Open Source | Open Source | Commercial API | Commercial API |
| Architecture | Standard DiT | MoE (Mixture of Experts) | Proprietary | Proprietary |
| Audio Generation | None | Separate S2V Model | Native Audio Sync | Native Audio Sync |
| Model Sizes | 1.3B, 14B | 5B (Hybrid), 14B | Unknown (Cloud) | Unknown (Cloud) |
| Max Resolution | 720p | 720p (at 24fps) | 1080p | 1080p |
| Standout Feature | Baseline fidelity | MoE Efficiency & Speed | Audio Integration | Reference-to-Video (R2V) |
## 🎥 WAN 2.1: The Foundation
Status: Open Source (Apache 2.0)
WAN 2.1 was the release that put Alibaba Tongyi on the map in the open-source video community. Built on a standard Flow Matching Diffusion Transformer (DiT) framework, it offered incredible photorealism that rivaled closed-source models of the time.
### 💡 Key Features
- Offered in 1.3B and 14B parameter sizes.
- Introduced a highly capable 3D Variational Autoencoder (VAE) that could encode/decode unlimited-length 1080p videos without losing historical temporal information.
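The trick behind arbitrary-length encoding is causal temporal chunking: frames are processed in small windows that only look backwards, so peak memory stays flat no matter how long the clip is. A minimal sketch of the chunking logic (the window size and function name are illustrative, not Wan's actual code):

```python
def temporal_chunks(num_frames: int, window: int = 4) -> list[list[int]]:
    """Split frame indices into fixed-size causal windows.

    A causal 3D VAE encodes each window while carrying state only from
    earlier frames, so peak memory is bounded by the window size rather
    than the clip length. The window size here is illustrative.
    """
    return [
        list(range(start, min(start + window, num_frames)))
        for start in range(0, num_frames, window)
    ]

# A 10-frame clip becomes three windows: frames 0-3, 4-7, and 8-9
chunks = temporal_chunks(10)
```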
### ✅ Pros
- Excellent baseline photorealism and physics interpretation.
- Highly malleable with early community LoRAs.
### ❌ Cons
- The 14B model is computationally massive and slow on consumer hardware.
- The 1.3B model struggles with complex multi-subject interactions.
- Tends to "flicker" slightly in longer generations.
## 🎥 WAN 2.2: The MoE Breakthrough
Status: Open Source (Apache 2.0)
WAN 2.2 is widely considered the pinnacle of local, open-source AI video generation. It abandoned the standard monolithic DiT in favor of a Mixture-of-Experts (MoE) architecture. By splitting the denoising process into a "High-Noise Expert" (for structural layout) and a "Low-Noise Expert" (for texture refinement), it drastically reduced computation time without sacrificing quality.
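The two-expert handoff is easy to picture as a router keyed on the denoising timestep. This is a conceptual sketch only: the boundary value and names are hypothetical (the real model decides the switch point from the signal-to-noise ratio):

```python
def select_expert(timestep: float, boundary: float = 0.5) -> str:
    """Route one denoising step to one of two experts.

    Early (high-noise) steps shape global structure; late (low-noise)
    steps refine texture. `boundary` is a hypothetical handoff point.
    """
    return "high_noise_expert" if timestep >= boundary else "low_noise_expert"

# Walk a toy denoising schedule from pure noise (t=1.0) to clean video (t=0.0)
schedule = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
routing = [select_expert(t) for t in schedule]
```

Because only one expert is active per step, the effective compute per step matches a single smaller model, which is where the speedup comes from.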
### 💡 Key Features
- TI2V-5B Model: A highly compressed, unified Text-to-Video and Image-to-Video model that runs flawlessly on a 24GB RTX 4090.
- Cinematic Aesthetics: Trained on an aggressively curated dataset with dense aesthetic labels (lighting, composition, contrast).
- S2V-14B Variant: A separate Speech-to-Video model that drives character lip-syncing from audio inputs.
### ✅ Pros
- The 5B model generates 720p at 24 fps dramatically faster than 2.1.
- MoE architecture means less VRAM is required for the same quality output.
- Temporal consistency is practically perfect; characters stay locked in.
### ❌ Cons
- Audio sync requires running a completely separate model (S2V) rather than happening in a single unified pass.
- Prompting requires more descriptive, natural language than 2.1 to achieve the best results.
## 🎥 WAN 2.5 & 2.6: The Commercial Shift
Status: Closed Source (API Only)
With WAN 2.5 and the subsequent 2.6 release, Alibaba shifted from an open-source model strategy to a commercial cloud API offering. While this frustrated the local ComfyUI community, the leap in capabilities is undeniable.
### 💡 Key Features
- Native Audio-Visual Sync (2.5+): Audio is generated alongside the video in one pass, complete with phoneme-aware lip movements.
- Reference-to-Video (R2V) (2.6): Allows you to input 1-3 reference videos to maintain absolute character identity across entirely new scenes.
- Multi-Shot Storytelling (2.6): You can prompt for multiple camera angles and scene changes in a single 15-second generation block.
- Expanded Aspect Ratios: Native support for 16:9, 9:16, 1:1, 4:3, and 3:4.
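As closed API products, 2.5 and 2.6 are driven entirely by HTTP requests. The sketch below shows the general shape of such a call; the endpoint URL, model identifier, and field names are all assumptions for illustration, so consult Alibaba Cloud's official API documentation for the real schema:

```python
import json
import urllib.request

def build_payload(prompt: str, resolution: str = "1080p",
                  aspect_ratio: str = "16:9") -> dict:
    """Assemble a generation request body.

    The model name and field names below are hypothetical placeholders,
    not the documented Alibaba Cloud schema.
    """
    return {
        "model": "wan-video",        # placeholder model identifier
        "prompt": prompt,
        "resolution": resolution,     # 2.5+ supports up to 1080p
        "aspect_ratio": aspect_ratio, # 16:9, 9:16, 1:1, 4:3, or 3:4
    }

def submit(payload: dict, api_key: str,
           url: str = "https://example.com/v1/video/generation") -> dict:
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In practice these APIs are asynchronous: you submit a job, receive a task ID, and poll for the finished video URL.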
### ✅ Pros
- Zero hardware requirements; completely cloud-based.
- Breathtaking 1080p outputs with unparalleled character consistency.
- Drastically reduces the need for external video editors due to multi-shot prompting.
### ❌ Cons
- Pay-per-generation pricing model.
- Strictly censored compared to the uncensored open-weights 2.2.
- Cannot be run locally in ComfyUI or customized with your own LoRAs.
## 🛠️ Troubleshooting Local Setups (2.1 & 2.2)
If you are sticking with the open-source versions locally, you might run into a few migration hurdles.
| Error | Cause | Fix |
|---|---|---|
| Out of Memory when migrating from 2.1 to the 2.2 14B | The MoE experts in 2.2 require different VRAM allocation during the handoff. | Ensure `--offload_model True` is set, or switch to the 5B Hybrid model. |
| Gray/static outputs in WAN 2.2 | VAE tensor mismatch, or Flow Shift is set incorrectly. | Use Flow Shift 5.0 for 720p and 3.0 for 480p. Force the VAE to `torch.float32`. |
| Subjects morphing in 2.2 | The prompt is too short or technically rigid (e.g., relying on tags). | Use an LLM to rewrite your prompt into a natural, descriptive paragraph. |
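The flow-shift and VAE fixes from the table can be captured in a small preset helper. The dictionary keys here are illustrative (they are not a real ComfyUI or Wan launcher schema); only the numeric values come from the table above:

```python
# Flow-shift presets per output resolution (values from the table above)
FLOW_SHIFT = {"720p": 5.0, "480p": 3.0}

def sampler_config(resolution: str, offload_model: bool = True) -> dict:
    """Build sampler settings for a local WAN 2.2 run.

    The keys are illustrative placeholders, not a documented schema.
    """
    if resolution not in FLOW_SHIFT:
        raise ValueError(f"no flow-shift preset for {resolution!r}")
    return {
        "flow_shift": FLOW_SHIFT[resolution],
        "vae_dtype": "float32",          # fp16 VAE decode can yield gray/static frames
        "offload_model": offload_model,  # mirrors the --offload_model True flag
    }
```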
## 💡 Tips & Best Practices
If you are navigating the WAN ecosystem, keep these tips in mind to maximize your results:
💡 **Tip: Use 2.2's 5B Model for Prototyping.** Even if you have the VRAM for the 14B model, use the TI2V-5B model to rapidly test prompts and camera movements. It is significantly faster and often gets the composition 90% correct.
💡 **Tip: Combine Open and Closed Workflows.** Use WAN 2.2 locally for unrestricted, complex generations, then use a cheap cloud upscaler or audio-sync tool to polish the final output without paying the premium WAN 2.6 API fees.
💡 **Tip: Rethink Your Prompts for 2.2.** WAN 2.1 liked structured, comma-separated tags. WAN 2.2 is trained on dense captions. Change your prompts from `"cat, sunglasses, beach, 4k"` to `"A fluffy white cat wearing black sunglasses relaxes on a surfboard at a sunny beach..."`
## ⭐ Final Thoughts
The WAN series has matured incredibly fast. If you are a developer, an indie creator, or just a ComfyUI enthusiast, WAN 2.2 is the undisputed champion of the open-source video world: its MoE architecture is a masterclass in efficiency. However, if you are running a commercial studio and need perfect character consistency and lip-syncing without the hassle of a local pipeline, paying the API toll for WAN 2.6 is well worth the cost. The best model is the one that fits your hardware, your budget, and your workflow. Happy generating!
## ❓ FAQ
Q: Will WAN 2.6 ever be released open-source?
A: It is highly unlikely. Alibaba is heavily monetizing the API layer for 2.5 and 2.6. The community consensus is that the open-source era for WAN ended with 2.2.
Q: Can I use WAN 2.1 LoRAs on WAN 2.2?
A: No. The architecture shifted from a standard DiT to a Mixture-of-Experts (MoE) system. Your old LoRAs will not work and must be retrained on the 2.2 architecture.
Q: What is the difference between Text-to-Video and Image-to-Video in WAN 2.2?
A: The 14B models are split into dedicated T2V and I2V checkpoints, optimizing them for their specific tasks. However, the 5B model is a "Hybrid" (TI2V) that can handle both seamlessly.
## 📚 Additional Resources
- WAN 2.2 Official GitHub Repository
- WAN 2.1 Official GitHub Repository
- WAN 2.6 API Documentation (Alibaba Cloud)