Wan Video Ecosystem
Alibaba's open source video generation models: Wan 2.1, Wan 2.2, and the ecosystem of control, character, and optimization tools built around them
Overview
Wan is an ecosystem of video generation models released by Alibaba starting in February 2025. Unlike single models like LTX Video, Wan comprises multiple model versions (2.1, 2.2), sizes (1.3B, 5B, 14B), and specialized variants (VACE, Fun, character models) that work together.
The Wan Family
| Version | Key Models | Strengths |
|---|---|---|
| Wan 2.1 | T2V 1.3B/14B, I2V 14B | Most stable, best ecosystem support, VACE control |
| Wan 2.2 | T2V/I2V A14B (MoE), TI2V-5B, S2V-14B | Better aesthetics, MoE architecture, speech-to-video |
| VACE | 1.3B, 14B | ControlNet-like capabilities built in (inpaint, reference, depth) |
| Fun | Control, InP, Camera | Canny, depth, pose, trajectory, camera control |
Key Characteristics
- 16 FPS native output (some variants support 24 FPS)
- Strong prompt adherence - follows complex scene descriptions on first try
- Good camera movement - responds well to camera motion prompts
- Defaults to Asian people - specify ethnicity in prompts if needed
- CFG support - allows negative prompting and guidance
- 81 frame minimum for I2V - hardcoded requirement
Choosing a Model
What do you want to do?
- Fast iteration on limited VRAM → Wan 2.1 T2V 1.3B (8GB VRAM, ~4 min for 5s @ 480p on 4090)
- Best text-to-video aesthetics → Wan 2.2 T2V A14B (MoE architecture, cinematic aesthetics)
- Reference-guided editing → VACE 14B (reference images, masking, inpainting built-in)
- Structural control → Fun Control (canny, depth, MLSD, pose, trajectory)
- Consistent characters → Phantom (up to 4 reference images) or MAGREF (multi-subject)
- Lip-synced talking video → HuMo (single person, good sync) or MultiTalk (multiple people)
- Long or extended video → InfiniteTalk (streaming mode) or SVI Pro LoRA (chained generations)
- Faster generation → LightX2V or CausVid LoRA (distillation)
Wan 2.1 vs 2.2
| Aspect | Wan 2.1 | Wan 2.2 |
|---|---|---|
| Architecture | Standard transformer | MoE (separate high/low noise experts) |
| Aesthetics | Good | Cinematic, more detailed |
| Ecosystem support | Excellent (VACE, Fun, most tools) | Growing (Fun 2.2 available) |
| Stability | More stable, well-tested | Newer, some edge cases |
| VRAM | 14B: ~16GB fp8 | 5B hybrid fits 8GB with offloading |
| Frame count | 81 frames typical | 5B supports 121 frames |
Hardware Requirements
| Model | VRAM (fp8) | Notes |
|---|---|---|
| Wan 2.1 T2V 1.3B | ~8GB | Works on most consumer GPUs |
| Wan 2.1 T2V 14B | ~16-20GB | DiffSynth can run under 24GB |
| Wan 2.1 I2V 14B | ~16.5GB fp8 | 66GB in fp32 |
| Wan 2.2 TI2V-5B | ~8GB | Hybrid T2V/I2V, consumer GPU compatible |
| VACE 14B | ~20GB | 720p: ~40 min on 4090 |
| HuMo 1.7B | 32GB | 480p in ~8 min |
| HuMo 17B | 24GB+ | Runs on 3090 via ComfyUI wrapper |
| MultiTalk | 24GB (4090) | 8GB possible with persistence=0 |
| MAGREF | ~70GB recommended | Multi-GPU (8x) supported via FSDP |
Low VRAM Options
- Wan2GP - Runs Wan on as low as 6GB VRAM with offloading
- GGUF quantization - Reduces model size with some quality trade-off (see the loading sketch after this list)
- Block swapping - Kijai's wrapper supports VRAM optimization
- LightX2V - 8GB VRAM + 16GB RAM minimum with offloading
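As a sketch of the GGUF route: diffusers can load community GGUF checkpoints into the Wan transformer via from_single_file. This assumes diffusers' GGUF support (GGUFQuantizationConfig) covers Wan in your version; the checkpoint path below is a placeholder.

```python
import torch
from diffusers import GGUFQuantizationConfig, WanPipeline, WanTransformer3DModel

# Placeholder path: substitute whichever Wan GGUF checkpoint you actually use.
transformer = WanTransformer3DModel.from_single_file(
    "path/to/wan2.1-t2v-14b-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # combine quantization with offloading on low-VRAM cards
```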
Text encoder is heavier than 1.3B model
Wan uses umt5-xxl text encoder which is 10-11GB. The text encoder can be quantized to fp8 without issues.
— Fannovel16, Kijai
VAE is very efficient
Wan's VAE is only 250MB in bf16, much smaller than other models. It's fast and doesn't need a config file.
— Kijai
Generation Modes
Text-to-Video (T2V)
Generate video from text prompts alone.
- 1.3B: 480p, fast, good for iteration
- 14B: Higher quality, 480p and 720p
- 2.2 A14B: MoE architecture, best aesthetics
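For a concrete starting point, here is a minimal 1.3B T2V run, assuming the Hugging Face diffusers integration (WanPipeline and AutoencoderKLWan ship in recent diffusers releases):

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The Wan VAE is kept in fp32 for decode quality; the transformer runs in bf16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic, cinematic lighting",
    negative_prompt="blurry, low quality, distorted",
    height=480, width=832,   # 480p, both divisible by 16
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "t2v_output.mp4", fps=16)  # Wan's native 16 fps
```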
T2V is much faster than I2V
T2V generation takes ~130 seconds vs much longer I2V times for similar settings.
— TK_999
Image-to-Video (I2V)
Animate a starting image into video.
- 81 frame minimum - hardcoded in encode_image function
- Works better at 720p+ - lower resolutions perform poorly
- Can chain for extensions - take last frame, feed to I2V for seamless extensions
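A minimal sketch of the chaining trick, assuming the diffusers WanImageToVideoPipeline; each segment's last frame seeds the next:

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("start_frame.png")  # placeholder file name
all_frames = []
for i in range(3):  # three chained 81-frame segments
    frames = pipe(
        image=image,
        prompt="The camera slowly pans across the scene",
        height=480, width=832,
        num_frames=81,          # the hardcoded I2V minimum
        guidance_scale=5.0,
    ).frames[0]
    all_frames.extend(frames if i == 0 else frames[1:])  # drop the duplicated seam frame
    image = frames[-1]          # last frame becomes the next start image

export_to_video(all_frames, "extended.mp4", fps=16)
```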
First-Last-Frame (FLF2V)
Generate video between two keyframes - a starting and ending image.
- Wan2.1-FLF2V-14B-720P
- Frame count must satisfy (length - 1) divisible by 4, i.e. 4k + 1 frames (see the helper below)
- Useful for controlled transitions
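A small helper for the frame-count rule (valid counts are 17, 49, 81, ...):

```python
def snap_num_frames(target: int) -> int:
    """Round a requested frame count to the nearest valid Wan value,
    where (num_frames - 1) must be divisible by 4."""
    return max(5, round((target - 1) / 4) * 4 + 1)

assert snap_num_frames(80) == 81
assert snap_num_frames(121) == 121  # Wan 2.2 5B's longer clips are also 4k + 1
```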
Speech-to-Video (S2V)
Generate video driven by audio/speech input. Wan 2.2 only.
- Wan2.2-S2V-14B
- CosyVoice text-to-speech integration
- Audio-driven generation
Recommended Settings
| Parameter | Recommendation | Notes |
|---|---|---|
| Steps | 30-50 | 50 significantly better than 30; 70 no improvement over 50 |
| Flow shift | 3-5 | Lower = better details; too low = coherence issues |
| CFG | 5-7 | CFG 1.0 skips uncond for speed (~20 sec with 1.3B) |
| Resolution | 480p or 720p | Must be divisible by 16. Video models perform best at native res. |
| Frame rate | 16 fps | All Wan samples are 16 fps by default |
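Applied to a diffusers run, the table above maps onto the sampler settings roughly as follows. The UniPCMultistepScheduler flow_shift override is an assumption based on diffusers' Wan examples (flow_shift ≈ 3.0 for 480p, 5.0 for 720p):

```python
import torch
from diffusers import UniPCMultistepScheduler, WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Flow shift: lower values sharpen details, too low hurts coherence.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)

frames = pipe(
    prompt="...",
    height=720, width=1280,      # both divisible by 16
    num_frames=81,
    num_inference_steps=50,      # 50 is clearly better than 30; 70 adds nothing
    guidance_scale=5.0,          # CFG 5-7
).frames[0]
```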
CFG scheduling improves I2V
Using variable CFG through generation (e.g., 6 CFG for 18 steps, then 1 CFG for 18 steps) produces better motion and quality.
— JmySff
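A hedged way to sketch this in diffusers is a callback_on_step_end hook that mutates the pipeline's private _guidance_scale between steps; the private attribute is an assumption about pipeline internals, not stable API:

```python
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

def cfg_schedule(pipe, step_index, timestep, callback_kwargs):
    # 6.0 CFG for the first 18 steps, then 1.0 for the rest.
    # _guidance_scale is private; recent pipelines re-read it each step.
    pipe._guidance_scale = 6.0 if step_index < 18 else 1.0
    return callback_kwargs

frames = pipe(
    prompt="...",
    num_inference_steps=36,
    guidance_scale=6.0,
    callback_on_step_end=cfg_schedule,
).frames[0]
```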
FP32 text encoder produces sharper results
FP32 UMT5-XXL encoder shows noticeable quality improvement over BF16, similar to improvements seen across T5 family models.
— Pedro (@LatentSpacer)
Control Methods
VACE (Video Creation & Editing)
Built-in ControlNet-like capabilities for Wan. Tasks include Reference-to-Video, Video-to-Video, and Masked Video-to-Video.
Features: Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything
Fun Control Models
VideoX-Fun provides multiple control methods for Wan.
| Control Type | Use Case |
|---|---|
| Canny | Edge-guided generation |
| Depth | 3D structure preservation |
| Pose (DWPose/VitPose) | Character animation from skeleton |
| MLSD | Line segment detection |
| Trajectory | Path-based motion control |
| Camera | Pan, tilt, zoom, arc movements |
Fun VACE 2.2 is better than VACE 2.1
Better in every way from testing, even ignoring the extra High Noise part.
— Ablejones
Camera Control
ReCamMaster
Camera-controlled generative rendering from single video. Supports pan, tilt, zoom, translation, arc movements with variable speed.
Camera prompting technique for I2V
For pan right, mention 'camera reveals something' or 'camera pans down revealing a white tiled floor' - works for controlling camera movement.
— hicho
Motion Control
ATI (Any Trajectory Instruction)
Finally good motion trajectory control that feels natural and responsive for video generation.
WanAnimate
Extends automatically with sliding window. Uses inverted canny (white background, black edges). Can track facial features very well, even pupils.
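The inverted-canny control input is easy to prepare with OpenCV; a minimal sketch (file names are placeholders):

```python
import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(frame, 100, 200)   # white edges on a black background
inverted = cv2.bitwise_not(edges)    # black edges on white, as WanAnimate expects
cv2.imwrite("control_frame.png", inverted)
```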
WanAnimate strength tips
At strength 2.0 on Wan 2.2 it's too much. Blocks 0-15 yield nice results. WanAnimate is generally too strong and ruins prompt following. Use start percentage 0.5 for better motion while still getting likeness.
— Kijai, Hashu
Character & Likeness
Phantom (Subject Consistency)
Single and multi-subject reference for consistent identity across generations. 14B trained on 480p, less stable at higher res.
- Describe reference images accurately in prompts
- Use horizontal orientations for stability
- Modify seed iteratively for quality
MAGREF (Multi-Reference)
Generate videos from multiple reference images with subject disentanglement. Any-reference video generation.
EchoShot (Multi-Shot Consistency)
Generate multiple video shots of the same person with consistent identity across different scenes.
Lynx (Face ID)
Lynx reference strength tuning
Strength 0.6 works well; 1.0+ is too strong and creates a 'face glued on video' effect, and 1.5+ creates nightmare fuel. Lynx works with VACE.
— Kijai
Lynx changes frame rate
Lynx makes Wan models run at 24fps instead of 16fps. This was originally intended only for the lite version.
— Kijai
Lip-Sync & Audio-Driven
HuMo (Human-Centric Multimodal)
Text + Audio (TA) or Text + Image + Audio (TIA) modes. Strong text prompt following, consistent subject preservation, synchronized audio-driven motion.
- scale_a: Audio guidance strength
- scale_t: Text guidance strength
- Default 50 steps, can use 30-40 for faster
- 720p resolution significantly improves quality
HuMo stops talking during silence
Unlike constant mouth movement issues in other models, HuMo respects silent clips in audio.
— Juan Gea
SageAttention hurts HuMo lip sync
Visible degradation in lip sync quality when SageAttention is enabled.
— Ablejones
MultiTalk (Multi-Person)
Single and multi-person video, cartoon characters, singing. 480p and 720p at arbitrary aspect ratios. TTS audio integration.
- --mode streaming: Long video generation
- --use_teacache: 2-3x speedup
- --sample_steps: 40 recommended (10 for faster)
- 480p single-GPU only in current code; 720p needs multi-GPU
InfiniteTalk (Unlimited Length)
Improvements over MultiTalk: reduces hand/body distortions, superior lip sync, sparse-frame dubbing.
- I2V beyond 1 minute: color shifts become pronounced
- V2V camera: not identical to original
- Keep steps at 4 or below, or use with lightx2v LoRA
FantasyTalking (Portrait Animation)
Audio-driven motion synthesis with text prompts for behavior control. Various body ranges and poses, characters and animals.
| Config | Speed | VRAM |
|---|---|---|
| Full precision (bf16) | 15.5s/iter | 40GB |
| 7B persistent params | 32.8s/iter | 20GB |
| Minimal (0 params) | 42.6s/iter | 5GB |
HuMo + InfiniteTalk embed mixing
Mixing embeds from both models provides better acting, better prompt adherence, and respects starting frame details more faithfully than either model alone.
— Juan Gea
Speed & Optimization
LightX2V (Distillation)
Distillation framework supporting Wan 2.1, 2.2, and other models. 4-step generation without CFG (see the sketch after the list below).
- Single-GPU speedup: 1.9x (H100), 1.5x (4090D)
- Multi-GPU speedup (8x H100): 3.9x
- Quantization: w8a8-int8, w8a8-fp8, w4a4-nvfp4
- Supports Sage Attention, Flash Attention, TeaCache
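In diffusers terms, a distilled run looks roughly like this. The LoRA directory and file name are placeholders for whichever lightx2v step/CFG-distill LoRA you use, and load_lora_weights support for Wan pipelines is an assumption about your diffusers version:

```python
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Placeholder LoRA location; substitute the distill LoRA you downloaded.
pipe.load_lora_weights(
    "path/to/lora/dir", weight_name="lightx2v_step_cfg_distill_lora.safetensors"
)

frames = pipe(
    prompt="...",
    num_inference_steps=4,   # distilled: 4 steps
    guidance_scale=1.0,      # no CFG, so the uncond pass is skipped
).frames[0]
```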
LightX2V + FastWan LoRA speed
LightX2V + FastWan at 2 steps: 31.49 seconds vs 70.63 seconds for LightX2V alone at 4 steps.
— VRGameDevGirl84 (RTX 5090)
CausVid (Temporal Consistency)
Converts bidirectional diffusion to autoregressive for streaming generation. Block-wise causal attention with KV caching.
- 3-step inference achieves 84.27 on VBench-Long
- 1.3 second initial latency, then streaming
- LoRA versions V1, V1.5, V2 available
Wan2GP (Low VRAM)
GPU-poor friendly implementation with aggressive offloading.
- Memory profiles (1-5) trade speed for VRAM
- Sliding window for long videos
- Text encoder caching
Other Optimization Tips
TeaCache acceleration
TeaCache achieves ~2x speedup on Wan models. Threshold of 0.2-0.5 is optimal for MultiTalk.
PyTorch nightly with --fast flag
Uses fp16 + fp16 accumulation instead of fp16/bf16 + fp32 accumulation, 2x faster on NVIDIA GPUs.
— comfy
CFG 1.0 for speed
Using CFG 1.0 skips uncond and can make generation faster - ~20 seconds with the 1.3B model.
— Kijai
VAE tiling has no quality impact
VAE tiling at default settings vs no VAE tiling showed zero difference in quality.
— TK_999
Training
LoRA Training
Wan training is easier than Hunyuan
Better LoRA results in 2 epochs compared to hundreds with Hunyuan.
— samurzl
Training resolution limitations
Training at 256 resolution doesn't translate to higher resolutions as well as it does with Hunyuan; low-resolution training results don't scale up as effectively.
— samurzl
Control LoRAs can be trained on any condition
Control LoRAs can be used for deblurring, inpainting (train on videos with segments removed), interpolation, drawing trajectories based on optical flow, or interpreting hand signals and body movements as motion.
— pom
Training Tips
- LoRAs work with Wan video models using both wrapper and native nodes
- Can train water morphing LoRA on just 6 videos for 1,000 steps
- A LoRA trained on AnimateDiff outputs gives Wan-level quality with AnimateDiff-style movement
- Using the AnimateDiff LoRA at low strength adds subtle motion enhancement and works at just 6 steps
Frameworks
- DiffSynth-Studio: Enhanced support with quantization and LoRA training
- Kijai's wrapper: LoRA weight support in ComfyUI
Troubleshooting
Common Errors
mat1 and mat2 error for CLIP loader
Problem: CLIP loader only passes clip-l data, causing shape mismatch errors
Solution: Reinstall transformers and tokenizers:

```
pip uninstall -y transformers tokenizers
pip install transformers==4.48.0 tokenizers==0.21.0
```
— Faust-SiN
WanSelfAttention normalized_attention_guidance error
Problem: 'takes from 3 to 4 positional arguments but 8 were given'
Solution: Disable the WanVideo Apply NAG node. Ensure KJNodes and WanVideoWrapper are up to date.
— Nao48, JohnDopamine
Block swap prefetch causing black output
Problem: Setting prefetch higher than 1 causes black output with 'invalid value encountered in cast'
Solution: Keep prefetch count at 1 when using block swap.
— patientx, Kijai
49 frames error / max_seq_len error
Problem: I2V fails with sequence length errors
Solution: Use 81 frames minimum for I2V. This is hardcoded in the model.
Quality Issues
FP8_fast quantization causes artifacts
Problem: Color/noise issues with fp8
Solution: Keep img_embed weights at fp32. Native implementation has yellowish hue with fp8.
— Kijai
Tiled VAE decode fixes washed out frames
Problem: Regular VAE decode causes extremely washed out frames (except first)
Solution: Use tiled VAE decode.
— Screeb
Video colorspace/color shift
Problem: Colors look different after save/load cycle
Solution: Use Load Video (FFmpeg) instead of VHS Load Video for correct colorspace handling.
— Austin Mroz
Long video color drift
Problem: Color shifts become pronounced in I2V beyond 1 minute
Solution: Use SVI Pro LoRA for chained generations, or accept the limitation for streaming modes.
Performance Issues
Sampler V2 preview not showing
Problem: Live preview not working on new sampler
Solution: Change 'Live Preview Method' in ComfyUI settings to latent2rgb.
— lostintranslation
SAM3 masking VRAM leak
Problem: VRAM leak when using SAM3 masking
Solution: Run SAM3, disable it, then load the video for the mask.
— mdkb
FP8 model loading slow in native
Problem: FP8 model takes 10 seconds to load in native
Solution: Use Kijai's wrapper - FP8 loads in 1 second.
— hicho
Shift Values & Settings
Shift values for distilled LoRAs
Problem: Unclear what shift value to use with distilled LoRAs
Solution: Use shift=5 for Wan distilled LoRAs. Higher resolution needs higher shift because more resolution increases signal at a given noise level.
— spacepxl
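The intuition matches the standard flow-matching shift warp (the form used by diffusers' flow schedulers), which stretches the schedule toward high noise as shift grows; a small sketch:

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-shift time warp: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    Larger shift spends more of the schedule at high noise, which higher
    resolutions need because they carry more signal at a given noise level."""
    return shift * sigma / (1 + (shift - 1) * sigma)

print(shift_sigma(0.5, 1.0))  # 0.5    - no shift
print(shift_sigma(0.5, 5.0))  # ~0.833 - pushed toward high noise
```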
Flow shift affects motion
Lower flow shift (3-5) = better details. Too low = coherence issues. Flow shift also affects motion speed.
— Juampab12, ezMan
Resources
Official Repositories
ComfyUI Integration
- Kijai's WanVideoWrapper - Supports 20+ models
- Official ComfyUI Wan Tutorial
- ComfyUI VACE Tutorial
- ComfyUI Wan 2.2 Tutorial
Control & Fun Variants
- VideoX-Fun - Canny, depth, pose, trajectory, camera
- ReCamMaster - Camera control
Character & Audio Models
- Phantom - Subject consistency
- MAGREF - Multi-reference
- EchoShot - Multi-shot consistency
- HuMo - Human-centric multimodal
- MultiTalk - Multi-person conversations
- InfiniteTalk - Unlimited length
- FantasyTalking - Portrait animation
Optimization
Community
- Banodoco Discord - #wan_chatter, #wan_training, #wan_comfyui, #wan_resources
- Civitai - Community LoRAs and workflows
Tips from the Community
- 2x VAE upscaler: spacepxl's decoder acts as a free 2x upscaler and kills noise grid patterns
- VitPose for animals: Use thickness=20 for animal motion in SCAIL
- Model mixing: Using 2.1 as high model + 2.2 low gives content of 2.1 with look of 2.2