Wan Video Ecosystem
Alibaba's open source video generation models: Wan 2.1, Wan 2.2, and the ecosystem of control, character, and optimization tools built around them
Overview
Wan is an ecosystem of video generation models released by Alibaba starting in February 2025. Unlike single models like LTX Video, Wan comprises multiple model versions (2.1, 2.2), sizes (1.3B, 5B, 14B), and specialized variants (VACE, Fun, character models) that work together.
The Wan Family
| Version | Key Models | Strengths |
|---|---|---|
| Wan 2.1 | T2V 1.3B/14B, I2V 14B | Most stable, best ecosystem support, VACE control |
| Wan 2.2 | T2V/I2V A14B (MoE), TI2V-5B, S2V-14B | Better aesthetics, MoE architecture, speech-to-video |
| VACE | 1.3B, 14B | ControlNet-like capabilities built in (inpaint, reference, depth) |
| Fun | Control, InP, Camera | Canny, depth, pose, trajectory, camera control |
Key Characteristics
- 16 FPS native output (some variants support 24 FPS)
- Strong prompt adherence - follows complex scene descriptions on first try
- Good camera movement - responds well to camera motion prompts
- Defaults to Asian people - specify ethnicity in prompts if needed
- CFG support - allows negative prompting and guidance
- 81 frame minimum for I2V - hardcoded requirement
Choosing a Model
What do you want to do?
- Fast iteration on limited VRAM → Wan 2.1 T2V 1.3B (8GB VRAM, ~4 min for 5s @ 480p on 4090)
- Best text-to-video aesthetics → Wan 2.2 T2V A14B (MoE architecture, cinematic aesthetics)
- Reference-guided editing → VACE 14B (reference images, masking, inpainting built-in)
- Structural control → Fun Control (canny, depth, MLSD, pose, trajectory)
- Consistent characters → Phantom (up to 4 reference images) or MAGREF (multi-subject)
- Lip-synced talking video → HuMo (single person, good sync) or MultiTalk (multiple people)
- Long or extended video → InfiniteTalk (streaming mode) or SVI Pro LoRA (chained generations)
- Faster generation → LightX2V or CausVid LoRA (distillation)
Wan 2.1 vs 2.2
| Aspect | Wan 2.1 | Wan 2.2 |
|---|---|---|
| Architecture | Standard transformer | MoE (separate high/low noise experts) |
| Aesthetics | Good | Cinematic, more detailed |
| Ecosystem support | Excellent (VACE, Fun, most tools) | Growing (Fun 2.2 available) |
| Stability | More stable, well-tested | Newer, some edge cases |
| VRAM | 14B: ~16GB fp8 | 5B hybrid fits 8GB with offloading |
| Frame count | 81 frames typical | 5B supports 121 frames |
Hardware Requirements
| Model | VRAM (fp8) | Notes |
|---|---|---|
| Wan 2.1 T2V 1.3B | ~8GB | Works on most consumer GPUs |
| Wan 2.1 T2V 14B | ~16-20GB | DiffSynth can run under 24GB |
| Wan 2.1 I2V 14B | ~16.5GB fp8 | 66GB in fp32 |
| Wan 2.2 TI2V-5B | ~8GB | Hybrid T2V/I2V, consumer GPU compatible |
| VACE 14B | ~20GB | 720p: ~40 min on 4090 |
| HuMo 1.7B | 32GB | 480p in ~8 min |
| HuMo 17B | 24GB+ | Runs on 3090 via ComfyUI wrapper |
| MultiTalk | 24GB (4090) | 8GB possible with persistence=0 |
| MAGREF | ~70GB recommended | Multi-GPU (8x) supported via FSDP |
Low VRAM Options
- Wan2GP - Runs Wan on as low as 6GB VRAM with offloading
- GGUF quantization - Reduces model size with some quality trade-off (see the loading sketch after this list)
- Block swapping - Kijai's wrapper supports VRAM optimization
- LightX2V - 8GB VRAM + 16GB RAM minimum with offloading
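As a sketch of the GGUF route: diffusers can load community GGUF checkpoints into the Wan transformer via from_single_file. This assumes diffusers' GGUF support (GGUFQuantizationConfig) covers Wan in your version; the checkpoint path below is a placeholder.

```python
import torch
from diffusers import GGUFQuantizationConfig, WanPipeline, WanTransformer3DModel

# Placeholder path: substitute whichever Wan GGUF checkpoint you actually use.
transformer = WanTransformer3DModel.from_single_file(
    "path/to/wan2.1-t2v-14b-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # combine quantization with offloading on low-VRAM cards
```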
Text encoder is heavier than 1.3B model
Wan uses umt5-xxl text encoder which is 10-11GB. The text encoder can be quantized to fp8 without issues.
— Fannovel16, Kijai
VAE is very efficient
Wan's VAE is only 250MB in bf16, much smaller than other models. It's fast and doesn't need a config file.
— Kijai
Generation Modes
Text-to-Video (T2V)
Generate video from text prompts alone.
- 1.3B: 480p, fast, good for iteration
- 14B: Higher quality, 480p and 720p
- 2.2 A14B: MoE architecture, best aesthetics
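For a concrete starting point, here is a minimal 1.3B T2V run, assuming the Hugging Face diffusers integration (WanPipeline and AutoencoderKLWan ship in recent diffusers releases):

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The Wan VAE is kept in fp32 for decode quality; the transformer runs in bf16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic, cinematic lighting",
    negative_prompt="blurry, low quality, distorted",
    height=480, width=832,   # 480p, both divisible by 16
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "t2v_output.mp4", fps=16)  # Wan's native 16 fps
```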
T2V is much faster than I2V
T2V generation takes ~130 seconds vs much longer I2V times for similar settings.
— TK_999
Image-to-Video (I2V)
Animate a starting image into video.
- 81 frame minimum - hardcoded in encode_image function
- Works better at 720p+ - lower resolutions perform poorly
- Can chain for extensions - take last frame, feed to I2V for seamless extensions
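A minimal sketch of the chaining trick, assuming the diffusers WanImageToVideoPipeline; each segment's last frame seeds the next:

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("start_frame.png")  # placeholder file name
all_frames = []
for i in range(3):  # three chained 81-frame segments
    frames = pipe(
        image=image,
        prompt="The camera slowly pans across the scene",
        height=480, width=832,
        num_frames=81,          # the hardcoded I2V minimum
        guidance_scale=5.0,
    ).frames[0]
    all_frames.extend(frames if i == 0 else frames[1:])  # drop the duplicated seam frame
    image = frames[-1]          # last frame becomes the next start image

export_to_video(all_frames, "extended.mp4", fps=16)
```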
First-Last-Frame (FLF2V)
Generate video between two keyframes - a starting and ending image.
- Wan2.1-FLF2V-14B-720P
- Frame count must satisfy (length - 1) divisible by 4, i.e. 4k + 1 frames (see the helper below)
- Useful for controlled transitions
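A small helper for the frame-count rule (valid counts are 17, 49, 81, ...):

```python
def snap_num_frames(target: int) -> int:
    """Round a requested frame count to the nearest valid Wan value,
    where (num_frames - 1) must be divisible by 4."""
    return max(5, round((target - 1) / 4) * 4 + 1)

assert snap_num_frames(80) == 81
assert snap_num_frames(121) == 121  # Wan 2.2 5B's longer clips are also 4k + 1
```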
Speech-to-Video (S2V)
Generate video driven by audio/speech input. Wan 2.2 only.
- Wan2.2-S2V-14B
- CosyVoice text-to-speech integration
- Audio-driven generation
Recommended Settings
| Parameter | Recommendation | Notes |
|---|---|---|
| Steps | 30-50 | 50 significantly better than 30; 70 no improvement over 50 |
| Flow shift | 3-5 | Lower = better details; too low = coherence issues |
| CFG | 5-7 | CFG 1.0 skips uncond for speed (~20 sec with 1.3B) |
| Resolution | 480p or 720p | Must be divisible by 16. Video models perform best at native res. |
| Frame rate | 16 fps | All Wan samples are 16 fps by default |
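Applied to a diffusers run, the table above maps onto the sampler settings roughly as follows. The UniPCMultistepScheduler flow_shift override is an assumption based on diffusers' Wan examples (flow_shift ≈ 3.0 for 480p, 5.0 for 720p):

```python
import torch
from diffusers import UniPCMultistepScheduler, WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Flow shift: lower values sharpen details, too low hurts coherence.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=5.0)

frames = pipe(
    prompt="...",
    height=720, width=1280,      # both divisible by 16
    num_frames=81,
    num_inference_steps=50,      # 50 is clearly better than 30; 70 adds nothing
    guidance_scale=5.0,          # CFG 5-7
).frames[0]
```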
CFG scheduling improves I2V
Using variable CFG through generation (e.g., 6 CFG for 18 steps, then 1 CFG for 18 steps) produces better motion and quality.
— JmySff
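A hedged way to sketch this in diffusers is a callback_on_step_end hook that mutates the pipeline's private _guidance_scale between steps; the private attribute is an assumption about pipeline internals, not stable API:

```python
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

def cfg_schedule(pipe, step_index, timestep, callback_kwargs):
    # 6.0 CFG for the first 18 steps, then 1.0 for the rest.
    # _guidance_scale is private; recent pipelines re-read it each step.
    pipe._guidance_scale = 6.0 if step_index < 18 else 1.0
    return callback_kwargs

frames = pipe(
    prompt="...",
    num_inference_steps=36,
    guidance_scale=6.0,
    callback_on_step_end=cfg_schedule,
).frames[0]
```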
FP32 text encoder produces sharper results
FP32 UMT5-XXL encoder shows noticeable quality improvement over BF16, similar to improvements seen across T5 family models.
— Pedro (@LatentSpacer)
Control Methods
VACE (Video Creation & Editing)
Built-in ControlNet-like capabilities for Wan. Tasks include Reference-to-Video, Video-to-Video, and Masked Video-to-Video.
Features: Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything
Fun Control Models
VideoX-Fun provides multiple control methods for Wan.
| Control Type | Use Case |
|---|---|
| Canny | Edge-guided generation |
| Depth | 3D structure preservation |
| Pose (DWPose/VitPose) | Character animation from skeleton |
| MLSD | Line segment detection |
| Trajectory | Path-based motion control |
| Camera | Pan, tilt, zoom, arc movements |
Fun VACE 2.2 is better than VACE 2.1
Better in every way from testing, even ignoring the extra High Noise part.
— Ablejones
Camera Control
ReCamMaster
Camera-controlled generative rendering from single video. Supports pan, tilt, zoom, translation, arc movements with variable speed.
Camera prompting technique for I2V
For pan right, mention 'camera reveals something' or 'camera pans down revealing a white tiled floor' - works for controlling camera movement.
— hicho
Motion Control
ATI (Any Trajectory Instruction)
Finally good motion trajectory control that feels natural and responsive for video generation.
WanAnimate
Extends automatically with sliding window. Uses inverted canny (white background, black edges). Can track facial features very well, even pupils.
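The inverted-canny control input is easy to prepare with OpenCV; a minimal sketch (file names are placeholders):

```python
import cv2

frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(frame, 100, 200)   # white edges on a black background
inverted = cv2.bitwise_not(edges)    # black edges on white, as WanAnimate expects
cv2.imwrite("control_frame.png", inverted)
```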
WanAnimate strength tips
At strength 2.0 on Wan 2.2 it's too much. Blocks 0-15 yield nice results. WanAnimate is generally too strong and ruins prompt following. Use start percentage 0.5 for better motion while still getting likeness.
— Kijai, Hashu
Character & Likeness
Phantom (Subject Consistency)
Single and multi-subject reference for consistent identity across generations. 14B trained on 480p, less stable at higher res.
- Describe reference images accurately in prompts
- Use horizontal orientations for stability
- Modify seed iteratively for quality
MAGREF (Multi-Reference)
Generate videos from multiple reference images with subject disentanglement. Any-reference video generation.
EchoShot (Multi-Shot Consistency)
Generate multiple video shots of the same person with consistent identity across different scenes.
Lynx (Face ID)
Lynx reference strength tuning
Strength 0.6 works well; 1.0+ is too strong and creates a 'face glued on video' effect, and 1.5+ creates nightmare fuel. Lynx works with VACE.
— Kijai
Lynx changes frame rate
Lynx makes Wan models run at 24fps instead of 16fps. This was originally intended only for the lite version.
— Kijai
Lip-Sync & Audio-Driven
HuMo (Human-Centric Multimodal)
Text + Audio (TA) or Text + Image + Audio (TIA) modes. Strong text prompt following, consistent subject preservation, synchronized audio-driven motion.
- scale_a: Audio guidance strength
- scale_t: Text guidance strength
- Default 50 steps, can use 30-40 for faster
- 720p resolution significantly improves quality
HuMo stops talking during silence
Unlike constant mouth movement issues in other models, HuMo respects silent clips in audio.
— Juan Gea
SageAttention hurts HuMo lip sync
Visible degradation in lip sync quality when SageAttention is enabled.
— Ablejones
MultiTalk (Multi-Person)
Single and multi-person video, cartoon characters, singing. 480p and 720p at arbitrary aspect ratios. TTS audio integration.
- --mode streaming: Long video generation
- --use_teacache: 2-3x speedup
- --sample_steps: 40 recommended (10 for faster)
- 480p single-GPU only in current code; 720p needs multi-GPU
InfiniteTalk (Unlimited Length)
Improvements over MultiTalk: reduces hand/body distortions, superior lip sync, sparse-frame dubbing.
- I2V beyond 1 minute: color shifts become pronounced
- V2V camera: not identical to original
- Keep steps at 4 or below, or use with lightx2v LoRA
FantasyTalking (Portrait Animation)
Audio-driven motion synthesis with text prompts for behavior control. Various body ranges and poses, characters and animals.
| Config | Speed | VRAM |
|---|---|---|
| Full precision (bf16) | 15.5s/iter | 40GB |
| 7B persistent params | 32.8s/iter | 20GB |
| Minimal (0 params) | 42.6s/iter | 5GB |
HuMo + InfiniteTalk embed mixing
Mixing embeds from both models provides better acting, better prompt adherence, and respects starting frame details more faithfully than either model alone.
— Juan Gea
Speed & Optimization
LightX2V (Distillation)
Distillation framework supporting Wan 2.1, 2.2, and other models. 4-step generation without CFG (see the sketch after the list below).
- Single-GPU speedup: 1.9x (H100), 1.5x (4090D)
- Multi-GPU speedup (8x H100): 3.9x
- Quantization: w8a8-int8, w8a8-fp8, w4a4-nvfp4
- Supports Sage Attention, Flash Attention, TeaCache
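In diffusers terms, a distilled run looks roughly like this. The LoRA directory and file name are placeholders for whichever lightx2v step/CFG-distill LoRA you use, and load_lora_weights support for Wan pipelines is an assumption about your diffusers version:

```python
import torch
from diffusers import WanPipeline

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Placeholder LoRA location; substitute the distill LoRA you downloaded.
pipe.load_lora_weights(
    "path/to/lora/dir", weight_name="lightx2v_step_cfg_distill_lora.safetensors"
)

frames = pipe(
    prompt="...",
    num_inference_steps=4,   # distilled: 4 steps
    guidance_scale=1.0,      # no CFG, so the uncond pass is skipped
).frames[0]
```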
LightX2V + FastWan LoRA speed
LightX2V + FastWan at 2 steps: 31.49 seconds vs 70.63 seconds for LightX2V alone at 4 steps.
— VRGameDevGirl84 (RTX 5090)
CausVid (Temporal Consistency)
Converts bidirectional diffusion to autoregressive for streaming generation. Block-wise causal attention with KV caching.
- 3-step inference achieves 84.27 on VBench-Long
- 1.3 second initial latency, then streaming
- LoRA versions V1, V1.5, V2 available
Wan2GP (Low VRAM)
GPU-poor friendly implementation with aggressive offloading.
- Memory profiles (1-5) trade speed for VRAM
- Sliding window for long videos
- Text encoder caching
Other Optimization Tips
TeaCache acceleration
TeaCache achieves ~2x speedup on Wan models. Threshold of 0.2-0.5 is optimal for MultiTalk.
PyTorch nightly with --fast flag
Uses fp16 + fp16 accumulation instead of fp16/bf16 + fp32 accumulation, 2x faster on NVIDIA GPUs.
— comfy
CFG 1.0 for speed
Using CFG 1.0 skips uncond and can make generation faster - ~20 seconds with the 1.3B model.
— Kijai
VAE tiling has no quality impact
VAE tiling at default settings vs no VAE tiling showed zero difference in quality.
— TK_999
Training
LoRA Training
Wan training is easier than Hunyuan
Better LoRA results in 2 epochs compared to hundreds with Hunyuan.
— samurzl
Training resolution limitations
Training at 256 resolution doesn't translate to higher resolutions as well as it does with Hunyuan; low-resolution training results don't scale up as effectively.
— samurzl
Control LoRAs can be trained on any condition
Control LoRAs can be used for deblurring, inpainting (train on videos with segments removed), interpolation, drawing trajectories based on optical flow, or interpreting hand signals and body movements as motion.
— pom
Training Tips
- LoRAs work with Wan video models using both wrapper and native nodes
- Can train water morphing LoRA on just 6 videos for 1,000 steps
- A LoRA trained on AnimateDiff outputs gives Wan-level quality with AnimateDiff-style movement
- Using the AnimateDiff LoRA at low strength adds subtle motion enhancement and works at just 6 steps
Frameworks
- DiffSynth-Studio: Enhanced support with quantization and LoRA training
- Kijai's wrapper: LoRA weight support in ComfyUI
Troubleshooting
Common Errors
mat1 and mat2 error for CLIP loader
Problem: CLIP loader only passes clip-l data, causing shape mismatch errors
Solution: Reinstall transformers and tokenizers:

```
pip uninstall -y transformers tokenizers
pip install transformers==4.48.0 tokenizers==0.21.0
```
— Faust-SiN
WanSelfAttention normalized_attention_guidance error
Problem: 'takes from 3 to 4 positional arguments but 8 were given'
Solution: Disable the WanVideo Apply NAG node. Ensure KJNodes and WanVideoWrapper are up to date.
— Nao48, JohnDopamine
Block swap prefetch causing black output
Problem: Setting prefetch higher than 1 causes black output with 'invalid value encountered in cast'
Solution: Keep prefetch count at 1 when using block swap.
— patientx, Kijai
49 frames error / max_seq_len error
Problem: I2V fails with sequence length errors
Solution: Use 81 frames minimum for I2V. This is hardcoded in the model.
Quality Issues
FP8_fast quantization causes artifacts
Problem: Color/noise issues with fp8
Solution: Keep img_embed weights at fp32. Native implementation has yellowish hue with fp8.
— Kijai
Tiled VAE decode fixes washed out frames
Problem: Regular VAE decode causes extremely washed out frames (except first)
Solution: Use tiled VAE decode.
— Screeb
Video colorspace/color shift
Problem: Colors look different after save/load cycle
Solution: Use Load Video (FFmpeg) instead of VHS Load Video for correct colorspace handling.
— Austin Mroz
Long video color drift
Problem: Color shifts become pronounced in I2V beyond 1 minute
Solution: Use SVI Pro LoRA for chained generations, or accept the limitation for streaming modes.
Performance Issues
Sampler V2 preview not showing
Problem: Live preview not working on new sampler
Solution: Change 'Live Preview Method' in ComfyUI settings to latent2rgb.
— lostintranslation
SAM3 masking VRAM leak
Problem: VRAM leak when using SAM3 masking
Solution: Run SAM3, disable it, then load the video for the mask.
— mdkb
FP8 model loading slow in native
Problem: FP8 model takes 10 seconds to load in native
Solution: Use Kijai's wrapper - FP8 loads in 1 second.
— hicho
Shift Values & Settings
Shift values for distilled LoRAs
Problem: Unclear what shift value to use with distilled LoRAs
Solution: Use shift=5 for Wan distilled LoRAs. Higher resolution needs higher shift because more resolution increases signal at a given noise level.
— spacepxl
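The intuition matches the standard flow-matching shift warp (the form used by diffusers' flow schedulers), which stretches the schedule toward high noise as shift grows; a small sketch:

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-shift time warp: sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    Larger shift spends more of the schedule at high noise, which higher
    resolutions need because they carry more signal at a given noise level."""
    return shift * sigma / (1 + (shift - 1) * sigma)

print(shift_sigma(0.5, 1.0))  # 0.5    - no shift
print(shift_sigma(0.5, 5.0))  # ~0.833 - pushed toward high noise
```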
Flow shift affects motion
Lower flow shift (3-5) = better details. Too low = coherence issues. Flow shift also affects motion speed.
— Juampab12, ezMan
Resources
Official Repositories
ComfyUI Integration
- Kijai's WanVideoWrapper - Supports 20+ models
- Official ComfyUI Wan Tutorial
- ComfyUI VACE Tutorial
- ComfyUI Wan 2.2 Tutorial
Control & Fun Variants
- VideoX-Fun - Canny, depth, pose, trajectory, camera
- ReCamMaster - Camera control
Character & Audio Models
- Phantom - Subject consistency
- MAGREF - Multi-reference
- EchoShot - Multi-shot consistency
- HuMo - Human-centric multimodal
- MultiTalk - Multi-person conversations
- InfiniteTalk - Unlimited length
- FantasyTalking - Portrait animation
Optimization
Community
- Banodoco Discord - #wan_chatter, #wan_training, #wan_comfyui, #wan_resources
- Civitai - Community LoRAs and workflows
Tips from the Community
- 2x VAE upscaler: spacepxl's decoder acts as a free 2x upscaler and kills noise grid patterns
- VitPose for animals: Use thickness=20 for animal motion in SCAIL
- Model mixing: Using 2.1 as high model + 2.2 low gives content of 2.1 with look of 2.2