Wan Video Ecosystem

Alibaba's open source video generation models: Wan 2.1, Wan 2.2, and the ecosystem of control, character, and optimization tools built around them

Last updated: February 3, 2026 · Source: ~316K Discord messages + external docs · First release: February 2025

📖 Overview

Wan is an ecosystem of video generation models released by Alibaba starting in February 2025. Unlike single models like LTX Video, Wan comprises multiple model versions (2.1, 2.2), sizes (1.3B, 5B, 14B), and specialized variants (VACE, Fun, character models) that work together.

The Wan Family

| Version | Key Models | Strengths |
| --- | --- | --- |
| Wan 2.1 | T2V 1.3B/14B, I2V 14B | Most stable, best ecosystem support, VACE control |
| Wan 2.2 | T2V/I2V A14B (MoE), TI2V-5B, S2V-14B | Better aesthetics, MoE architecture, speech-to-video |
| VACE | 1.3B, 14B | Built-in ControlNet-like capabilities (inpaint, reference, depth) |
| Fun | Control, InP, Camera | Canny, depth, pose, trajectory, camera control |

Key Characteristics

  • 16 FPS native output (some variants support 24 FPS)
  • Strong prompt adherence - follows complex scene descriptions on first try
  • Good camera movement - responds well to camera motion prompts
  • Defaults to Asian people - specify ethnicity in prompts if needed
  • CFG support - allows negative prompting and guidance
  • 81 frame minimum for I2V - hardcoded requirement

🎯 Choosing a Model

What do you want to do?

| If you want... | Use |
| --- | --- |
| Quick experimentation / limited VRAM | Wan 2.1 T2V 1.3B (8GB VRAM, ~4 min for 5s @ 480p on a 4090) |
| Best quality text-to-video | Wan 2.2 T2V A14B (MoE architecture, cinematic aesthetics) |
| Image-to-video with control | VACE 14B (reference images, masking, inpainting built in) |
| Pose/depth/edge control | Fun Control (Canny, depth, MLSD, pose, trajectory) |
| Character consistency across shots | Phantom (up to 4 reference images) or MAGREF (multi-subject) |
| Lip-sync / talking heads | HuMo (single person, good sync) or MultiTalk (multiple people) |
| Unlimited-length videos | InfiniteTalk (streaming mode) or SVI Pro LoRA (chained generations) |
| Fast generation (4-8 steps) | LightX2V or CausVid LoRA (distillation) |

Wan 2.1 vs 2.2

| Aspect | Wan 2.1 | Wan 2.2 |
| --- | --- | --- |
| Architecture | Standard transformer | MoE (separate high/low-noise experts) |
| Aesthetics | Good | Cinematic, more detailed |
| Ecosystem support | Excellent (VACE, Fun, most tools) | Growing (Fun 2.2 available) |
| Stability | More stable, well tested | Newer, some edge cases |
| VRAM | 14B: ~16GB fp8 | 5B hybrid fits in 8GB with offloading |
| Frame count | 81 frames typical | 5B supports 121 frames |

Community recommendation: Start with Wan 2.1 for the best ecosystem support and stability; move to 2.2 once you need the improved aesthetics or specific features like S2V.

🖥️ Hardware Requirements

| Model | VRAM (fp8) | Notes |
| --- | --- | --- |
| Wan 2.1 T2V 1.3B | ~8GB | Works on most consumer GPUs |
| Wan 2.1 T2V 14B | ~16-20GB | DiffSynth can run it under 24GB |
| Wan 2.1 I2V 14B | ~16.5GB | 66GB in fp32 |
| Wan 2.2 TI2V-5B | ~8GB | Hybrid T2V/I2V, consumer-GPU compatible |
| VACE 14B | ~20GB | 720p: ~40 min on a 4090 |
| HuMo 1.7B | 32GB | 480p in ~8 min |
| HuMo 17B | 24GB+ | Runs on a 3090 via the ComfyUI wrapper |
| MultiTalk | 24GB (4090) | 8GB possible with persistence=0 |
| MAGREF | ~70GB recommended | Multi-GPU (8x) supported via FSDP |

Low VRAM Options

  • Wan2GP - Runs Wan on as low as 6GB VRAM with offloading
  • GGUF quantization - Reduces model size with some quality trade-off (see the loading sketch after this list)
  • Block swapping - Kijai's wrapper supports VRAM optimization
  • LightX2V - 8GB VRAM + 16GB RAM minimum with offloading
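
For illustration, here is what GGUF loading plus offloading can look like outside ComfyUI, using the diffusers GGUF loader. This is a hedged sketch: the .gguf file path is a placeholder, the repo ID is assumed, and GGUF support for the Wan transformer depends on your diffusers version (inside ComfyUI, the equivalent is the ComfyUI-GGUF loader node).

# Hedged sketch: GGUF-quantized Wan transformer + CPU offload in diffusers.
# The .gguf path is a placeholder; GGUF support for WanTransformer3DModel is
# an assumption about your diffusers version.
import torch
from diffusers import GGUFQuantizationConfig, WanPipeline, WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "path/to/wan2.1-t2v-14b-Q4_K_M.gguf",                      # placeholder filename
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",                         # assumed repo ID
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()   # keep only the active submodule on the GPU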

Text encoder is heavier than the 1.3B model

Wan uses the umt5-xxl text encoder, which is 10-11GB - larger than the 1.3B DiT itself. It can be quantized to fp8 without issues.

— Fannovel16, Kijai

VAE is very efficient

Wan's VAE is only 250MB in bf16, much smaller than other models. It's fast and doesn't need a config file.

— Kijai

🎬 Generation Modes

Text-to-Video (T2V)

Generate video from text prompts alone.

  • 1.3B: 480p, fast, good for iteration
  • 14B: Higher quality, 480p and 720p
  • 2.2 A14B: MoE architecture, best aesthetics
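
For orientation, here is a minimal T2V call through the Hugging Face diffusers integration. Treat it as a sketch: most people in this community use ComfyUI or the official CLI instead, and the `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` repo ID and exact `WanPipeline` defaults are assumptions about current diffusers releases.

# Minimal T2V sketch (assumes a diffusers release with Wan support).
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",    # assumed repo ID
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()             # or pipe.to("cuda") if VRAM allows

frames = pipe(
    prompt="A red fox running through fresh snow, cinematic lighting",
    negative_prompt="blurry, low quality, static",
    height=480, width=832,                  # both must be divisible by 16
    num_frames=81,                          # 4k+1 frame counts; 81 is the usual default
    num_inference_steps=30,
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "wan_t2v.mp4", fps=16)   # Wan output is 16 fps native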

T2V is much faster than I2V

T2V generation takes ~130 seconds vs much longer I2V times for similar settings.

— TK_999

Image-to-Video (I2V)

Animate a starting image into video.

  • 81 frame minimum - hardcoded in encode_image function
  • Works better at 720p+ - lower resolutions perform poorly
  • Can chain for extensions - take last frame, feed to I2V for seamless extensions

I2V frame requirement

Wan I2V requires at least 81 frames; using 49 frames causes max_seq_len errors. The requirement is hardcoded in the model.

First-Last-Frame (FLF2V)

Generate video between two keyframes - a starting and ending image.

  • Wan2.1-FLF2V-14B-720P
  • Frame count must satisfy (length - 1) divisible by 4, i.e. 4k + 1 frames (see the helper sketch after this list)
  • Useful for controlled transitions
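
A small helper combining the two frame-count rules above (the 4k + 1 rule and the 81-frame I2V minimum); illustrative only:

# Frame-count sanity check: (num_frames - 1) must be divisible by 4,
# and Wan I2V needs at least 81 frames.
def check_frame_count(num_frames: int, mode: str = "i2v") -> int:
    if (num_frames - 1) % 4 != 0:
        # snap down to the nearest valid 4k+1 count
        num_frames = ((num_frames - 1) // 4) * 4 + 1
    if mode == "i2v" and num_frames < 81:
        num_frames = 81
    return num_frames

print(check_frame_count(49))   # -> 81 (too short for I2V)
print(check_frame_count(100))  # -> 97 (nearest 4k+1 count below 100)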

Speech-to-Video (S2V)

Generate video driven by audio/speech input. Wan 2.2 only.

  • Wan2.2-S2V-14B
  • CosyVoice text-to-speech integration
  • Audio-driven generation

Recommended Settings

| Parameter | Recommendation | Notes |
| --- | --- | --- |
| Steps | 30-50 | 50 is significantly better than 30; 70 shows no improvement over 50 |
| Flow shift | 3-5 | Lower = better details; too low = coherence issues |
| CFG | 5-7 | CFG 1.0 skips the uncond pass for speed (~20 sec with the 1.3B model) |
| Resolution | 480p or 720p | Must be divisible by 16; video models perform best at native resolution |
| Frame rate | 16 fps | All Wan samples are 16 fps by default |
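
The same recommendations as a pre-flight check. Only the divisible-by-16 rule is a hard constraint; the rest are community guidance from the table above, so the function only warns about them.

# Illustrative pre-flight check for the recommended settings above.
def check_settings(width: int, height: int, steps: int, cfg: float, shift: float) -> None:
    assert width % 16 == 0 and height % 16 == 0, "width/height must be divisible by 16"
    if not 30 <= steps <= 50:
        print("note: 30-50 steps recommended; beyond 50 shows no visible gain")
    if cfg != 1.0 and not 5.0 <= cfg <= 7.0:
        print("note: CFG 5-7 is typical; 1.0 skips the uncond pass for speed")
    if not 3.0 <= shift <= 5.0:
        print("note: flow shift 3-5; lower = more detail, too low hurts coherence")

check_settings(width=832, height=480, steps=30, cfg=5.0, shift=5.0)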

CFG scheduling improves I2V

Using variable CFG through generation (e.g., 6 CFG for 18 steps, then 1 CFG for 18 steps) produces better motion and quality.

— JmySff
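
One way to approximate that schedule in a diffusers-style pipeline is a `callback_on_step_end` hook that lowers the guidance scale partway through sampling. This is a hedged sketch: it assumes WanPipeline exposes the standard callback hook, and `_guidance_scale` is an internal attribute that may change between versions.

# Hedged sketch: 6.0 CFG for the first half of sampling, 1.0 for the second half.
TOTAL_STEPS = 36

def cfg_schedule(pipeline, step_index, timestep, callback_kwargs):
    pipeline._guidance_scale = 6.0 if step_index < TOTAL_STEPS // 2 else 1.0
    return callback_kwargs

frames = pipe(
    prompt="...",
    num_inference_steps=TOTAL_STEPS,
    guidance_scale=6.0,
    callback_on_step_end=cfg_schedule,
).frames[0]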

FP32 text encoder produces sharper results

FP32 UMT5-XXL encoder shows noticeable quality improvement over BF16, similar to improvements seen across T5 family models.

— Pedro (@LatentSpacer)

🎛️ Control Methods

VACE (Video Creation & Editing)

Built-in ControlNet-like capabilities for Wan. Tasks include Reference-to-Video, Video-to-Video, and Masked Video-to-Video.

VACE

1.3B (480p only) · 14B (480p + 720p) · ~40 min @ 720p on a 4090

Features: Move-Anything, Swap-Anything, Reference-Anything, Expand-Anything, Animate-Anything

GitHub | HuggingFace

VACE masking tip

Latent noise techniques can eliminate VACE masking degradation, allowing seamless object insertion into videos.

— pom

Fun Control Models

VideoX-Fun provides multiple control methods for Wan.

| Control Type | Use Case |
| --- | --- |
| Canny | Edge-guided generation |
| Depth | 3D structure preservation |
| Pose (DWPose/VitPose) | Character animation from a skeleton |
| MLSD | Line-segment detection |
| Trajectory | Path-based motion control |
| Camera | Pan, tilt, zoom, arc movements |

Fun VACE 2.2 is better than VACE 2.1

Better in every way in testing, even ignoring the extra high-noise model.

— Ablejones

Camera Control

ReCamMaster

10 preset trajectories · Requires 81+ frames

Camera-controlled generative rendering from single video. Supports pan, tilt, zoom, translation, arc movements with variable speed.

GitHub

Camera prompting technique for I2V

For camera pans, mention what the camera reveals - e.g. 'camera reveals something' or 'camera pans down revealing a white tiled floor' - this works for controlling camera movement.

— hicho

Motion Control

ATI (Any-Thing-Is-Trajectory)

Trajectory control

Finally good motion trajectory control that feels natural and responsive for video generation.

WanAnimate

1 frame overlap · Sliding window extension

Extends automatically with sliding window. Uses inverted canny (white background, black edges). Can track facial features very well, even pupils.

WanAnimate strength tips

At strength 2.0 on Wan 2.2 it's too much. Blocks 0-15 yield nice results. WanAnimate is generally too strong and ruins prompt following. Use start percentage 0.5 for better motion while still getting likeness.

— Kijai, Hashu

👤 Character & Likeness

Phantom (Subject Consistency)

Phantom

1.3B: 480p/720p · 14B: 480p/720p · Up to 4 reference images

Single- and multi-subject reference for consistent identity across generations. The 14B model was trained at 480p and is less stable at higher resolutions.

  • Describe reference images accurately in prompts
  • Use horizontal orientations for stability
  • Modify seed iteratively for quality

GitHub | Paper

MAGREF (Multi-Reference)

MAGREF

~70GB VRAM recommended · Multi-GPU via FSDP · FP8 variant available

Generate videos from multiple reference images with subject disentanglement. Any-reference video generation.

GitHub

EchoShot (Multi-Shot Consistency)

EchoShot

Built on Wan 2.1 T2V 1.3B · LLM prompt extension

Generate multiple video shots of the same person with consistent identity across different scenes.

GitHub

Lynx (Face ID)

Lynx reference strength tuning

A reference strength of 0.6 works well. 1.0+ is too strong and creates a 'face glued on video' effect; 1.5+ creates nightmare fuel. Lynx works with VACE.

— Kijai

Lynx changes frame rate

Lynx makes Wan models run at 24fps instead of 16fps. This was originally intended only for the lite version.

— Kijai

🎤 Lip-Sync & Audio-Driven

HuMo (Human-Centric Multimodal)

HuMo

1.7B: 32GB GPU, ~8 min @ 480p · 17B: 3090 via ComfyUI · Whisper audio encoder

Text + Audio (TA) or Text + Image + Audio (TIA) modes. Strong text prompt following, consistent subject preservation, synchronized audio-driven motion.

  • scale_a: Audio guidance strength
  • scale_t: Text guidance strength
  • Default 50 steps, can use 30-40 for faster
  • 720p resolution significantly improves quality

GitHub | HuggingFace

HuMo stops talking during silence

Unlike constant mouth movement issues in other models, HuMo respects silent clips in audio.

— Juan Gea

SageAttention hurts HuMo lip sync

Visible degradation in lip sync quality when SageAttention is enabled.

— Ablejones

MultiTalk (Multi-Person)

MultiTalk

Base: Wan 2.1 I2V 14B 480P · RTX 4090 level (8GB possible) · Up to 15 sec (201 frames)

Single and multi-person video, cartoon characters, singing. 480p and 720p at arbitrary aspect ratios. TTS audio integration.

  • --mode streaming: Long video
  • --use_teacache: 2-3x speedup
  • --sample_steps: 40 recommended (10 for faster)
  • 480p single-GPU only in current code; 720p needs multi-GPU

GitHub

InfiniteTalk (Unlimited Length)

InfiniteTalk

Base: Wan 2.1 I2V 14B + LoRA · Streaming mode

Improvements over MultiTalk: reduces hand/body distortions, superior lip sync, sparse-frame dubbing.

  • I2V beyond 1 minute: color shifts become pronounced
  • V2V camera: not identical to original
  • Keep steps at 4 or below, or use with lightx2v LoRA

GitHub

FantasyTalking (Portrait Animation)

FantasyTalking

Base: Wan 2.1 I2V 14B 720p · Wav2Vec2 audio encoder

Audio-driven motion synthesis with text prompts for behavior control. Various body ranges and poses, characters and animals.

| Config | Speed | VRAM |
| --- | --- | --- |
| Full precision (bf16) | 15.5 s/iter | 40GB |
| 7B persistent params | 32.8 s/iter | 20GB |
| Minimal (0 persistent params) | 42.6 s/iter | 5GB |

GitHub

HuMo + InfiniteTalk embed mixing

Mixing embeds from both models provides better acting, better prompt adherence, and respects starting frame details more faithfully than either model alone.

— Juan Gea

⚡ Speed & Optimization

LightX2V (Distillation)

LightX2V

4-step distilled models · 8GB VRAM + 16GB RAM minimum · Up to 42x acceleration

Distillation framework supporting Wan 2.1, 2.2, and other models. 4-step generation without CFG.

  • Single GPU: 1.9x (H100), 1.5x (4090D)
  • Multi-GPU (8x H100): 3.9x
  • Quantization: w8a8-int8, w8a8-fp8, w4a4-nvfp4
  • Supports Sage Attention, Flash Attention, TeaCache

GitHub | HuggingFace
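
The distilled checkpoints are commonly applied as a LoRA over a base Wan pipeline and sampled in 4 steps without CFG. A hedged sketch, reusing a `WanPipeline` object like the one in the T2V sketch earlier; the LoRA file path is a placeholder, and LoRA loading support for Wan is assumed for your diffusers version.

# Hedged sketch: 4-step generation with a distilled (LightX2V-style) LoRA.
# The LoRA path is a placeholder - take the real file from the LightX2V HF page.
pipe.load_lora_weights("path/to/lightx2v_4step_lora.safetensors")

frames = pipe(
    prompt="...",
    num_inference_steps=4,    # distilled models target 4-8 steps
    guidance_scale=1.0,       # distillation removes the need for CFG
    num_frames=81,
).frames[0]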

LightX2V + FastWan LoRA speed

LightX2V + FastWan at 2 steps: 31.49 seconds vs 70.63 seconds for LightX2V alone at 4 steps.

— VRGameDevGirl84 (RTX 5090)

CausVid (Temporal Consistency)

CausVid

50 → 4 steps via DMD · 9.4 FPS streaming · V2 quality ≈ base Wan 2.1

Converts bidirectional diffusion to autoregressive for streaming generation. Block-wise causal attention with KV caching.

  • 3-step inference achieves 84.27 on VBench-Long
  • 1.3 second initial latency, then streaming
  • LoRA versions V1, V1.5, V2 available

GitHub

Wan2GP (Low VRAM)

Wan2GP

As low as 6GB VRAM · Old NVIDIA (10XX, 20XX) support · AMD Radeon support

GPU-poor friendly implementation with aggressive offloading.

  • Memory profiles (1-5) trade speed for VRAM
  • Sliding window for long videos
  • Text encoder caching

GitHub

Other Optimization Tips

TeaCache acceleration

TeaCache achieves ~2x speedup on Wan models. Threshold of 0.2-0.5 is optimal for MultiTalk.

PyTorch nightly with --fast flag

Uses fp16 + fp16 accumulation instead of fp16/bf16 + fp32 accumulation, 2x faster on NVIDIA GPUs.

— comfy

CFG 1.0 for speed

Using CFG 1.0 skips uncond and can make generation faster - ~20 seconds with the 1.3B model.

— Kijai

VAE tiling has no quality impact

VAE tiling at default settings vs no VAE tiling showed zero difference in quality.

— TK_999
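
A hedged sketch of the two cheapest wins above in a diffusers-style setup, reusing a `pipe` object like the one in the earlier T2V sketch; it assumes the Wan VAE exposes the standard `enable_tiling()` hook.

# Tiled VAE decode (no visible quality cost) + CFG 1.0 (skips the uncond pass).
pipe.vae.enable_tiling()      # assumed: standard AutoencoderKL-style tiling hook

frames = pipe(
    prompt="...",
    guidance_scale=1.0,       # skips the unconditional pass, roughly halving sampling cost
    num_inference_steps=30,
).frames[0]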

🎓 Training

LoRA Training

Wan training is easier than Hunyuan

Better LoRA results in 2 epochs compared to hundreds with Hunyuan.

— samurzl

Training resolution limitations

Training on 256 resolution doesn't translate as well to higher resolutions compared to Hunyuan. Lower resolution training results don't scale up as effectively.

— samurzl

Control LoRAs can be trained on any condition

Control LoRAs can be trained for deblurring; for inpainting, by training on videos with segments removed; for interpolation; for drawing trajectories based on optical flow; or for interpreting hand signals and body movements as motion.

— pom

Training Tips

  • LoRAs work with Wan video models using both wrapper and native nodes
  • Can train water morphing LoRA on just 6 videos for 1,000 steps
  • A LoRA trained on AnimateDiff outputs allows Wan-level motion with AnimateDiff-style movement
  • Using the AnimateDiff LoRA at low strengths adds subtle motion enhancement and works at just 6 steps

Frameworks

  • DiffSynth-Studio: Enhanced support with quantization and LoRA training
  • Kijai's wrapper: LoRA weight support in ComfyUI

🔧 Troubleshooting

Common Errors

mat1 and mat2 error for CLIP loader

Problem: CLIP loader only passing clip-l data, shape mismatch errors

Solution: Reinstall transformers and tokenizers:

pip uninstall -y transformers tokenizers
pip install transformers==4.48.0 tokenizers==0.21.0
— Faust-SiN

WanSelfAttention normalized_attention_guidance error

Problem: 'takes from 3 to 4 positional arguments but 8 were given'

Solution: Disable the WanVideo Apply NAG node. Ensure KJNodes and WanVideoWrapper are up to date.

— Nao48, JohnDopamine

Block swap prefetch causing black output

Problem: Setting prefetch higher than 1 causes black output with 'invalid value encountered in cast'

Solution: Keep prefetch count at 1 when using block swap.

— patientx, Kijai

49 frames error / max_seq_len error

Problem: I2V fails with sequence length errors

Solution: Use 81 frames minimum for I2V. This is hardcoded in the model.

Quality Issues

FP8_fast quantization causes artifacts

Problem: Color/noise issues with fp8

Solution: Keep img_embed weights at fp32. Native implementation has yellowish hue with fp8.

— Kijai

Tiled VAE decode fixes washed out frames

Problem: Regular VAE decode causes extremely washed out frames (except first)

Solution: Use tiled VAE decode.

— Screeb

Video colorspace/color shift

Problem: Colors look different after save/load cycle

Solution: Use Load Video (FFmpeg) instead of VHS Load Video for correct colorspace handling.

— Austin Mroz

Long video color drift

Problem: Color shifts become pronounced in I2V beyond 1 minute

Solution: Use SVI Pro LoRA for chained generations, or accept the limitation for streaming modes.

Performance Issues

Sampler V2 preview not showing

Problem: Live preview not working on new sampler

Solution: Change 'Live Preview Method' in ComfyUI settings to latent2rgb.

— lostintranslation

SAM3 masking VRAM leak

Problem: VRAM leak when using SAM3 masking

Solution: Run SAM3, disable it, then load the video for the mask.

— mdkb

FP8 model loading slow in native

Problem: FP8 model takes 10 seconds to load in native

Solution: Use Kijai's wrapper - FP8 loads in 1 second.

— hicho

Shift Values & Settings

Shift values for distilled LoRAs

Problem: Unclear what shift value to use with distilled LoRAs

Solution: Use shift=5 for Wan distilled LoRAs. Higher resolution needs higher shift because more resolution increases signal at a given noise level.

— spacepxl

Flow shift affects motion

Lower flow shift (3-5) = better details. Too low = coherence issues. Flow shift also affects motion speed.

— Juampab12, ezMan
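
For intuition, the flow shift is the timestep/sigma shift used by flow-matching schedulers: a larger shift pushes more of the step budget toward the high-noise end, which higher resolutions need. A minimal sketch, assuming the common SD3-style shift formula:

# Assumed SD3-style flow-matching sigma shift: higher shift spends more steps
# at high noise, which is why higher resolutions want a higher shift value.
def shift_sigma(sigma: float, shift: float) -> float:
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)

for s in (0.25, 0.5, 0.75):
    print(s, "->", round(shift_sigma(s, shift=5.0), 3))
# 0.25 -> 0.625, 0.5 -> 0.833, 0.75 -> 0.938  (sigmas pushed toward the noisy end)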

🔗 Resources

Official Repositories

ComfyUI Integration

Control & Fun Variants

Character & Audio Models

Optimization

Community

  • Banodoco Discord - #wan_chatter, #wan_training, #wan_comfyui, #wan_resources
  • Civitai - Community LoRAs and workflows

Tips from the Community

  • 2x VAE upscaler: spacepxl's decoder acts as a free 2x upscaler and removes noise-grid patterns
  • VitPose for animals: Use thickness=20 for animal motion in SCAIL
  • Model mixing: Using Wan 2.1 as the high-noise model and Wan 2.2 as the low-noise model gives the content of 2.1 with the look of 2.2