LTX Video 2
19B parameter video generation model with integrated audio, released January 5, 2026
Overview
LTX Video 2 is a 19B parameter video generation model released by Lightricks in partnership with NVIDIA on January 5, 2026. It features integrated audio generation, built-in upscaling, and supports text-to-video, image-to-video, and video-to-video generation.
Key Features
- Integrated Audio: Generates spatially-aware audio that responds to visual content
- Fast Generation: Near real-time with distilled model (~6 seconds for 121 frames at 720p on RTX 4090)
- Built-in Upscaling: Spatial and temporal upscalers included
- 24 FPS Output: Higher frame rate than most competitors
- Low VRAM Option: Can run on 8GB VRAM with RAM offloading
Model Variants
| Model | Size | Notes |
|---|---|---|
| ltx-video-2-1 (fp8) | ~27GB | Full model with VAE + audio, recommended for quality |
| ltx-video-2-1 (GGUF Q8) | ~20GB | Quantized version, good balance of quality/size |
| Distilled LoRA | +384MB | 8-step generation instead of 20, slight quality trade-off |
Hardware Requirements
Low-VRAM setups rely on RAM offloading, enabled with ComfyUI's --reserve-vram flag (e.g. --reserve-vram 20).
| GPU | VRAM | Capability |
|---|---|---|
| RTX 5090 | 32GB | 832x480, 241 frames with fp8 + distill LoRA |
| RTX 4090 | 24GB | 720p, 121 frames (~6 sec with distilled) |
| RTX 3090 | 24GB | 720p generation confirmed working |
| RTX 4070 Ti Super | 16GB | Works with GGUF models + offloading |
| 8GB cards | 8GB | Possible with 64GB+ RAM, heavy offloading |
RAM Requirements for Offloading
When using VRAM offloading, ensure you have sufficient system RAM. 64GB recommended for comfortable operation with 8GB VRAM GPUs.
— Kijai
Recommended Settings
| Parameter | Recommended | Notes |
|---|---|---|
| Resolution | 1280x720 or higher | Must be divisible by 32. Below 720p tends to perform poorly. |
| Duration | ≤10 seconds (official) | 20s works for some users. Quality degrades at 30s+ |
| Steps | 20 (base) / 8 (distilled) | Use distilled LoRA for faster generation |
| Scheduler | Euler (default) | Euler_A better for anime/art content |
| Model precision | fp8 | Preferred over GGUF for quality |
| Frame count formula | (8n)+1 | VAE compresses 8 frames to 1 latent |
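The resolution and frame-count constraints above can be sketched as small helpers. This is an illustrative sketch, not part of any official API: it snaps dimensions to multiples of 32 and frame counts to the (8n)+1 pattern, and maps pixel frames back to latent frames under the 8-to-1 VAE compression described in this document.

```python
def snap_resolution(width: int, height: int) -> tuple[int, int]:
    # LTX Video 2 requires dimensions divisible by 32.
    return (width // 32) * 32, (height // 32) * 32

def snap_frame_count(frames: int) -> int:
    # Frame counts must satisfy (8n)+1: the VAE packs 8 pixel frames
    # into each latent frame, plus one initial frame.
    n = max(1, round((frames - 1) / 8))
    return 8 * n + 1

def latent_frames(pixel_frames: int) -> int:
    # Inverse mapping: e.g. 121 pixel frames decode from 16 latents.
    return (pixel_frames - 1) // 8 + 1

print(snap_resolution(1283, 725))  # -> (1280, 704)
print(snap_frame_count(120))       # -> 121
print(latent_frames(121))          # -> 16
```

A nominal 5-second clip at 24 FPS (120 frames) therefore snaps to 121 frames, which is exactly the count quoted in the hardware table.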
Technical Discoveries
Audio Generation
Audio is spatially aware
Audio changes based on position; footsteps get louder as a character approaches the camera.
— Lodis
Audio continuation maintains voice consistency
The model can take audio from a video input and continue generating with the same voice characteristics.
— harelcain
Multi-language support
Supports multiple languages including Hindi, Russian, and Chinese for audio generation.
— Govind Singh
Context-aware accents
The model generated an Indian accent when Indian doctors appeared in the video, without being prompted for one.
— Tachyon
Architecture & Performance
All-in-one model file
The 27GB fp8 model includes the 19B-parameter video model, audio processing, and the VAE in a single file.
— Ada
VAE compression ratio
VAE compresses 8 frames to 1 latent frame. 16 latents decode to 121 pixel frames.
— Dragonyte
Near real-time generation
121 frames at 720p generates in ~6 seconds on RTX 4090 with distilled model.
— Kijai
Depth information included
LTX Video 2 includes depth information in decoded latents.
— Kijai
I2V & Generation Modes
I2V works better at higher resolutions
1280x720 works well; anything below that tends to perform poorly.
— Tachyon
Multiple input modes supported
Supports text-to-video, image-to-video, video-to-video, and audio-to-video generation.
— lucifer
Prompt strength vs LoRA
Camera movement prompts override static camera LoRA settings.
— burgstall
Known Limitations
Portrait orientation issues
Portrait aspect ratios cause issues with generation quality and motion. Stick to landscape.
— Cubey
Duration limits
Outputs go out of distribution and break down at 30 seconds; best to keep clips to 20 seconds max.
— sometimesTwitchy
Complex motion breakdown
Model struggles with complex motion like gymnastics where anatomy breaks down during flips.
— dj47
Text generation is weak
The model struggles to generate legible, correctly spelled text.
— harelcain
Limited anime training
The training dataset is mainly cinematic landscape video; the model wasn't trained on much anime.
— dj47
832x480 performs poorly
Most results had no motion, and outputs generally looked poor at this resolution.
— Cubey
Troubleshooting
Common Errors & Fixes
Problem: ModuleNotFoundError for audio_vae
Solution: Install version ≥0.3.0 of the ComfyUI-LTXVideo custom node.
Problem: Sampling errors during generation
Solution: Disable live preview in ComfyUI settings.
— Cubey
Problem: CUDA out of memory
Solution: Use GGUF Q8 model instead of fp8, enable RAM offloading with --reserve-vram flag, or reduce resolution/frame count.
Problem: No motion in output
Solution: Increase resolution to at least 720p. Lower resolutions like 832x480 often produce static results.
Problem: Audio not generating
Solution: Ensure you're using the full model file that includes audio VAE, not a video-only variant.
Quality Issues
Problem: Blurry output at 4K
Solution: Generate at 720p/1080p and use external upscaler. Native 4K still has blurriness issues.
Problem: Anatomy breakdown in complex motion
Solution: Avoid complex motions like flips/gymnastics. Use shorter clips and simpler movements.
Problem: Quality degradation at long durations
Solution: Keep clips under 20 seconds. For longer content, generate multiple clips and stitch together.
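The stitch-together workaround above can be scripted with ffmpeg's concat demuxer. This is a sketch, not part of any LTX tooling: it assumes ffmpeg is on your PATH and that all clips share the same codec, resolution, and frame rate (true when they come from the same generation settings); file names are placeholders.

```python
import pathlib
import subprocess
import tempfile

def concat_clips(clips, output="combined.mp4", run=False):
    """Build (and optionally run) an ffmpeg concat-demuxer command
    that stitches clips losslessly via stream copy."""
    listing = tempfile.NamedTemporaryFile(mode="w", suffix=".txt",
                                          delete=False)
    with listing:
        for clip in clips:
            # The concat demuxer expects one "file 'path'" line per clip.
            listing.write(f"file '{pathlib.Path(clip).as_posix()}'\n")
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", listing.name, "-c", "copy", output]
    if run:
        subprocess.run(cmd, check=True)
    return cmd

# Example (placeholder file names):
concat_clips(["clip_01.mp4", "clip_02.mp4"], output="scene.mp4")
```

Stream copy (-c copy) avoids re-encoding, so stitching adds no generation-quality loss at the joins.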
Workflows & Tips
Vid2vid with partial latent masking
Use latent masking to preserve specific regions while regenerating others. Useful for fixing artifacts or changing specific elements.
Use fp8 instead of GGUFs
fp8 models generally produce better quality than GGUF quantized versions when VRAM allows.
Temporal upscaler for smooth motion
Use built-in temporal latent upscaler to achieve effective double frame rate and reduce deformations.
— harelcain
Distilled LoRA for iteration speed
Use ltx-2-19b-distilled-lora-384.safetensors at 0.6 weight for 8-step generation when iterating on prompts.
— Ada
Model Comparisons
LTX Video 2 vs Wan 2.2
| Aspect | LTX Video 2 | Wan 2.2 |
|---|---|---|
| Speed | Faster (near real-time with distilled) | Slower |
| Audio | Built-in, spatially aware | No native audio |
| Dynamic motion | Much better in same duration | More conservative |
| Fidelity | Higher | Good |
| Control options | Limited currently | VACE, Fun models, more mature |
Training & LoRAs
Available LoRAs
- Distilled LoRA: ltx-2-19b-distilled-lora-384.safetensors - 8-step generation
- Static Camera LoRA: Reduces camera movement (but can be overridden by prompts)
Training Tips
- Target the base model, not distilled
- Use diverse training data with multiple angles/lighting
- Short clips (5-10 seconds) work best for training data
Resources
Official Links
ComfyUI Integration
- ComfyUI-LTXVideo - Official custom nodes
- Kijai's LTXVideo Wrapper - Extended features
Community Resources
- Banodoco Discord - #ltx_chatter, #ltx_training, #ltx_resources
- Civitai - Community LoRAs and models