LTX Video 2

19B parameter video generation model with integrated audio, released January 5, 2026

šŸ“… Last updated: February 1, 2026 šŸ’¬ Source: ~44,500 Discord messages šŸ“Š ~4,345 knowledge items

šŸ“– Overview

LTX Video 2 is a 19B parameter video generation model released by Lightricks in partnership with NVIDIA on January 5, 2026. It features integrated audio generation, built-in upscaling, and supports text-to-video, image-to-video, and video-to-video generation.

Key Features

  • Integrated Audio: Generates spatially-aware audio that responds to visual content
  • Fast Generation: Near real-time with distilled model (~6 seconds for 121 frames at 720p on RTX 4090)
  • Built-in Upscaling: Spatial and temporal upscalers included
  • 24 FPS Output: Higher frame rate than most competitors
  • Low VRAM Option: Can run on 8GB VRAM with RAM offloading

Model Variants

| Model | Size | Notes |
|---|---|---|
| ltx-video-2-1 (fp8) | ~27GB | Full model with VAE + audio; recommended for quality |
| ltx-video-2-1 (GGUF Q8) | ~20GB | Quantized version; good balance of quality and size |
| Distilled LoRA | +384MB | 8-step generation instead of 20; slight quality trade-off |

šŸ–„ļø Hardware Requirements

Good news for low-VRAM users: LTX Video 2 can run on 8GB VRAM with 64GB+ system RAM using offloading. Use the --reserve-vram 20 flag.
| GPU | VRAM | Capability |
|---|---|---|
| RTX 5090 | 32GB | 832x480, 241 frames with fp8 + distill LoRA |
| RTX 4090 | 24GB | 720p, 121 frames (~6 sec with distilled) |
| RTX 3090 | 24GB | 720p generation confirmed working |
| RTX 4070 Ti Super | 16GB | Works with GGUF models + offloading |
| 8GB cards | 8GB | Possible with 64GB+ RAM, heavy offloading |

RAM Requirements for Offloading

When using VRAM offloading, ensure you have sufficient system RAM. 64GB recommended for comfortable operation with 8GB VRAM GPUs.

— Kijai

āš™ļø Recommended Settings

| Parameter | Recommended | Notes |
|---|---|---|
| Resolution | 1280x720 or higher | Must be divisible by 32; below 720p tends to perform poorly |
| Duration | ≤10 seconds (official) | 20s works for some users; quality degrades at 30s+ |
| Steps | 20 (base) / 8 (distilled) | Use distilled LoRA for faster generation |
| Scheduler | Euler (default) | Euler_A better for anime/art content |
| Model precision | fp8 | Preferred over GGUF for quality |
| Frame count formula | (8n)+1 | VAE compresses 8 frames to 1 latent |

Resolution Warning: Resolutions must be divisible by 32. Portrait orientations don't work well and cause quality/motion issues.
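
These constraints are easy to check up front. Below is a minimal sketch (plain Python, no dependencies; the function names are illustrative, not part of any LTX or ComfyUI tooling) that validates a resolution and snaps a frame count to the nearest valid (8n)+1 value:

```python
def resolution_is_valid(width: int, height: int) -> bool:
    """Both dimensions must be divisible by 32."""
    return width % 32 == 0 and height % 32 == 0

def snap_frame_count(frames: int) -> int:
    """Round to the nearest valid frame count of the form (8n)+1."""
    n = max(round((frames - 1) / 8), 0)
    return 8 * n + 1

print(resolution_is_valid(832, 480))  # True  (832 = 26*32, 480 = 15*32)
print(snap_frame_count(120))          # 121   (n = 15)
print(snap_frame_count(241))          # 241   (already valid, n = 30)
```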

šŸ”¬ Technical Discoveries

Audio Generation

Audio is spatially aware

Audio changes based on position: footsteps get louder as a character approaches the camera.

— Lodis

Audio continuation maintains voice consistency

Can take audio from a video input and continue generating with the same voice characteristics.

— harelcain

Multi-language support

Supports multiple languages including Hindi, Russian, and Chinese for audio generation.

— Govind Singh

Context-aware accents

The model generated an Indian accent when Indian doctors appeared in the video, without being prompted for an accent.

— Tachyon

Architecture & Performance

All-in-one model file

The 27GB fp8 model includes the video model (19B params), audio processing, and the VAE in a single file.

— Ada

VAE compression ratio

VAE compresses 8 frames to 1 latent frame. 16 latents decode to 121 pixel frames.

— Dragonyte
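
As a quick sanity check of that ratio (plain Python, illustrative only): the first latent decodes to one frame and each additional latent decodes to eight, which is the same (8n)+1 relationship as the frame-count formula above.

```python
def frames_from_latents(latents: int) -> int:
    # First latent decodes to 1 frame; each additional latent adds 8.
    return 1 + 8 * (latents - 1)

def latents_from_frames(frames: int) -> int:
    # Inverse mapping, assuming frames already has the form (8n)+1.
    return (frames - 1) // 8 + 1

print(frames_from_latents(16))   # 121 pixel frames, matching the quote
print(latents_from_frames(121))  # 16 latent frames
print(121 / 24)                  # ~5.04 seconds of video at 24 FPS
```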

Near real-time generation

121 frames at 720p generate in ~6 seconds on an RTX 4090 with the distilled model.

— Kijai

Depth information included

LTX Video 2 includes depth information in decoded latents.

— Kijai

I2V & Generation Modes

I2V works better at higher resolutions

1280x720 works well; anything below that tends to perform poorly.

— Tachyon

Multiple input modes supported

Text-to-video, image-to-video, video-to-video, and audio-to-video generation capabilities.

— l҈u҈c҈i҈f҈e҈r҈

Prompt strength vs LoRA

Camera movement prompts override static camera LoRA settings.

— burgstall

āš ļø Known Limitations

Portrait orientation issues

Portrait aspect ratios cause issues with generation quality and motion. Stick to landscape.

— Cubey

Duration limits

Out-of-distribution breakdown occurs at 30 seconds; it's probably best to keep clips to 20 seconds max.

— sometimesTwitchy

Complex motion breakdown

The model struggles with complex motion like gymnastics, where anatomy breaks down during flips.

— dj47

Text generation is weak

The model struggles to generate proper-looking text.

— harelcain

Limited anime training

The dataset is mainly cinematic landscape videos; the model wasn't trained on much anime.

— dj47

832x480 performs poorly

Most results had no motion, and the outputs generally didn't look good at this resolution.

— Cubey

šŸ”§ Troubleshooting

Common Errors & Fixes

Problem: ModuleNotFoundError for audio_vae

Solution: Install version ≄0.3.0 of the ComfyUI-LTXVideo custom node.

Problem: Sampling errors during generation

Solution: Disable live preview in ComfyUI settings.

— Cubey

Problem: CUDA out of memory

Solution: Use GGUF Q8 model instead of fp8, enable RAM offloading with --reserve-vram flag, or reduce resolution/frame count.
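
If you're unsure whether a given setup will fit, a quick diagnostic like the sketch below (assumes PyTorch is installed; the ~27GB figure comes from the model variants table above) reports free VRAM before you commit to fp8 vs. GGUF:

```python
import torch

if torch.cuda.is_available():
    # Free and total memory in bytes for the current CUDA device.
    free, total = torch.cuda.mem_get_info()
    free_gb = free / 1024**3
    print(f"Free VRAM: {free_gb:.1f} GB of {total / 1024**3:.1f} GB")
    if free_gb < 27:  # the fp8 model alone is ~27GB
        print("Consider the ~20GB GGUF Q8 variant, --reserve-vram "
              "offloading, or a lower resolution/frame count.")
else:
    print("No CUDA device detected.")
```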

Problem: No motion in output

Solution: Increase resolution to at least 720p. Lower resolutions like 832x480 often produce static results.

Problem: Audio not generating

Solution: Ensure you're using the full model file that includes audio VAE, not a video-only variant.

Quality Issues

Problem: Blurry output at 4K

Solution: Generate at 720p/1080p and use external upscaler. Native 4K still has blurriness issues.

Problem: Anatomy breakdown in complex motion

Solution: Avoid complex motions like flips/gymnastics. Use shorter clips and simpler movements.

Problem: Quality degradation at long durations

Solution: Keep clips under 20 seconds. For longer content, generate multiple clips and stitch them together, as sketched below.
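
One way to do the stitching is ffmpeg's concat demuxer, driven from Python. This is a sketch assuming ffmpeg is on your PATH and all clips share the same codec, resolution, and frame rate; the file names are placeholders:

```python
import subprocess
from pathlib import Path

clips = ["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"]  # placeholder names

# The concat demuxer reads its inputs from a text file, one per line.
list_file = Path("clips.txt")
list_file.write_text("".join(f"file '{c}'\n" for c in clips))

# -c copy concatenates without re-encoding, so it only works when all
# clips were generated with identical codec/resolution/frame-rate settings.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0",
     "-i", str(list_file), "-c", "copy", "stitched.mp4"],
    check=True,
)
```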

šŸ”„ Workflows & Tips

Vid2vid with partial latent masking

Use latent masking to preserve specific regions while regenerating others. Useful for fixing artifacts or changing specific elements.

Use fp8 instead of GGUFs

fp8 models generally produce better quality than GGUF quantized versions when VRAM allows.

Temporal upscaler for smooth motion

Use the built-in temporal latent upscaler to effectively double the frame rate and reduce deformations.

— harelcain

Distilled LoRA for iteration speed

Use ltx-2-19b-distilled-lora-384.safetensors at 0.6 weight for 8-step generation when iterating on prompts.

— Ada

āš–ļø Model Comparisons

LTX Video 2 vs Wan 2.2

| Aspect | LTX Video 2 | Wan 2.2 |
|---|---|---|
| Speed | Faster (near real-time with distilled) | Slower |
| Audio | Built-in, spatially aware | No native audio |
| Dynamic motion | Much better in the same duration | More conservative |
| Fidelity | Higher | Good |
| Control options | Limited currently | VACE, Fun models, more mature |

Community consensus: "LTX2 using full pagefile is still faster than wan2.2, has higher fidelity and better audio" — boop

šŸŽ“ Training & LoRAs

Training challenges: The distilled model makes LoRA training difficult. Most community training efforts target the base model.

Available LoRAs

  • Distilled LoRA: ltx-2-19b-distilled-lora-384.safetensors - 8-step generation
  • Static Camera LoRA: Reduces camera movement (but can be overridden by prompts)

Training Tips

  • Target the base model, not distilled
  • Use diverse training data with multiple angles/lighting
  • Short clips (5-10 seconds) work best for training data

šŸ”— Resources

Official Links

ComfyUI Integration

Community Resources