Overview
Video content is no longer a luxury—it’s a necessity. From dynamic marketing campaigns and immersive educational materials to personalized entertainment and creative prototyping, the demand for high-quality, controllable video generation is exploding. Yet most existing solutions fall short: they are closed-source black boxes, demand prohibitively expensive hardware, or lack the flexibility needed for real-world applications.
Enter Wan, an open and advanced suite of large-scale video foundation models that directly addresses these pain points. Built on the diffusion transformer (DiT) architecture and powered by innovations in video VAEs, scalable training, and multilingual prompt understanding, Wan delivers state-of-the-art performance while remaining accessible to individual creators, researchers, and small teams.
Unlike proprietary alternatives, Wan is fully open-source—code, model weights, inference pipelines, and evaluation tools are all publicly available. This transparency enables customization, auditing, and seamless integration without vendor lock-in. Whether you’re generating videos from text, animating still images, or editing existing footage, Wan provides a powerful, efficient, and community-supported foundation.

Why Wan Stands Out
State-of-the-Art Performance—Open and Verifiable
Wan’s 14B-parameter model consistently outperforms both open-source competitors and leading commercial systems across multiple benchmarks. Trained on billions of images and videos, it demonstrates clear scaling laws: larger models and richer data lead to higher fidelity, better motion coherence, and more detailed scene rendering.
But what truly sets Wan apart is that this performance isn’t hidden behind a paywall or API rate limit. Every metric, every model, and every line of code is open for inspection and improvement—making it a rare example of SOTA-quality generation that’s also fully reproducible.
Runs on Consumer-Grade GPUs
High performance usually demands high cost. Not with Wan.
The 1.3B-parameter model requires only 8.19 GB of VRAM, meaning it runs smoothly on widely available consumer GPUs like the NVIDIA RTX 4090. On such hardware, it can generate a 5-second 480P video in about 4 minutes—without quantization or other aggressive optimizations. This democratizes access to professional-grade video synthesis for indie creators, students, and small studios who can’t afford multi-thousand-dollar cloud bills or datacenter-scale clusters.
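For Python users, the same 1.3B checkpoint can also be driven through Diffusers (covered in more detail later in this article). Below is a minimal sketch, assuming a recent diffusers release that ships WanPipeline and AutoencoderKLWan and the Diffusers-format repository ID published on the Hugging Face hub; settings such as the guidance scale are assumptions, so check the official model card:

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"  # assumed Diffusers-format repo ID from the hub

# The Wan VAE is typically kept in float32 for numerical stability; the transformer runs in bfloat16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Offload idle submodules to CPU so peak VRAM stays within a consumer card's budget.
pipe.enable_model_cpu_offload()

frames = pipe(
    prompt="A panda chef cooking dumplings in a bamboo forest.",
    height=480,
    width=832,
    num_frames=81,       # roughly 5 seconds at 16 fps
    guidance_scale=5.0,  # assumed value; tune per the model card
).frames[0]
export_to_video(frames, "panda_chef.mp4", fps=16)

With model offloading enabled, the run trades some speed for memory, which is usually the right trade on an 8–24 GB card.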
One Model, Many Tasks
Wan isn’t a single-purpose tool—it’s a comprehensive video generation platform. The suite supports:
- Text-to-Video (T2V)
- Image-to-Video (I2V)
- First-Last-Frame-to-Video (FLF2V)
- Video Editing (via VACE)
- Text-to-Image (T2I)
- Video-to-Audio (experimental)
- Multilingual visual text generation (Chinese & English)
This versatility means teams can use a single model family across multiple workflows—reducing engineering overhead and ensuring consistency in visual style and quality.
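As an illustration of that single-family workflow, the sketch below swaps the text-to-video pipeline for its image-to-video sibling while keeping the surrounding code essentially identical. It assumes a recent diffusers release with WanImageToVideoPipeline and the Diffusers-format I2V repository ID from the Hugging Face hub; treat the file names and settings as placeholders:

import torch
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"  # assumed Diffusers-format repo ID

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = load_image("still_frame.png")  # placeholder path: the still image to animate
frames = pipe(
    image=image,
    prompt="The camera slowly pushes in while leaves drift through the frame.",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "animated_still.mp4", fps=16)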
Legible, Multilingual Text in Videos
Most video diffusion models fail when asked to render readable text. Wan changes that. It’s the first open video model capable of generating clear, contextually appropriate Chinese and English text—a crucial feature for educational content, localized advertising, subtitles, and UI demonstrations. This capability stems from its multilingual T5 text encoder and carefully curated training data that includes text-rich video scenes.
Full Openness with Production-Ready Tooling
Wan is licensed under Apache 2.0, granting users full freedom to use, modify, and distribute the models and code. More importantly, the project ships with production-grade infrastructure:
- Diffusers integration for easy adoption in existing PyTorch pipelines
- ComfyUI support for no-code/low-code creative workflows
- Gradio demos for rapid local prototyping
- Multi-GPU inference via FSDP and xDiT for scalable deployment
- Pre-trained checkpoints for 1.3B and 14B variants at 480P and 720P
This ecosystem lowers the barrier to entry while supporting advanced use cases.
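As an example of working with the pre-trained checkpoints programmatically, the snippet below pulls the 1.3B text-to-video weights with huggingface_hub, mirroring the huggingface-cli command shown in the quick-start section later in this article; the repository ID is the same one used there:

from huggingface_hub import snapshot_download

# Download the 1.3B text-to-video checkpoint into a local folder
# (equivalent to the huggingface-cli command in the quick-start section).
local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-1.3B",
    local_dir="./Wan2.1-T2V-1.3B",
)
print(f"Checkpoint files available under {local_dir}")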
Real-World Applications
Wan isn’t just a research artifact—it’s already powering real projects:
- EchoShot uses Wan2.1-T2V-1.3B to generate multi-shot portrait videos featuring the same character across scenes.
- MagicTryOn builds virtual try-on experiences using Wan2.1-14B-I2V, preserving garment details during motion.
- AniCrafter and UniAnimate-DiT animate human figures with 3D avatar control, enabling expressive pose-guided video synthesis.
- ATI leverages Wan for trajectory-based motion control, unifying object, camera, and local movements in a single framework.
These community projects prove Wan’s extensibility and readiness for industrial applications.
Getting Started Is Simple
You can run your first generation in minutes:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
pip install -r requirements.txt
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
Then generate a video with a single command:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B \
  --offload_model True --t5_cpu \
  --prompt "A panda chef cooking dumplings in a bamboo forest."
For higher quality, enable prompt extension using a local Qwen model or Dashscope API—Wan automatically enriches sparse prompts into detailed scene descriptions, significantly improving output fidelity.
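To make the idea concrete, here is a hedged sketch of prompt enrichment with a local instruction-tuned Qwen model via transformers. This illustrates the technique only; it is not Wan's built-in extension path (which is configured through generate.py options or the Dashscope API), and the model choice and system prompt are assumptions:

from transformers import pipeline

# Illustration of prompt extension: ask a local LLM to expand a terse prompt into
# a detailed scene description before handing it to the video model.
extender = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed model choice; any capable instruct model works
    torch_dtype="auto",
    device_map="auto",
)

system_msg = (
    "Rewrite the user's video prompt as one detailed scene description covering "
    "subject, action, setting, lighting, camera movement, and visual style."
)
sparse_prompt = "A panda chef cooking dumplings in a bamboo forest."
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": sparse_prompt},
]
extended_prompt = extender(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]
print(extended_prompt)  # pass this richer prompt to generate.py or the Diffusers pipeline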
Practical Considerations
While Wan is powerful, users should be aware of a few nuances:
- The 1.3B model is optimized for 480P; 720P generation is possible but less stable due to limited high-res training data.
- First-Last-Frame-to-Video tasks perform best with Chinese prompts, as the training data for this mode is predominantly Chinese text-video pairs.
- The 14B model benefits from multi-GPU inference (using FSDP + xDiT) to achieve reasonable generation speeds.
- Prompt extension is highly recommended—raw prompts often yield inferior results compared to extended ones.
These points are less hard limitations than practical guidance for getting the best results from the resources you have.
Integration Into Your Workflow
Wan fits naturally into modern AI stacks:
- Use the Diffusers API if you’re already building with Hugging Face libraries.
- Plug into ComfyUI for drag-and-drop, low-code video creation, and use DiffSynth-Studio for LoRA support, FP8 quantization, and VRAM optimization.
- Accelerate inference roughly 2× with TeaCache, a training-free caching method compatible with Wan's diffusion transformer.
- Fine-tune or adapt models using standard PyTorch tooling—no special frameworks required.
Because everything is open, you retain full control over your pipeline—no surprise API deprecations, usage caps, or opaque updates.
A Thriving Community
Wan’s openness has already sparked a vibrant ecosystem. Beyond the official repo, community members are building:
- Animation frameworks
- Virtual try-on systems
- Multi-subject reference video generators
- Motion control extensions
This organic growth signals long-term viability and ensures continuous improvement beyond the core team’s efforts.
Summary
If you need a video generation solution that combines cutting-edge quality, hardware accessibility, task versatility, and true openness, Wan stands as one of the strongest choices available today. It removes the traditional trade-offs between performance and cost, between capability and control.
For creators, researchers, and developers tired of black-box commercial APIs or unstable open models, Wan offers a transparent, efficient, and future-proof alternative—ready to run on your desktop and scale to your ambitions.