SEED-Voken: Scalable, High-Fidelity Visual Tokenization for Autoregressive Image and Video Generation

SEED-Voken is an open-source toolkit developed by Tencent ARC that delivers state-of-the-art visual tokenizers tailored for autoregressive visual generation. Built on a novel quantization method called Index Backpropagation Quantization (IBQ), SEED-Voken overcomes critical limitations in traditional vector quantization—such as codebook collapse, unstable training, and poor scalability—enabling large, high-dimensional, and consistently utilized codebooks. This makes it uniquely suited for next-generation image and video synthesis models that demand both fidelity and efficiency.

For practitioners working on generative vision systems—from research labs to product teams—SEED-Voken offers pretrained models, training scripts, and hardware flexibility (supporting both NVIDIA GPUs and Huawei NPUs), lowering the barrier to high-quality visual tokenization without sacrificing performance or scalability.

The Problem with Traditional Visual Tokenization

Visual tokenization—the process of converting continuous image or video features into discrete tokens—is foundational for autoregressive models like those used in MAGVIT, LlamaGen, or Show-o. Most approaches rely on vector quantization (VQ), where a learned codebook maps features to discrete indices.

However, standard VQ methods suffer from a well-known instability: during training, many codebook entries become underused or completely unused (“dead codes”), while the distribution of active codes diverges from the evolving visual features. This “codebook collapse” limits scalability—you can’t reliably increase codebook size or feature dimension without degrading reconstruction quality or training stability. As a result, prior tokenizers have been constrained to modest codebook sizes (e.g., 4K–16K codes) and lower dimensions, capping their capacity for detailed, high-resolution generation.
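
To make the failure mode concrete, a quick diagnostic is to measure what fraction of the codebook a tokenizer actually uses. The snippet below is a minimal, illustrative check in PyTorch (not part of the SEED-Voken codebase): it counts how many entries are hit at least once by a batch of token indices.

```python
import torch

def codebook_usage(indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries hit at least once by a batch of token indices.

    A value well below 1.0 on a large sample indicates dead codes, i.e. the
    codebook collapse described above.
    """
    used = torch.bincount(indices.flatten(), minlength=codebook_size) > 0
    return used.float().mean().item()

# Toy example: a 16,384-entry codebook where the encoder only ever selects
# codes from a narrow range -- most entries are dead.
indices = torch.randint(0, 1024, (64, 16, 16))  # token grids for 64 images
print(f"utilization: {codebook_usage(indices, 16_384):.1%}")  # roughly 6%
```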

Introducing Index Backpropagation Quantization (IBQ)

SEED-Voken’s core innovation is IBQ, a new vector quantization technique that jointly optimizes the visual encoder and the entire codebook in a fully differentiable manner. Instead of relying on non-differentiable nearest-neighbor lookups, IBQ applies a straight-through estimator to a one-hot categorical distribution derived from the similarity between encoded features and all codebook vectors.

This design ensures:

  • All codebook entries remain differentiable, allowing gradient flow to every embedding during backpropagation.
  • Consistent latent alignment between the encoder’s output space and the codebook, preventing distribution drift.
  • High utilization even at massive scales—SEED-Voken achieves a codebook of 262,144 (2¹⁸) entries with 256-dimensional embeddings, a combination previously unattainable with stable training.
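
To illustrate the mechanism described above, here is a minimal, self-contained sketch of an IBQ-style quantization step (illustrative only, not the repository's implementation): similarities against every codebook entry, a hard one-hot for the forward pass, and a straight-through estimator so that gradients reach every code vector rather than only the nearest neighbor.

```python
import torch
import torch.nn.functional as F

def ibq_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Illustrative IBQ-style quantization step (not the repository code).

    z:        encoder features, shape (N, D)
    codebook: embedding table,  shape (K, D)
    """
    logits = z @ codebook.t()                        # (N, K) similarity to every code
    soft = F.softmax(logits, dim=-1)                 # differentiable over all codes
    hard = F.one_hot(logits.argmax(-1), codebook.shape[0]).to(z.dtype)
    one_hot = hard + soft - soft.detach()            # straight-through estimator
    z_q = one_hot @ codebook                         # (N, D) quantized features
    return z_q, logits.argmax(-1)

# Small shapes for demonstration; SEED-Voken scales this to 262,144 codes x 256 dims.
codebook = torch.nn.Parameter(torch.randn(1024, 256))
z = torch.randn(8, 256, requires_grad=True)
z_q, idx = ibq_quantize(z, codebook)
z_q.sum().backward()
print((codebook.grad.abs().sum(dim=1) > 0).float().mean())  # ~1.0: every code receives gradient
```

Because the soft distribution depends on every code vector, all rows of the codebook receive gradient signal on every step, which is what keeps utilization high as the codebook grows.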

Experiments on ImageNet show that IBQ delivers superior reconstruction fidelity and enables high-quality autoregressive image generation, rivaling or outperforming existing tokenizers like VQGAN, TiTok, and OmniTokenizer.

Practical Capabilities: Beyond Theory

SEED-Voken isn’t just a research prototype—it’s a production-ready toolkit with two main components:

IBQ for Scalable Image Tokenization

The IBQ tokenizer supports ultra-large codebooks and is ideal for high-fidelity image generation. Pretrained checkpoints are available for codebook sizes of 16,384 and 262,144, trained on large-scale datasets like LAION and CC12M. These models achieve state-of-the-art performance in both reconstruction (e.g., low rFID) and downstream autoregressive generation quality.
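
For reference, rFID is the Fréchet Inception Distance computed between original images and their tokenizer reconstructions. The sketch below shows one way to estimate it with torchmetrics; the tokenizer's encode/decode method names here are assumptions about a generic VQ-style interface, not necessarily the repository's exact API.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

@torch.no_grad()
def reconstruction_fid(tokenizer, dataloader, device="cuda") -> float:
    """Sketch of rFID: FID between originals and their tokenizer reconstructions.

    `tokenizer.encode` / `tokenizer.decode` are assumed method names, not
    necessarily the repository's exact API.
    """
    fid = FrechetInceptionDistance(feature=2048, normalize=True).to(device)
    for images, _ in dataloader:                     # images in [0, 1], shape (B, 3, H, W)
        images = images.to(device)
        tokens = tokenizer.encode(images)            # discrete token indices
        recons = tokenizer.decode(tokens).clamp(0, 1)
        fid.update(images, real=True)
        fid.update(recons, real=False)
    return fid.compute().item()
```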

Open-MAGVIT2 for Image and Video

Complementing IBQ, SEED-Voken includes Open-MAGVIT2, an open implementation that supports both image and video tokenization. The video tokenizer has demonstrated state-of-the-art results on video benchmarks against baselines such as LARP, SweetTokenizer, and OmniTokenizer. This makes SEED-Voken one of the few toolkits that natively handle both modalities under a unified architecture.

Getting Started: Flexible and Accessible

SEED-Voken is designed for real-world adoption:

  • Hardware Support: Fully compatible with NVIDIA GPUs (e.g., V100) and Huawei Ascend 910B NPUs. Training and inference performance are nearly identical across platforms.
  • Pretrained Models: Ready-to-use tokenizers for text-to-image generation (16K and 262K codebooks) are publicly available, eliminating the need for costly pretraining unless custom domains are required.
  • Dataset Compatibility: Supports standard formats:
    • Image: ImageNet-2012
    • Video: UCF-101 (compatible with VideoGPT preprocessing)
    • Text-to-Image: WebDataset-formatted tar files (e.g., LAION, CC12M); see the loading sketch after this list
  • Framework: Built on PyTorch and PyTorch Lightning, with clear training and evaluation scripts provided in the repository.
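
For the text-to-image path, a typical WebDataset pipeline looks like the following sketch. The shard pattern and field names ("jpg", "txt") are assumptions about a LAION/CC12M-style layout, not the repository's exact configuration.

```python
import webdataset as wds
from torchvision import transforms

# Shard pattern and field names are illustrative assumptions.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

dataset = (
    wds.WebDataset("data/laion-{00000..00099}.tar")
    .decode("pil")                           # decode images with PIL
    .to_tuple("jpg", "txt")                  # yield (image, caption) pairs
    .map_tuple(preprocess, lambda cap: cap)  # tensorize images, keep captions as text
)

loader = wds.WebLoader(dataset, batch_size=32, num_workers=4)
for images, captions in loader:
    # images: (32, 3, 256, 256) float tensor; captions: list of strings
    break
```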

Integration is straightforward: load a pretrained tokenizer, encode your images or videos into discrete tokens, and feed them into your autoregressive model—no modifications needed.
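
As a concrete (and deliberately generic) illustration of that workflow, the helper below flattens a tokenizer's output grid into the input-id sequence an autoregressive transformer expects. The `encode` method name is an assumption about a typical VQ-style interface rather than SEED-Voken's exact API; consult the repository's scripts for the real entry points.

```python
import torch

def images_to_ar_tokens(tokenizer: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Encode a batch of images into a flat token sequence for an AR model.

    `tokenizer` is any pretrained visual tokenizer exposing an `encode` method
    that returns a grid of code indices; the method name is an assumption, not
    necessarily SEED-Voken's exact API.
    """
    with torch.no_grad():
        tokens = tokenizer.encode(images)  # e.g. (B, 16, 16) grid of code indices
    return tokens.flatten(1)               # (B, 256) sequence of input ids

# Generated token grids go back through the tokenizer's decoder to produce pixels.
```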

Limitations and Considerations

While SEED-Voken excels in autoregressive visual generation, it’s important to note its scope:

  • Task-Specific: Optimized for generative modeling, not general vision tasks like classification or detection.
  • Compute Requirements: Full pretraining from scratch demands large datasets and significant compute (though inference with pretrained models is lightweight).
  • Environment Constraints: NPU training requires specific CANN and PyTorch-NPU versions (e.g., CANN 8.0.T13 for images), which may require environment setup adjustments.

However, for teams focused on scalable, high-quality visual synthesis—especially those exploring large codebooks or multimodal architectures—these trade-offs are well worth it.

Summary

SEED-Voken solves a foundational bottleneck in autoregressive visual generation: scalable, stable, and high-fidelity tokenization. With its IBQ method, it enables codebooks an order of magnitude larger than previous approaches while maintaining high utilization and reconstruction quality. Combined with Open-MAGVIT2’s support for both images and videos, and its compatibility with mainstream hardware, SEED-Voken is a compelling choice for researchers and engineers building the next generation of generative vision systems. Whether you’re fine-tuning a text-to-image model or developing a video synthesis pipeline, SEED-Voken provides the tokenization backbone you need—open, efficient, and future-ready.