XVerse: Precise Multi-Subject Image Generation with Independent Identity and Attribute Control

Paper: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation (2025) · Code: bytedance/XVerse

Generating realistic images with multiple distinct subjects—each retaining their unique identity and visual attributes like pose, lighting, or clothing style—has long been a major challenge in text-to-image AI. Most existing models suffer from subject blending, identity leakage, or entangled attribute control, making them unreliable for professional or production use.

XVerse directly addresses this problem. Built on Diffusion Transformers (DiTs), it introduces a novel modulation mechanism that injects reference subject information into the text stream without altering the underlying image latents. The result? High-fidelity, editable images where each subject behaves independently and predictably. Whether you’re placing two branded characters in a single advertisement or generating synthetic training data with precise visual controls, XVerse delivers consistency where others fail.

Why Multi-Subject Control Has Been So Hard

Traditional text-to-image models treat the entire prompt as a global instruction. When multiple subjects are described—say, “a red-haired man and a woman in a blue dress”—the model often conflates their features, especially if similar descriptors (e.g., “smiling,” “standing”) are used. Worse, when reference images are introduced for personalization (as in IP-Adapter or LoRA-based methods), the model may overfit, produce artifacts, or lose coherence across subjects.

This limitation severely hampers real-world applications: marketing teams can’t guarantee logo or character consistency; filmmakers can’t prototype scenes with stable protagonists; and computer vision researchers struggle to generate diverse yet controllable multi-person datasets.

XVerse solves these issues through token-specific text-stream modulation. Instead of fusing reference image features into the denoising process (which risks disrupting latent structure), it converts each reference into fine-grained offsets applied only to the text tokens that describe that subject. This decouples identity from semantics and enables per-subject editing without cross-contamination.

Key Technical Innovations

Token-Level Modulation Without Latent Distortion

XVerse’s core insight is that subject control should happen in the conditioning space—not the diffusion space. By encoding reference images into modulation vectors that adjust specific text embeddings (e.g., the embedding for “a golden retriever” in the prompt), the model preserves the integrity of the DiT’s internal latents. This approach avoids the feature corruption and mode collapse commonly seen in cross-attention injection methods.
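To make the idea concrete, here is a minimal conceptual sketch, not the actual XVerse code: the function name, tensor shapes, and the scale/shift parameterization are illustrative assumptions. It shows how per-token offsets derived from a reference image could be applied only to the tokens that describe that subject, leaving every other text token and all image latents untouched.

import torch

def modulate_subject_tokens(text_emb, ref_offsets, token_mask):
    # text_emb:    (seq_len, dim)  text-token embeddings from the prompt encoder
    # ref_offsets: (dim, 2)        per-channel (scale, shift) derived from a reference image
    # token_mask:  (seq_len,)      boolean, True for the tokens describing this subject
    scale, shift = ref_offsets[:, 0], ref_offsets[:, 1]
    out = text_emb.clone()
    out[token_mask] = text_emb[token_mask] * (1 + scale) + shift
    return out

# Toy usage: 8 prompt tokens, 16-dim embeddings, subject tokens at positions 2-4
emb = torch.randn(8, 16)
offsets = torch.zeros(16, 2)                 # stand-in for a learned reference encoder's output
mask = torch.zeros(8, dtype=torch.bool)
mask[2:5] = True
modulated = modulate_subject_tokens(emb, offsets, mask)

Because the offsets touch only the masked token positions, a second subject gets its own mask and its own offsets, which is what keeps identities from bleeding into each other.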

Independent Identity and Semantic Control

Each subject can be governed separately:

  • Identity consistency is maintained via reference image offsets tied to their descriptive tokens.
  • Semantic attributes (pose, style, lighting) can be freely adjusted through the prompt or generation parameters without affecting other subjects.

This separation means you can re-pose one character, change another’s outfit, and adjust global lighting—all in a single coherent output.

Practical Prompt Engineering with Placeholders

XVerse simplifies complex prompting through structured placeholders like ENT1, ENT2, etc. Users provide a reference image with a descriptive caption (e.g., “a scientist with curly hair”), and then use ENT1 in their main prompt:

“ENT1 presenting results to ENT2 in a futuristic lab.”

The system automatically substitutes the full caption at inference time. This not only streamlines prompt creation but also enforces the strict alignment between reference and description that reliable generation requires.
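Conceptually, the substitution is plain string expansion. The helper below is a hypothetical illustration rather than the repo's API; it simply shows why each caption ends up appearing verbatim in the final prompt.

def expand_placeholders(prompt, captions):
    # captions maps placeholder tags to the reference captions supplied by the user
    for tag, caption in captions.items():
        prompt = prompt.replace(tag, caption)
    return prompt

prompt = "ENT1 presenting results to ENT2 in a futuristic lab."
captions = {
    "ENT1": "a scientist with curly hair",
    "ENT2": "a student in a lab coat",   # hypothetical second reference caption
}
print(expand_placeholders(prompt, captions))
# -> "a scientist with curly hair presenting results to a student in a lab coat in a futuristic lab."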

Getting Started: From Demo to Integration

Try It Instantly with the Gradio Interface

The included run_gradio.py script launches an intuitive web UI where you can:

  • Upload 1–3 reference images.
  • Auto-generate or manually write captions.
  • Toggle per-image “ID mode” to activate identity preservation.
  • Adjust control weights (weight_id, weight_ip) and LoRA scales to balance fidelity vs. naturalness.

Crucially, only expanded image panels are processed—giving users clear control over which subjects participate.
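Launching the demo is a single command from the repository root (this assumes the script takes no required arguments; check the repo's README for any available flags):

python run_gradio.py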

Command-Line Inference for Automation

For scripted workflows, inference_single_sample.py supports both single and multi-subject generation:

python inference_single_sample.py \
    --prompt "ENT1 and ENT2 dancing under moonlight" \
    --images "person1.jpg" "person2.jpg" \
    --captions "a dancer in red" "a dancer in silver" \
    --idips true true \
    --save_path "output.png"

The toolchain integrates Florence-2 for auto-captioning, SAM2 for segmentation, and InsightFace for identity encoding—abstracting away complex preprocessing.

Hardware Considerations and Optimization

XVerse is designed for real-world deployment:

  • Standard inference requires a GPU with ≥24GB VRAM.
  • Low-VRAM mode (--use_low_vram) enables 2-subject generation on 24GB cards by offloading modules to CPU.
  • Ultra-low mode (--use_lower_vram) supports 3 subjects on 16GB GPUs (e.g., consumer RTX 4080), albeit with slower speed.
  • Quantized models (bnb-nf4 or GGUF) further reduce memory footprint—enabling up to 4-condition generation on 24GB VRAM when combined with CPU offloading.

Note: Quantization may slightly reduce output quality and often requires retuning of weight_id, weight_ip, and LoRA scales.
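As a sketch, the two-subject command from earlier could be run in low-VRAM mode by appending the flag below. This assumes --use_low_vram is accepted by inference_single_sample.py; the flag name comes from the project, but consult the repo's CLI help for the exact script it applies to.

python inference_single_sample.py \
    --prompt "ENT1 and ENT2 dancing under moonlight" \
    --images "person1.jpg" "person2.jpg" \
    --captions "a dancer in red" "a dancer in silver" \
    --idips true true \
    --save_path "output.png" \
    --use_low_vram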

Also critical: prompt structure is strict. The exact caption text must appear verbatim in the prompt (or via ENT substitution). Omitting it will cause generation failure—a trade-off for precision.

Ideal Use Cases for Technical Teams

  • Branded Content Generation: Maintain consistent character or product appearance across ad variations.
  • Visual Storyboarding: Generate sequential scenes with stable protagonists and editable environments.
  • Synthetic Data for CV: Create multi-person datasets with controlled identities, poses, and occlusions.
  • Virtual Try-On & Fashion: Show multiple models wearing different outfits in the same scene, preserving body proportions and garment details.

In all these scenarios, XVerse eliminates the need for manual inpainting, post-editing, or error-prone prompt engineering.

Summary

XVerse marks a clear step forward in controlled multi-subject image synthesis. By decoupling identity and semantic control through token-level DiT modulation, it delivers a level of fidelity, editability, and compositional flexibility that earlier personalization methods struggle to match. With support for consumer-grade hardware, an accessible Gradio demo, and clear integration pathways, it’s a practical choice for teams seeking reliable, production-ready generation rather than research-grade novelty.

For project and technical decision-makers weighing AI image generation tools, XVerse offers a rare combination: academic rigor, engineering pragmatism, and real-world applicability.