EliGen: Achieve Precise Entity-Level Control in AI Image Generation Without Retraining Models

Paper & Code
EliGen: Entity-Level Controlled Image Generation with Regional Attention (2025) · modelscope/DiffSynth-Studio

Text-to-image diffusion models have revolutionized creative workflows, but they still struggle with a fundamental limitation: global prompts alone often fail to deliver precise control over individual objects in a generated image. Want a “red sports car on the left” and a “golden retriever sitting on the right”? Standard models might mix up positions, ignore attributes, or inconsistently render entities—especially when multiple objects are involved.

Enter EliGen: a lightweight, open-source framework that enables fine-grained, entity-level control over image generation without requiring model retraining or complex pipeline redesigns. Built on top of diffusion transformers, EliGen introduces regional attention—a parameter-free mechanism that integrates spatial masks with entity-specific prompts to guide where and how each object appears. This makes it uniquely suited for applications demanding pixel-accurate layout control, from e-commerce visuals to concept art.

Developed as part of the DiffSynth-Studio ecosystem, EliGen is not just a research prototype—it’s a practical tool with pretrained models, ready-to-run examples, and seamless compatibility with popular community extensions like IP-Adapter and In-Context LoRA.

What Problem Does EliGen Solve?

Traditional text-to-image models interpret prompts holistically. While phrases like “a cat wearing sunglasses” work well in isolation, adding spatial or relational constraints—such as “a cat on the left wearing sunglasses, and a coffee cup on the right”—often leads to inconsistent or inaccurate outputs. The model lacks explicit grounding of text phrases to image regions.

EliGen directly addresses this by decoupling global scene description from local entity control. Instead of relying on a single, ambiguous prompt, users define:

  • Spatial masks: binary or soft masks indicating where each entity should appear.
  • Entity prompts: concise text descriptions tied to each masked region (e.g., “vintage red convertible”, “steaming ceramic mug”).

This combination empowers creators to enforce layout integrity while preserving the high visual quality expected from modern diffusion models.
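
To make this input contract concrete, here is a small illustrative sketch of how an entity list could be prepared in Python with Pillow. The `rect_mask` helper and the `entities` structure are hypothetical names used only for this example, not part of any library API.

```python
from PIL import Image, ImageDraw

def rect_mask(size, box):
    """Hypothetical helper: a white rectangle on black marks where an entity should appear."""
    mask = Image.new("L", size, 0)                  # 0 (black) = outside the entity region
    ImageDraw.Draw(mask).rectangle(box, fill=255)   # 255 (white) = inside the entity region
    return mask

size = (1024, 1024)
entities = [
    {"prompt": "vintage red convertible", "mask": rect_mask(size, (40, 480, 620, 960))},
    {"prompt": "steaming ceramic mug",    "mask": rect_mask(size, (700, 640, 980, 920))},
]
```

In practice the masks can come from any source (hand-drawn, a layout tool, or a segmentation model); the only requirement is one mask per entity prompt, aligned to the output resolution.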

Key Technical Innovations

Regional Attention: Precision Without Extra Parameters

At the heart of EliGen is regional attention, a novel attention mechanism designed for diffusion transformers. Unlike prior methods that inject spatial information via additional cross-attention layers or ControlNets, regional attention operates within the existing transformer architecture—requiring zero extra parameters.

It works by modulating attention weights based on user-provided masks, effectively restricting each entity’s influence to its designated region during denoising. This ensures that the “cat” prompt only affects pixels inside the cat mask, preventing semantic leakage or object drift.
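
A simplified PyTorch sketch of the idea follows. It is not the paper's exact formulation: the sequence layout and the `entity_spans`/`region_masks` inputs are assumptions made for illustration. The key point is that only the attention mask changes, so no new parameters are introduced.

```python
import torch
import torch.nn.functional as F

def regional_attention(q, k, v, entity_spans, region_masks, num_image_tokens):
    """Joint attention with per-entity regional masking (simplified sketch).

    q, k, v          : (batch, heads, seq_len, head_dim); the joint sequence is assumed
                       to be laid out as [image tokens | text tokens]
    entity_spans     : (start, end) offsets of each entity prompt within the text tokens
    region_masks     : bool tensors of shape (num_image_tokens,), True for image tokens
                       that fall inside the corresponding entity's spatial mask
    num_image_tokens : number of image tokens at the front of the sequence
    """
    seq_len = q.shape[2]
    allow = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device)
    for (start, end), region in zip(entity_spans, region_masks):
        rows = slice(num_image_tokens + start, num_image_tokens + end)
        # Entity prompt tokens may only attend to image tokens inside their region...
        allow[rows, :num_image_tokens] = region
        # ...and only those image tokens may attend back to the entity prompt tokens.
        allow[:num_image_tokens, rows] = region.unsqueeze(1)
    # Boolean attn_mask: True means "may attend"; it broadcasts over batch and heads.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=allow)
```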

High-Quality Entity-Annotated Training Data

EliGen’s effectiveness is further amplified by its dedicated training dataset—EliGenTrainSet—which includes fine-grained annotations for both semantic entities and their exact spatial boundaries. Later versions, such as Qwen-Image-EliGen-V2, leverage the Qwen-Image-Self-Generated-Dataset, ensuring alignment with the base model’s native style and distribution.

This data-centric approach enables robust generalization across diverse entity types and compositions; in the paper's evaluations, EliGen outperforms existing layout-control methods in both spatial precision and visual fidelity.

Inpainting Fusion for Multi-Entity Editing

Beyond generation, EliGen includes an inpainting fusion pipeline that extends its capabilities to image editing. Users can replace or refine multiple entities simultaneously by masking target regions and providing new prompts—ideal for iterative design or A/B testing of product placements in mockups.
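
One common way to implement this kind of mask-preserving edit is latent blending; the sketch below is a conceptual illustration under that assumption, not EliGen's actual fusion code.

```python
import torch

def blend_step(pred_latents, source_latents, union_mask, sigma_t):
    """Conceptual latent blending for a mask-preserving edit (illustrative only).

    pred_latents   : latents predicted by the entity-controlled model at this step
    source_latents : clean latents of the original image (VAE-encoded)
    union_mask     : float tensor in [0, 1], 1 inside any edited entity region
    sigma_t        : noise level of the current step, used to re-noise the source
    """
    noised_source = source_latents + sigma_t * torch.randn_like(source_latents)
    # Keep the newly generated content inside the entity masks and restore the
    # (re-noised) original image everywhere else.
    return union_mask * pred_latents + (1.0 - union_mask) * noised_source
```

Applied at every denoising step, this keeps untouched regions faithful to the source image while the masked entities are regenerated from their new prompts.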

Ecosystem Compatibility

EliGen is designed to integrate smoothly with the broader open-source ecosystem. It works out of the box with:

  • IP-Adapter for reference-image-guided generation,
  • In-Context LoRA for reference-conditioned, identity- and style-consistent generation,
  • and multimodal large language models (MLLMs) for dynamic prompt refinement.

This modularity makes EliGen a versatile augmentation rather than a siloed solution.

Practical Use Cases

E-Commerce & Advertising

Imagine designing a promotional poster where specific products must appear in predefined zones—shoes on the bottom left, a handbag on the top right, with accurate colors and styles. EliGen ensures each product renders exactly as described in its assigned region, eliminating manual post-editing.

The Qwen-Image-EliGen-Poster variant, co-developed with Taobao’s Experience Design Team, is explicitly optimized for such scenarios.

Digital Art & Storyboarding

Artists and animators often need precise control over character placement, costume details, or background elements. With EliGen, they can define “hero character in center wearing blue armor” and “dragon coiled in upper-right cloud” as separate entities, achieving compositionally coherent results in a single generation pass.

UI/UX Prototyping

Product designers can generate mockups with labeled UI components (“login button here”, “user avatar there”) to validate layout ideas before engineering implementation—accelerating early-stage ideation.

Getting Started with EliGen

Using EliGen is straightforward:

  1. Prepare spatial masks: Define binary or soft masks for each entity region (e.g., using a simple image editor or segmentation model).
  2. Write entity prompts: Craft short, descriptive prompts for each mask (e.g., “black leather sofa”, “potted monstera plant”).
  3. Run inference: Load an EliGen-enabled pipeline—such as DiffSynth-Studio/Qwen-Image-EliGen or DiffSynth-Studio/Eligen for FLUX.1—and pass the masks and prompts as inputs.

The DiffSynth-Studio GitHub repository provides ready-to-use examples in /examples/EntityControl/ and /examples/qwen_image/, including low-VRAM inference configurations for consumer GPUs.
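
A minimal end-to-end sketch, reusing the `entities` list from the earlier example, is shown below. The class names and keyword arguments (`ModelManager`, `FluxImagePipeline`, `eligen_entity_prompts`, `eligen_entity_masks`) are assumptions based on the repository's entity-control examples and may differ between versions; treat the scripts in /examples/EntityControl/ as the source of truth.

```python
# Assumed DiffSynth-Studio API; exact class and argument names may differ by version.
import torch
from diffsynth import ModelManager, FluxImagePipeline

flux_model_paths = ["path/to/flux1-dev.safetensors"]   # placeholder: local FLUX.1-dev weights
eligen_lora_path = "path/to/eligen.safetensors"        # placeholder: EliGen weights (DiffSynth-Studio/Eligen)

model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
model_manager.load_models(flux_model_paths)
model_manager.load_lora(eligen_lora_path)
pipe = FluxImagePipeline.from_model_manager(model_manager)

image = pipe(
    prompt="a sunlit street-side cafe, photorealistic",       # global scene description
    eligen_entity_prompts=[e["prompt"] for e in entities],    # per-entity text (assumed argument name)
    eligen_entity_masks=[e["mask"] for e in entities],        # matching spatial masks (assumed argument name)
    height=1024, width=1024, seed=0,
)
image.save("eligen_output.png")
```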

Pretrained models are available on ModelScope and Hugging Face, and the framework supports LoRA-based fine-tuning if you wish to adapt EliGen to your own domain data.

Limitations and Considerations

While powerful, EliGen has practical constraints to keep in mind:

  • Manual mask preparation: EliGen does not auto-generate layouts. Users must provide spatial masks, which may require upfront effort (though tools like SAM can assist).
  • Image-only focus: Currently, EliGen targets static image synthesis and is not designed for video or 3D generation.
  • Model dependency: Best results are achieved with models fine-tuned on EliGen’s dataset (e.g., Qwen-Image-EliGen series). Using it with arbitrary base models may yield suboptimal alignment.

That said, its lightweight design and plug-and-play nature make it one of the most accessible solutions for entity-level control available today.

Summary

EliGen bridges a critical gap in AI image generation: the ability to control individual objects with spatial and semantic precision—without retraining models or compromising quality. By combining regional attention, high-quality annotations, and ecosystem-friendly design, it empowers developers, designers, and researchers to build more reliable, controllable, and production-ready generative workflows. Whether you’re crafting e-commerce visuals, digital art, or interactive prototypes, EliGen offers a practical path toward truly compositional image synthesis.