Qwen-Image: Generate and Edit Images with Perfect Text—Even in Chinese

Paper: Qwen-Image Technical Report (2025) · Code: QwenLM/Qwen-Image

If you’ve ever struggled to generate marketing visuals with legible multilingual text—or tried to edit a product image only to end up with a distorted face or garbled logo—you’re not alone. Most open-source text-to-image (T2I) models falter when faced with real-world demands: rendering accurate English and Chinese text, preserving identity during edits, or understanding complex layout instructions.

Enter Qwen-Image, a 20-billion-parameter open-source foundation model from Alibaba’s Qwen series. Designed specifically to solve these pain points, Qwen-Image sets a new bar for two critical capabilities: complex text rendering (including logographic scripts like Chinese) and precise, consistent image editing—all under the permissive Apache 2.0 license.

Unlike generic diffusion models that treat text as an afterthought, Qwen-Image integrates advanced vision-language alignment from the ground up, enabling reliable generation and editing where others fail.

Why Text Accuracy Matters—and How Qwen-Image Delivers

The Text-in-Image Challenge

Traditional T2I models often misrender letters, scramble words, or ignore typographic details—especially with non-Latin scripts. This isn’t just a cosmetic flaw; it’s a dealbreaker for commercial applications like e-commerce banners, multilingual social media posts, or brand-compliant advertising.

Qwen-Image tackles this head-on through a curriculum-based training pipeline:

  • Starts with non-text scenes, then introduces simple labels, and finally scales to paragraph-level descriptions.
  • Uses large-scale, balanced datasets with real and synthesized text in both alphabetic (e.g., English) and logographic (e.g., Chinese) languages.

The result? State-of-the-art performance on benchmarks like T2I-CoreBench, where Qwen-Image outperforms other open-source models and rivals proprietary systems in compositional reasoning and text fidelity.

Real-World Editing Without Breaking Identity

Editing an existing image is notoriously unstable. Change a person’s outfit, and their face might morph. Replace product text, and the logo vanishes.

Qwen-Image-Edit—particularly the Qwen-Image-Edit-2509 variant—addresses this with:

  • A dual-encoding mechanism: The original image is processed separately by Qwen2.5-VL (for semantics) and a VAE encoder (for visual reconstruction). This balance preserves identity while allowing creative changes; a schematic sketch follows this list.
  • Multi-image editing support: Combine 1–3 inputs (e.g., “person + product” or “two people facing each other”) and generate coherent composites.
  • Enhanced consistency for people (facial identity across poses), products (brand logo integrity), and text (editable fonts, colors, and materials).
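
To make the dual-encoding idea concrete, here is a schematic sketch (hypothetical function and variable names, not the model's actual code) of how the two streams coexist:

def dual_encode(image, vl_model, vae_encoder):
    # Semantic stream: high-level meaning (objects, layout, text content),
    # standing in for Qwen2.5-VL features.
    semantic_tokens = vl_model(image)
    # Reconstructive stream: low-level appearance (textures, exact pixels),
    # standing in for the VAE encoder's latents.
    appearance_latents = vae_encoder(image)
    # Both streams condition the diffusion transformer: edits follow the
    # semantics while the latents anchor identity and fine detail.
    return semantic_tokens, appearance_latents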

Practical Applications for Teams and Developers

Qwen-Image isn’t just a research curiosity—it’s built for real projects:

  • Multilingual marketing: Generate posters with accurate English and Chinese slogans without manual touch-ups.
  • E-commerce automation: Upload a plain product photo and generate branded posters with consistent logos and stylized backgrounds.
  • Avatar and meme creation: Edit portraits while preserving facial identity, even under dramatic style shifts.
  • Legacy photo enhancement: Restore old photos while maintaining realistic textures and readable contextual text.
  • Design tool integration: Leverage native ControlNet support (depth, edges, keypoints) for pose-guided editing or sketch-to-image workflows.

Getting Started: Simple Code, Immediate Results

Qwen-Image integrates seamlessly with Hugging Face Diffusers, requiring minimal setup:

Text-to-Image Generation

from diffusers import DiffusionPipeline
import torch

# Load the 20B model in bfloat16 on a CUDA GPU
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

prompt = 'A chalkboard reads "Qwen Coffee $2", beside a neon sign saying "Tong Yi Qian Wen"'
image = pipe(
    prompt=prompt + ", Ultra HD, 4K, cinematic composition.",  # English positive_magic suffix
    width=1664,
    height=928,
    num_inference_steps=50,
    true_cfg_scale=4.0,
).images[0]
image.save("qwen_coffee.png")

Tip: Append the provided positive_magic suffix for English or Chinese prompts to boost quality, as sketched below.
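
For reference, the suffixes come from the repository's example script and look roughly like this (a sketch; check the source for the exact strings):

# Approximate quality-boosting suffixes from the Qwen-Image repo
positive_magic = {
    "en": ", Ultra HD, 4K, cinematic composition.",  # English prompts
    "zh": ", 超清,4K,电影级构图",  # Chinese prompts
}
image = pipe(prompt=prompt + positive_magic["en"], width=1664, height=928,
             num_inference_steps=50, true_cfg_scale=4.0).images[0]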

Image Editing (Single or Multi-Image)

For single-image edits:

import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16).to("cuda")
input_img = Image.open("rabbit.png").convert("RGB")  # the image to edit
output = pipeline(image=input_img, prompt="Make the rabbit purple, add a flashlight background").images[0]

For multi-image compositing (Edit-2509):

import torch
from diffusers import QwenImageEditPlusPipeline

pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16).to("cuda")
# img1, img2: PIL images loaded as in the single-image example
output = pipeline(image=[img1, img2], prompt="Magician bear on left, alchemist bear on right").images[0]

Critical note: Always run instructions through the official prompt-rewriting tools (via polish_edit_prompt) for stable editing results; raw prompts often lead to artifacts.
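
If you want to see the shape of that step, here is a minimal, text-only sketch against DashScope's OpenAI-compatible endpoint; the official polish_edit_prompt helper wraps Qwen-VL-Max with a more elaborate system prompt and also passes the input image:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def rewrite_edit_prompt(raw_prompt: str) -> str:
    # Ask the model to expand a terse instruction into a precise edit prompt.
    resp = client.chat.completions.create(
        model="qwen-vl-max",
        messages=[
            {"role": "system", "content": "Rewrite the user's image-editing instruction into one precise, unambiguous editing prompt."},
            {"role": "user", "content": raw_prompt},
        ],
    )
    return resp.choices[0].message.content

edit_prompt = rewrite_edit_prompt("make the rabbit purple, flashlight background")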

Limitations and Best Practices

While powerful, Qwen-Image has practical boundaries:

  • Prompt sensitivity: Editing quality depends heavily on prompts rewritten via Qwen-VL-Max. Skipping this step risks inconsistent outputs.
  • Dependencies: Requires transformers>=4.51.3 and the latest diffusers from GitHub.
  • Multi-image input: Works best with 1–3 images; more inputs may reduce coherence.
  • Hardware: Full-precision inference needs a modern GPU, but community tools like DiffSynth-Studio enable 4 GB VRAM usage via offloading and FP8 quantization; a minimal offloading sketch follows.
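
Even before reaching for DiffSynth-Studio, diffusers' built-in offloading reduces VRAM pressure; a minimal sketch:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # move each submodule to the GPU only while it runs
# pipe.enable_sequential_cpu_offload()  # slower alternative with even lower peak VRAM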

Ecosystem and Deployment Options

You don’t need to build from scratch. Qwen-Image is already integrated across major platforms:

  • Hugging Face Spaces and Qwen Chat for no-code demos.
  • ModelScope for LoRA training, FP8 quantization, and low-memory inference.
  • ComfyUI, LiblibAI, and WaveSpeedAI for workflow integration.
  • LeMiCa and cache-dit for 3× faster inference without quality loss.

For teams, the Multi-GPU API Server (included in the repo) supports concurrent requests, automatic prompt enhancement, and aspect ratio handling out of the box.

Summary

Qwen-Image solves two of the most persistent challenges in generative AI: faithful multilingual text rendering and identity-preserving image editing. Backed by rigorous training strategies and dual-encoding architecture, it delivers production-ready quality for real-world applications—from global marketing campaigns to personalized design automation. With open weights, Apache 2.0 licensing, and deep ecosystem support, it’s one of the most practical open-source image foundation models available today.

If your project demands accurate text-in-image or reliable visual editing, Qwen-Image is worth your immediate evaluation.