Qwen-Image: Generate and Edit Images with Perfect Text—Even in Chinese

Paper: Qwen-Image Technical Report (2025) · Code: QwenLM/Qwen-Image

If you’ve ever struggled to generate marketing visuals with legible multilingual text—or tried to edit a product image only to end up with a distorted face or garbled logo—you’re not alone. Most open-source text-to-image (T2I) models falter when faced with real-world demands: rendering accurate English and Chinese text, preserving identity during edits, or understanding complex layout instructions.

Enter Qwen-Image, a 20-billion-parameter open-source foundation model from Alibaba’s Qwen series. Designed specifically to solve these pain points, Qwen-Image sets a new bar for two critical capabilities: complex text rendering (including logographic scripts like Chinese) and precise, consistent image editing—all under the permissive Apache 2.0 license.

Unlike generic diffusion models that treat text as an afterthought, Qwen-Image integrates advanced vision-language alignment from the ground up, enabling reliable generation and editing where others fail.

Why Text Accuracy Matters—and How Qwen-Image Delivers

The Text-in-Image Challenge

Traditional T2I models often misrender letters, scramble words, or ignore typographic details—especially with non-Latin scripts. This isn’t just a cosmetic flaw; it’s a dealbreaker for commercial applications like e-commerce banners, multilingual social media posts, or brand-compliant advertising.

Qwen-Image tackles this head-on through a curriculum-based training pipeline:

  • Starts with non-text scenes, then introduces simple labels, and finally scales to paragraph-level descriptions.
  • Uses large-scale, balanced datasets with real and synthesized text in both alphabetic (e.g., English) and logographic (e.g., Chinese) languages.

The result? State-of-the-art performance on benchmarks like T2I-CoreBench, where Qwen-Image outperforms other open-source models and rivals proprietary systems in compositional reasoning and text fidelity.

Real-World Editing Without Breaking Identity

Editing an existing image is notoriously unstable. Change a person’s outfit, and their face might morph. Replace product text, and the logo vanishes.

Qwen-Image-Edit—particularly the Qwen-Image-Edit-2509 variant—addresses this with:

  • A dual-encoding mechanism: The original image is processed separately by Qwen2.5-VL (for semantics) and a VAE encoder (for visual reconstruction). This balance preserves identity while allowing creative changes; a schematic sketch follows this list.
  • Multi-image editing support: Combine 1–3 inputs (e.g., “person + product” or “two people facing each other”) and generate coherent composites.
  • Enhanced consistency for people (facial identity across poses), products (brand logo integrity), and text (editable fonts, colors, and materials).
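
To make the dual-encoding idea concrete, here is a schematic sketch (hypothetical function and variable names, not the model's actual code) of how the two streams coexist:

def dual_encode(image, vl_model, vae_encoder):
    # Semantic stream: high-level meaning (objects, layout, text content),
    # standing in for Qwen2.5-VL features.
    semantic_tokens = vl_model(image)
    # Reconstructive stream: low-level appearance (textures, exact pixels),
    # standing in for the VAE encoder's latents.
    appearance_latents = vae_encoder(image)
    # Both streams condition the diffusion transformer: edits follow the
    # semantics while the latents anchor identity and fine detail.
    return semantic_tokens, appearance_latents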

Practical Applications for Teams and Developers

Qwen-Image isn’t just a research curiosity—it’s built for real projects:

  • Multilingual marketing: Generate posters with accurate English and Chinese slogans without manual touch-ups.
  • E-commerce automation: Upload a plain product photo and generate branded posters with consistent logos and stylized backgrounds.
  • Avatar and meme creation: Edit portraits while preserving facial identity, even under dramatic style shifts.
  • Legacy photo enhancement: Restore old photos while maintaining realistic textures and readable contextual text.
  • Design tool integration: Leverage native ControlNet support (depth, edges, keypoints) for pose-guided editing or sketch-to-image workflows.

Getting Started: Simple Code, Immediate Results

Qwen-Image integrates seamlessly with Hugging Face Diffusers, requiring minimal setup:

Text-to-Image Generation

from diffusers import DiffusionPipeline
import torch

# Load the 20B model in bfloat16 on a CUDA GPU
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")

prompt = 'A chalkboard reads "Qwen Coffee $2", beside a neon sign saying "Tong Yi Qian Wen"'
image = pipe(
    prompt=prompt + ", Ultra HD, 4K, cinematic composition.",  # English positive_magic suffix
    width=1664,
    height=928,
    num_inference_steps=50,
    true_cfg_scale=4.0,
).images[0]
image.save("qwen_coffee.png")

Tip: Append the provided positive_magic suffix for English or Chinese prompts to boost quality, as sketched below.
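
For reference, the suffixes come from the repository's example script and look roughly like this (a sketch; check the source for the exact strings):

# Approximate quality-boosting suffixes from the Qwen-Image repo
positive_magic = {
    "en": ", Ultra HD, 4K, cinematic composition.",  # English prompts
    "zh": ", 超清,4K,电影级构图",  # Chinese prompts
}
image = pipe(prompt=prompt + positive_magic["en"], width=1664, height=928,
             num_inference_steps=50, true_cfg_scale=4.0).images[0]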

Image Editing (Single or Multi-Image)

For single-image edits:

import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16).to("cuda")
input_img = Image.open("rabbit.png").convert("RGB")  # the image to edit
output = pipeline(image=input_img, prompt="Make the rabbit purple, add a flashlight background").images[0]

For multi-image compositing (Edit-2509):

import torch
from diffusers import QwenImageEditPlusPipeline

pipeline = QwenImageEditPlusPipeline.from_pretrained("Qwen/Qwen-Image-Edit-2509", torch_dtype=torch.bfloat16).to("cuda")
# img1, img2: PIL images loaded as in the single-image example
output = pipeline(image=[img1, img2], prompt="Magician bear on left, alchemist bear on right").images[0]

Critical note: Always run instructions through the official prompt-rewriting tools (via polish_edit_prompt) for stable editing results; raw prompts often lead to artifacts.
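
If you want to see the shape of that step, here is a minimal, text-only sketch against DashScope's OpenAI-compatible endpoint; the official polish_edit_prompt helper wraps Qwen-VL-Max with a more elaborate system prompt and also passes the input image:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def rewrite_edit_prompt(raw_prompt: str) -> str:
    # Ask the model to expand a terse instruction into a precise edit prompt.
    resp = client.chat.completions.create(
        model="qwen-vl-max",
        messages=[
            {"role": "system", "content": "Rewrite the user's image-editing instruction into one precise, unambiguous editing prompt."},
            {"role": "user", "content": raw_prompt},
        ],
    )
    return resp.choices[0].message.content

edit_prompt = rewrite_edit_prompt("make the rabbit purple, flashlight background")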

Limitations and Best Practices

While powerful, Qwen-Image has practical boundaries:

  • Prompt sensitivity: Editing quality depends heavily on prompts rewritten via Qwen-VL-Max. Skipping this step risks inconsistent outputs.
  • Dependencies: Requires transformers>=4.51.3 and the latest diffusers from GitHub.
  • Multi-image input: Works best with 1–3 images; more inputs may reduce coherence.
  • Hardware: Full-precision inference needs a modern GPU, but community tools like DiffSynth-Studio enable 4 GB VRAM usage via offloading and FP8 quantization; a minimal offloading sketch follows.
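
Even before reaching for DiffSynth-Studio, diffusers' built-in offloading reduces VRAM pressure; a minimal sketch:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # move each submodule to the GPU only while it runs
# pipe.enable_sequential_cpu_offload()  # slower alternative with even lower peak VRAM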

Ecosystem and Deployment Options

You don’t need to build from scratch. Qwen-Image is already integrated across major platforms:

  • Hugging Face Spaces and Qwen Chat for no-code demos.
  • ModelScope for LoRA training, FP8 quantization, and low-memory inference.
  • ComfyUI, LiblibAI, and WaveSpeedAI for workflow integration.
  • LeMiCa and cache-dit for 3× faster inference without quality loss.

For teams, the Multi-GPU API Server (included in the repo) supports concurrent requests, automatic prompt enhancement, and aspect ratio handling out of the box.

Summary

Qwen-Image solves two of the most persistent challenges in generative AI: faithful multilingual text rendering and identity-preserving image editing. Backed by rigorous training strategies and dual-encoding architecture, it delivers production-ready quality for real-world applications—from global marketing campaigns to personalized design automation. With open weights, Apache 2.0 licensing, and deep ecosystem support, it’s one of the most practical open-source image foundation models available today.

If your project demands accurate text-in-image or reliable visual editing, Qwen-Image is worth your immediate evaluation.