Traditional generative adversarial networks (GANs) often act as “black boxes”—they produce compelling images but offer little insight into how those images are structured or how to reliably control specific elements within them. For practitioners working on tasks like synthetic data generation, layout-to-image translation, or interpretable scene modeling, this lack of structure and control is a major bottleneck.
Enter GANformer, a next-generation image generation architecture that rethinks how GANs model visual scenes. Built on a novel bipartite transformer structure, GANformer explicitly represents images as compositions of interacting objects, enabling not just high-fidelity synthesis but also step-by-step refinement, latent disentanglement, and fine-grained control—all while requiring significantly fewer training steps than models like StyleGAN2.
Whether you’re generating multi-object scenes for robotics simulation, creating editable urban layouts, or prototyping designs with interpretable generation paths, GANformer offers a transparent, efficient, and powerful alternative to monolithic GAN architectures.
Why GANformer Stands Out
Moving Beyond Flat Latent Spaces
Conventional GANs—especially StyleGAN variants—use a single global latent vector to modulate the entire image. While effective for single-object generation (e.g., faces), this approach struggles with complex scenes containing multiple interacting entities. The latent space is “flat,” offering no built-in notion of objecthood or spatial relationships.
GANformer replaces this with a compositional latent structure: instead of one latent vector, it uses k latent components (e.g., 8–16), each capable of specializing in different regions or objects. These components interact with the evolving image features through a bipartite attention mechanism, allowing the model to build scenes iteratively—first sketching a layout, then refining object details, depth, and dependencies.
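To make this concrete, here is a stripped-down PyTorch sketch of a simplex-style bipartite attention step, where every location in the image feature grid attends to the k latent components. The class and tensor names are illustrative; the real GANformer layers add modulation, positional encodings, and further machinery not shown here.

```python
import torch
import torch.nn as nn

class BipartiteAttention(nn.Module):
    """Simplified simplex attention: image features (queries) attend to
    k latent components (keys/values). Illustrative only."""
    def __init__(self, feat_dim, latent_dim, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, attn_dim)    # queries from image features
        self.to_k = nn.Linear(latent_dim, attn_dim)  # keys from latent components
        self.to_v = nn.Linear(latent_dim, feat_dim)  # values projected back to feature space
        self.scale = attn_dim ** -0.5

    def forward(self, feats, latents):
        # feats:   [batch, H*W, feat_dim]   -- flattened image feature grid
        # latents: [batch, k, latent_dim]   -- k latent components
        q = self.to_q(feats)
        k = self.to_k(latents)
        v = self.to_v(latents)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # [batch, H*W, k]
        # Each spatial location aggregates information from the components it attends to.
        return feats + attn @ v, attn

# Toy usage: a 16x16 feature grid and 8 latent components.
layer = BipartiteAttention(feat_dim=128, latent_dim=32)
feats = torch.randn(2, 16 * 16, 128)
latents = torch.randn(2, 8, 32)
out, attn_maps = layer(feats, latents)   # attn_maps show which regions each component controls
```

The duplex attention used in the full model also propagates information in the reverse direction, from the image features back to the latent components, so each component can update its view of the scene as the image develops.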
Two-Phase Generation: Plan, Then Execute
The GANformer2 variant (introduced in Compositional Transformers for Scene Generation) formalizes this process into two distinct stages:
- Planning Phase: A lightweight, high-level layout is drafted—positioning objects, estimating spatial relationships, and establishing scene structure.
- Execution Phase: Using attention-based refinement, the layout evolves into a full-resolution, photorealistic image, with each latent component modulating its designated region.
This mirrors how humans conceptualize scenes: not all at once, but through structured composition. The result? Better consistency across multi-object scenes and more interpretable generation trajectories.
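The sketch below conveys that flow in code. It is purely schematic and not the repository's implementation: a coarse feature grid is drafted first, then repeatedly upsampled and refined, with the latent components injecting information at every resolution through cross-attention.

```python
import torch
import torch.nn as nn

# Illustrative only: a plan-then-execute loop using a generic cross-attention layer.
k, latent_dim, feat_dim = 8, 64, 64
latents = torch.randn(1, k, latent_dim)

cross_attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=4, batch_first=True)
upsample = nn.Upsample(scale_factor=2, mode="nearest")

# "Planning": start from a small 4x4 grid whose cells query the latents for a coarse layout.
feats = torch.randn(1, 4 * 4, feat_dim)
feats, _ = cross_attn(query=feats, key=latents, value=latents)

# "Execution": grow the grid and keep refining it, with each region drawing on its components.
res = 4
for _ in range(3):                       # 4x4 -> 8x8 -> 16x16 -> 32x32
    grid = feats.transpose(1, 2).reshape(1, feat_dim, res, res)
    grid = upsample(grid)
    res *= 2
    feats = grid.reshape(1, feat_dim, res * res).transpose(1, 2)
    feats, attn = cross_attn(query=feats, key=latents, value=latents)

print(feats.shape, attn.shape)           # final feature grid and per-component attention maps
```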
Efficiency Without Sacrificing Quality
One of GANformer’s most practical advantages is its training efficiency. Pretrained models achieve state-of-the-art FID scores on benchmarks like CLEVR, LSUN-Bedrooms, FFHQ, and Cityscapes after only 5,000–15,000 kimg (thousands of real images shown during training), roughly 5–7× less training than StyleGAN2 typically requires. This means you can get high-quality results faster, with less computational overhead, making experimentation and iteration more accessible.
Real-World Use Cases
Synthetic Data for Perception Systems
Autonomous vehicles, robotics, and AR/VR applications often need large volumes of labeled, diverse, and physically plausible scene data. GANformer excels at generating structured multi-object scenes (e.g., Cityscapes-style urban environments or CLEVR-like object arrangements) with consistent object placement and depth ordering—critical for training robust perception models.
Controllable Image Editing & Layout-to-Image Synthesis
Because GANformer’s latent components map to distinct visual regions, you can manipulate individual objects or scene attributes without affecting the whole image. The model also natively supports conditional generation, such as turning a semantic layout into a photorealistic image—ideal for design prototyping or content creation workflows.
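In practice, per-object editing boils down to resampling (or interpolating) a single latent component while holding the others fixed. The pattern below assumes a generator callable that maps a [batch, k, latent_dim] tensor of components to images; `generator` is a placeholder here, since the loading utilities differ between the PyTorch and TensorFlow versions of the repository, so treat this as a pattern rather than a recipe.

```python
import torch

# Pattern sketch: resample ONE latent component and keep the rest fixed,
# so only the corresponding object/region changes in the output.
# `generator` is a placeholder for a loaded GANformer generator.

def edit_component(generator, latents, component_idx):
    edited = latents.clone()
    edited[:, component_idx] = torch.randn_like(edited[:, component_idx])
    return generator(edited)

k, latent_dim = 16, 32
latents = torch.randn(1, k, latent_dim)
# original = generator(latents)                                    # baseline image
# variant  = edit_component(generator, latents, component_idx=3)   # only one region changes
```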
Applications Demanding Interpretability
In scientific visualization, medical imaging, or safety-critical systems, “black-box” generation is often unacceptable. GANformer’s step-by-step refinement and attention maps provide visual explainability: you can trace how each latent component contributes to the final output, enabling debugging, validation, and user trust.
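If you capture those attention weights (one [H*W, k] map per attention layer, as in the earlier sketch), turning them into per-component heatmaps takes only a few lines. The array below is random stand-in data purely to show the plotting pattern:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative: visualize per-component attention maps. `attn` stands in for
# attention weights captured from the model, here random data shaped [H*W, k].
H = W = 16
k = 8
attn = np.random.rand(H * W, k)
attn /= attn.sum(axis=1, keepdims=True)          # rows sum to 1, like softmax output

fig, axes = plt.subplots(1, k, figsize=(2 * k, 2))
for i, ax in enumerate(axes):
    ax.imshow(attn[:, i].reshape(H, W), cmap="viridis")   # where component i attends
    ax.set_title(f"component {i}")
    ax.axis("off")
plt.tight_layout()
plt.savefig("attention_maps.png")
```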
Getting Started Quickly
GANformer is designed for immediate experimentation. The repository provides:
- Pretrained models for resolutions up to 1024×1024 (FFHQ) and 1024×2048 (Cityscapes).
- Support for both PyTorch and TensorFlow, with nearly identical interfaces.
- A minimal `generate.py` script that downloads and runs a pretrained model in seconds:

```
python generate.py --gpus 0 --model gdrive:bedrooms-snapshot.pkl --output-dir images --images-num 32
```
Key parameters like --truncation-psi (typically 0.6–1.0) let you balance image diversity and quality on the fly.
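For example, reusing the flags above, a lower truncation value trades some sample diversity for more typical, higher-fidelity outputs:

```
python generate.py --gpus 0 --model gdrive:bedrooms-snapshot.pkl --output-dir images --images-num 32 --truncation-psi 0.7
```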
For custom datasets, the prepare_data.py script supports common formats (PNG, JPG, HDF5, LMDB) and automatically adapts to your image resolution. Training a new model from scratch or fine-tuning a pretrained one requires just a single command via run_network.py.
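A typical training launch might look like the following; the flag names are an assumption based on the repository's conventions rather than a verified recipe, so confirm them against the script's `--help` output before running:

```
# Assumed flags; check `python run_network.py --help` for the exact interface.
python run_network.py --train --gpus 0 --ganformer-default --expname clevr-exp --dataset clevr
```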
Practical Limitations
While powerful, GANformer isn’t a universal replacement for all GANs:
- GPU memory intensive: High-resolution models (e.g., 1024×1024) often require `--batch-size 1` to fit on a 12GB GPU.
- CUDA dependency: Custom ops (e.g., for up/downsampling) must be compiled via NVCC, requiring matched CUDA and PyTorch/TensorFlow versions.
- Best for structured scenes: If your task involves single-object generation (e.g., face synthesis), StyleGAN2 may still be simpler and sufficient.
- Image-only: The model is designed for 2D image generation—not video, 3D, or audio.
That said, for any project involving compositional visual content, GANformer’s blend of control, efficiency, and interpretability makes it a compelling choice.
Summary
GANformer redefines image generation by embedding compositional reasoning directly into the GAN architecture. Through its bipartite transformer design, iterative refinement process, and disentangled latent control, it solves key pain points in modern generative modeling: lack of structure, poor multi-object consistency, excessive training times, and uninterpretable outputs.
With pretrained models, dual framework support, and clear pathways for customization, it’s ready for real-world adoption—whether you’re building synthetic datasets, designing controllable editors, or pushing the boundaries of interpretable AI. If your work involves generating complex, structured visual scenes, GANformer deserves a spot in your toolkit.