Image-to-image translation is a foundational capability in computer vision, enabling applications from photo editing to 3D scene understanding. Yet many existing approaches suffer from slow inference (e.g., iterative diffusion models), fragmented workflows (separate models per task), or limited control (e.g., fixed outputs with no user guidance). Enter LBM (Latent Bridge Matching), a novel framework introduced in the paper “LBM: Latent Bridge Matching for Fast Image-to-Image Translation” (ICCV 2025 Highlight) that tackles these pain points head-on.
LBM delivers state-of-the-art results across diverse image translation tasks in a single inference step, all within a unified architecture. Whether you need to remove objects, estimate depth, relight scenes, or generate shadows on demand, LBM provides a fast, scalable, and surprisingly simple solution—ideal for researchers, developers, and engineers seeking efficiency without sacrificing quality.
Why LBM Changes the Game
One Inference Step, Multiple Capabilities
Unlike diffusion-based or GAN-based translators that often require dozens of denoising steps or complex adversarial training, LBM operates in latent space using a technique called Bridge Matching. This allows it to directly map source images to target outputs in one forward pass—dramatically accelerating inference while maintaining high fidelity.
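To make the single-step idea concrete, here is a minimal, self-contained PyTorch sketch of the pattern: encode the source image into a latent, run one forward pass of a bridge network, and decode the result. TinyAutoencoder and BridgeNet are hypothetical stand-ins for illustration only, not LBM’s actual architecture or weights.

import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    # Stand-in for a pretrained VAE that maps images to and from latent space.
    def __init__(self, channels=3, latent=4):
        super().__init__()
        self.enc = nn.Conv2d(channels, latent, kernel_size=4, stride=4)
        self.dec = nn.ConvTranspose2d(latent, channels, kernel_size=4, stride=4)

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)

class BridgeNet(nn.Module):
    # Stand-in for the network that transports source latents toward target latents.
    def __init__(self, latent=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent, 3, padding=1),
        )

    def forward(self, z_source):
        return self.net(z_source)

ae, bridge = TinyAutoencoder(), BridgeNet()
source = torch.rand(1, 3, 256, 256)        # source image, values in [0, 1]
with torch.no_grad():
    z_src = ae.encode(source)              # encode once
    z_tgt = bridge(z_src)                  # a single forward pass, no iterative sampling
    output = ae.decode(z_tgt).clamp(0, 1)  # decoded translated image

The contrast with diffusion is the single bridge(z_src) call where a sampler would otherwise loop over many denoising steps.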
For teams building real-time applications or iterating rapidly in research, this speed translates into tangible gains: faster prototyping, lower compute costs, and smoother user experiences.
A Single Model, Many Tasks
LBM isn’t narrowly specialized. The same core methodology adapts seamlessly to:
- Object removal (translating masked images to clean, object-free outputs)
- Image restoration (mapping degraded inputs to pristine reconstructions)
- Normal and depth estimation (predicting 3D surface geometry from 2D images)
- Object relighting (adjusting illumination on foreground objects)
- Controllable shadow generation (via its conditional variant)
This versatility eliminates the need to maintain separate pipelines for each task—a common operational burden in vision systems.
Controllable Outputs When You Need Them
While LBM works effectively in an unconditional mode, it also offers a conditional framework that empowers user-guided translation. For instance, in relighting tasks, you can specify desired lighting conditions, and LBM will produce correspondingly lit images with realistic shadows—enabling creative or technical control that generic models lack.
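As a rough illustration of how such conditioning can be wired in (the interface of LBM’s actual conditional variant may differ), a lighting condition can be encoded to the latent resolution and concatenated channel-wise before the single forward pass. Everything below is a hypothetical sketch:

import torch
import torch.nn as nn

# Hypothetical conditional bridge: 4 source-latent channels plus 4 condition
# channels in, 4 target-latent channels out.
cond_bridge = nn.Sequential(
    nn.Conv2d(4 + 4, 64, 3, padding=1), nn.SiLU(),
    nn.Conv2d(64, 4, 3, padding=1),
)

z_src = torch.randn(1, 4, 64, 64)    # source latent
z_cond = torch.randn(1, 4, 64, 64)   # encoded lighting condition (assumed same spatial size)
z_tgt = cond_bridge(torch.cat([z_src, z_cond], dim=1))  # still one forward pass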
Getting Started Is Effortless
LBM is designed for immediate usability:
- Install: Set up a Python 3.10 environment and install the package in editable mode:
pip install -e .
- Run Pre-Trained Models: Use a single command to perform depth estimation, normal prediction, or relighting:
python examples/inference/inference.py --model_name depth --source_image your_image.jpg --output_path results/
- Interact with the Gradio Demo: Launch a local web interface for visual experimentation:
python examples/inference/gradio_demo.py
All pre-trained checkpoints (for depth, normals, and relighting) are hosted on Hugging Face Hub and downloaded automatically—no manual model wrangling required. For common tasks, you can go from zero to inference in under five minutes.
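To process a whole folder rather than a single image, a small wrapper around the CLI shown above is enough; the flags used here are only the ones from the example command, and the alternative model names are assumptions to verify against examples/inference/inference.py.

import subprocess
from pathlib import Path

input_dir = Path("my_images")    # folder of source images (placeholder path)
output_dir = Path("results")
output_dir.mkdir(exist_ok=True)

for image_path in sorted(input_dir.glob("*.jpg")):
    # Reuses the repository's inference script; swap --model_name to run the
    # normals or relighting checkpoints (check the repo for the exact names).
    subprocess.run(
        [
            "python", "examples/inference/inference.py",
            "--model_name", "depth",
            "--source_image", str(image_path),
            "--output_path", str(output_dir),
        ],
        check=True,
    )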
Real-World Use Cases That Matter
Streamlined Object Removal
Instead of relying on complex inpainting workflows that struggle with structure and texture coherence, LBM treats object removal as a distribution transport problem: it learns to map masked inputs directly to complete, realistic outputs—often with fewer artifacts and greater consistency.
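As an example of what a masked input can look like in practice, the snippet below blanks out the object region using a white-on-black mask before handing the image to the model. The masking convention expected by the released checkpoints may differ, so treat this as an assumption and verify against the repository's object-removal example.

import numpy as np
from PIL import Image

image = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=np.float32) / 255.0
mask = np.asarray(Image.open("mask.png").convert("L"), dtype=np.float32) / 255.0

# Zero out the region to be removed; the model then maps this masked input
# to the distribution of clean, object-free images.
masked = image * (1.0 - mask[..., None])
Image.fromarray((masked * 255).astype(np.uint8)).save("masked_input.png")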
Instant 3D Geometry from 2D
For robotics, AR, or scene reconstruction, LBM’s depth and normal estimators provide high-quality geometric cues from a single RGB image—without requiring stereo cameras or LiDAR. This makes it valuable for lightweight perception systems.
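Downstream geometry code usually wants unit normal vectors rather than a color-coded image. A typical conversion from an 8-bit normal-map PNG is shown below; the file name is a placeholder, and the [0, 255] to [-1, 1] mapping and axis conventions are common defaults rather than something the LBM repository guarantees, so confirm against its outputs.

import numpy as np
from PIL import Image

# Load a predicted normal map (RGB-encoded) and map it back to unit vectors.
rgb = np.asarray(Image.open("results/normal.png").convert("RGB"), dtype=np.float32)
normals = rgb / 255.0 * 2.0 - 1.0                                   # [0, 255] -> [-1, 1]
normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8   # re-normalize

print(normals.shape)  # (H, W, 3) unit normals ready for shading or geometry checks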
Dynamic Relighting and Shadow Control
In product photography, virtual try-ons, or visual effects, controlling how objects are lit is crucial. LBM’s conditional mode allows precise manipulation of lighting direction and intensity, including realistic shadow casting—something typically reserved for physically based renderers or expensive manual editing.
When to Use LBM—and When to Think Twice
LBM excels in non-commercial research, academic projects, and internal tooling, thanks to its impressive speed and multi-task performance. However, note two key constraints:
- License: The code is released under the Creative Commons BY-NC 4.0 license, meaning it cannot be used in commercial products without explicit permission.
- Custom Training Complexity: While inference is plug-and-play, training your own LBM model requires data packaged as WebDataset (.tar) shards with specific per-sample key naming (e.g., jpg,normal.png,mask.png); a packing sketch follows below. You’ll also need to edit YAML config files, which is feasible for ML engineers but not “zero-code.”
Thus, LBM is ideal if you’re leveraging pre-trained models or have the infrastructure to prepare WebDataset-formatted data.
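To give a sense of the data-preparation step, the snippet below packs paired samples into a WebDataset shard with the webdataset library, using keys that match the naming pattern above. The directory layout and exact keys are placeholders; check them against the training configs in the repository.

from pathlib import Path
import webdataset as wds

# Assumed layout: paired files named <id>.jpg, <id>.normal.png, <id>.mask.png.
image_dir = Path("train/images")

with wds.TarWriter("train-000000.tar") as sink:
    for jpg_path in sorted(image_dir.glob("*.jpg")):
        key = jpg_path.stem
        sink.write({
            "__key__": key,
            "jpg": jpg_path.read_bytes(),                                   # source image
            "normal.png": (image_dir / f"{key}.normal.png").read_bytes(),   # target normals
            "mask.png": (image_dir / f"{key}.mask.png").read_bytes(),       # optional mask
        })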
How LBM Stacks Up Against Alternatives
Compared to traditional image translation methods:
- Faster than diffusion models: No iterative sampling; just one neural net pass.
- More unified than GAN ensembles: One architectural paradigm serves multiple tasks.
- More controllable than black-box translators: Conditional variants enable user-specified outputs.
For prototyping, benchmarking, or building non-commercial vision tools that demand both speed and quality, LBM offers a compelling middle ground between simplicity and performance.
Summary
LBM redefines what’s possible in fast, multi-task image translation. By combining single-step inference, broad task coverage, and optional controllability, it addresses critical bottlenecks in both research and applied settings. With easy setup, pre-trained models readily available, and a clean codebase, it lowers the barrier to high-quality image transformation—making it a strong candidate for your next computer vision project, as long as your use case aligns with its non-commercial license.