Depth estimation from a single RGB image—monocular depth estimation—is a foundational task in computer vision with far-reaching implications in robotics, augmented reality, autonomous driving, and 3D content creation. Yet, most existing solutions force a difficult trade-off: either deliver depth that looks plausible but lacks real-world scale (relative depth), or achieve metric accuracy but only on a specific dataset after extensive fine-tuning (metric depth).
Enter ZoeDepth, an open-source framework that bridges this gap for the first time. Built on the insight that relative and metric depth cues are complementary, ZoeDepth combines both in a unified architecture to deliver metrically scaled depth maps that generalize zero-shot across diverse indoor and outdoor environments—without retraining.
Its flagship model, ZoeD-NK, is pre-trained for relative depth on twelve datasets and then fine-tuned for metric depth on both NYU Depth v2 (indoor) and KITTI (outdoor). This two-stage recipe lets it maintain high metric accuracy while generalizing unusually well to unseen scenes. For practitioners who need reliable depth from off-the-shelf models, without juggling multiple domain-specific pipelines, ZoeDepth offers a rare balance of accuracy, versatility, and ease of use.
Why ZoeDepth Stands Out
Zero-Shot Generalization Without Sacrificing Metric Scale
Traditional metric depth models degrade sharply when applied outside their training domain. Conversely, relative depth models preserve structure but lack physical scale. ZoeDepth eliminates this dichotomy by introducing a novel metric bins module and a latent classifier that routes inputs to the appropriate domain-specific head (e.g., indoor vs. outdoor).
As a result, ZoeD-NK achieves state-of-the-art performance on NYU Depth v2, a 21% improvement in absolute relative error (REL) over the previous best method, while simultaneously delivering strong results on KITTI and on eight unseen datasets spanning diverse environments, from city streets to forest trails. This zero-shot capability is invaluable when deploying systems in dynamic, real-world settings where labeled data isn't available.
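To make the routing idea concrete, here is a minimal, hypothetical sketch of a two-head metric decoder guarded by a latent classifier. The module names, shapes, and the argmax routing are illustrative assumptions for this article; they do not reproduce ZoeDepth's actual metric bins module.
import torch
import torch.nn as nn

class TwoHeadMetricDecoder(nn.Module):
    """Illustrative only: routes shared relative-depth features to a domain-specific metric head."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # One small metric head per training domain (e.g., indoor / outdoor)
        self.heads = nn.ModuleDict({
            "indoor": nn.Conv2d(feat_dim, 1, kernel_size=1),
            "outdoor": nn.Conv2d(feat_dim, 1, kernel_size=1),
        })
        # Latent classifier: predicts which domain the input belongs to
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_dim, len(self.heads)),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Pick the most likely domain from the bottleneck features
        domain_idx = self.router(feats).argmax(dim=-1)[0].item()
        domain = list(self.heads.keys())[domain_idx]
        # The selected head maps shared features to metric depth (meters)
        return torch.relu(self.heads[domain](feats))

feats = torch.randn(1, 256, 24, 32)    # stand-in for decoder features
depth = TwoHeadMetricDecoder()(feats)  # (1, 1, 24, 32) metric depth map
The key point is that the heavy lifting is done by a shared relative-depth backbone, while only lightweight heads are domain-specific, which is what lets a single checkpoint serve both indoor and outdoor inputs.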
Ready-to-Use Models via Torch Hub
ZoeDepth prioritizes accessibility. Pre-trained models are available directly through PyTorch Hub with just one line of code:
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)
This eliminates complex setup routines. Whether you’re processing local images, loading from a URL, or integrating into a larger pipeline, inference is consistent and straightforward. Outputs can be returned as NumPy arrays, PIL images (16-bit depth maps), or PyTorch tensors—making it compatible with most vision workflows.
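Continuing from the model object loaded above, the output format is selected with a keyword argument; the output_type name below follows the repository README, so treat it as an assumption if your installed revision differs.
from PIL import Image
image = Image.open("scene.jpg").convert("RGB")               # any local RGB image
depth_numpy = model.infer_pil(image)                         # NumPy array (default)
depth_pil = model.infer_pil(image, output_type="pil")        # 16-bit PIL depth map
depth_tensor = model.infer_pil(image, output_type="tensor")  # PyTorch tensor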
Flexible Model Variants for Different Needs
ZoeDepth ships with three primary variants:
- ZoeD-N: Optimized for indoor scenes (trained on NYU Depth v2).
- ZoeD-K: Tuned for outdoor driving scenarios (trained on KITTI).
- ZoeD-NK: A multi-headed model that supports both domains in a single checkpoint, automatically selecting the best head at inference time.
This flexibility lets developers choose between specialization and universality based on their application’s scope.
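For reference, the three variants correspond to separate Torch Hub entry points; a minimal sketch of loading each one (entry-point names as published in the isl-org/ZoeDepth repository):
import torch
zoe_n = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)    # indoor (NYU Depth v2)
zoe_k = torch.hub.load("isl-org/ZoeDepth", "ZoeD_K", pretrained=True)    # outdoor (KITTI)
zoe_nk = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True)  # both domains, auto-routed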
Ideal Use Cases
ZoeDepth is particularly valuable in scenarios where:
- Sensor constraints limit hardware options: You only have a standard RGB camera (e.g., on a drone, mobile phone, or consumer robot) but still need depth for navigation or scene understanding.
- Deployment spans multiple environments: Your application moves between indoor offices and outdoor urban areas—ZoeD-NK handles both without switching models.
- Rapid prototyping is essential: You need a working depth pipeline in minutes, not weeks of training and tuning.
- Metric scale matters: Applications like AR object placement, 3D reconstruction, or robot grasping require depth in real-world units (meters), not just relative ordering.
Examples include autonomous delivery bots navigating warehouses and sidewalks, creators generating 3D assets from smartphone photos, and researchers benchmarking perception systems across diverse domains.
Getting Started in Minutes
With a GPU (or even a CPU for lightweight use), you can run depth inference in just a few lines of code:
import torch
from PIL import Image
# Use a GPU if one is available; CPU inference also works, just more slowly
device = "cuda" if torch.cuda.is_available() else "cpu"
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True).to(device)
image = Image.open("scene.jpg").convert("RGB")
depth = zoe.infer_pil(image)  # Returns a metric depth map in meters (NumPy array)
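From there, the prediction is just a NumPy array. As one hedged example of persisting it, a common convention (an assumption here, not a ZoeDepth requirement) is to store depth as a 16-bit PNG in millimeters:
import numpy as np
from PIL import Image
# Convert metric depth from meters to millimeters and clip to the 16-bit range
depth_mm = np.clip(depth * 1000.0, 0, 65535).astype(np.uint16)
Image.fromarray(depth_mm).save("scene_depth_mm.png")  # 16-bit grayscale PNG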
For non-developers or quick testing, the included Gradio demo provides a web-based UI to upload images and instantly visualize depth predictions—no coding required.
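The repository ships its own demo, so you do not need to build one; purely for illustration, a stripped-down wrapper along the same lines (not the project's actual demo code) could look like this:
import gradio as gr
import numpy as np
import torch
from PIL import Image
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True).eval()
def predict(img: Image.Image) -> Image.Image:
    depth = zoe.infer_pil(img)                                         # metric depth in meters
    norm = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # rescale for display
    return Image.fromarray((norm * 255).astype(np.uint8))
gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs=gr.Image()).launch()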
Environment setup is equally streamlined via environment.yml, which installs all dependencies (PyTorch, timm, OpenCV, etc.) using conda or mamba.
Practical Considerations
While ZoeDepth is powerful, users should be aware of a few limitations:
- No active maintenance: Intel has discontinued official support. The project is stable but future bug fixes or updates will require community forks.
- Single-image input only: It does not support video sequences or temporal smoothing, so flickering may occur in frame-by-frame video processing (a simple smoothing workaround is sketched after this list). Multi-view or stereo fusion is also unsupported.
- GPU recommended: Although CPU inference works, real-time performance typically requires a CUDA-enabled GPU.
- Domain adaptation may still be needed: While zero-shot performance is strong, highly specialized domains (e.g., underwater imaging or medical endoscopy) may benefit from fine-tuning on task-specific data.
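On the video point above: a common do-it-yourself mitigation, not a feature of ZoeDepth itself, is to exponentially smooth successive per-frame predictions. The helper below is a hypothetical sketch that assumes frames arrive as same-resolution PIL images:
import torch
# Hypothetical workaround: exponential moving average over per-frame depth maps
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_NK", pretrained=True).eval()
def smoothed_depths(frames, alpha=0.8):
    """Yield temporally smoothed metric depth maps for an iterable of PIL frames."""
    ema = None
    for frame in frames:
        depth = zoe.infer_pil(frame)  # independent per-frame prediction (NumPy)
        ema = depth if ema is None else alpha * ema + (1 - alpha) * depth
        yield ema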
Summary
ZoeDepth redefines what’s possible in monocular depth estimation by unifying relative and metric learning into a single, generalizable framework. With its zero-shot transfer capability, metric accuracy, and plug-and-play usability, it empowers engineers and researchers to deploy robust depth perception systems across real-world domains—without the overhead of dataset-specific retraining. For anyone working with RGB-only vision systems who needs depth that’s both structurally coherent and physically meaningful, ZoeDepth is a compelling, production-ready choice.