OmDet is a breakthrough in open-vocabulary object detection (OVD)—a vision-language paradigm that enables models to recognize not just pre-defined object categories, but any object described via natural language. While earlier transformer-based OVD models (like Grounding DINO) demonstrated impressive zero-shot capabilities, they often came at the cost of slow inference, limiting real-world applicability.
Enter OmDet-Turbo, the latest evolution from the OmDet project. It delivers state-of-the-art zero-shot detection performance while achieving over 100 frames per second (FPS) on an A100 GPU when optimized with TensorRT and language caching. This rare combination of speed, accuracy, and open-vocabulary flexibility makes OmDet-Turbo uniquely suited for time-sensitive industrial applications—from robotics and autonomous systems to smart manufacturing and surveillance—where detecting both known and novel objects in real time is critical.
Why OmDet Stands Out: Speed Meets Open-Vocabulary Intelligence
Traditional object detectors require extensive labeled datasets for each new object class. Retraining for every new use case is costly and impractical. Open-vocabulary detectors bypass this by leveraging pre-trained vision-language models (like CLIP) to understand object categories through text prompts—enabling zero-shot inference without fine-tuning.
However, most OVD models sacrifice speed for this flexibility. OmDet-Turbo changes that equation. Its Efficient Fusion Head (EFH) rethinks how visual and textual features are combined. Instead of computationally heavy cross-attention in the detection head (a bottleneck in models like Grounding DINO), EFH streamlines multimodal fusion, drastically reducing latency while preserving detection quality.
The results speak for themselves:
- 30.1 AP on ODinW and 26.86 NMS-AP on OVDEval—new state-of-the-art in zero-shot benchmarks
- 42.5 AP on COCO and 30.3 AP on LVIS—performance rivaling supervised models
- 100.2 FPS (with TensorRT) on OmDet-Turbo-Base—real-time speed previously unseen in transformer-based OVD
This isn’t just academic progress—it’s a practical leap for deployment.
Real-World Scenarios Where OmDet Excels
OmDet-Turbo shines in environments where adaptability and speed are non-negotiable:
- Industrial Automation: Detecting arbitrary parts on a conveyor belt using simple text prompts like “red plastic cap” or “metal bracket type-B”—no retraining needed when product lines change.
- Autonomous Vehicles & Drones: Identifying rare or unseen obstacles (“fallen tree,” “construction cone”) in real time using natural language descriptions from navigation systems.
- Retail & Inventory Management: Recognizing new product SKUs the moment they appear on shelves, simply by querying their name or description.
- Robotic Manipulation: Enabling robots to locate and interact with objects specified verbally or via text in dynamic, unstructured settings.
Because OmDet operates in an open-vocabulary regime, it eliminates the need for collecting and annotating thousands of images for every new object class—a major operational bottleneck in traditional computer vision pipelines.
Technical Innovations: Solving the OVD Speed-Accuracy Trade-Off
OmDet-Turbo directly addresses three core pain points in modern object detection:
- Slow inference in open-vocabulary models: Grounding DINO and similar architectures often run below 10 FPS. OmDet-Turbo's EFH removes the computationally heavy cross-attention from the detection head, cutting per-frame latency.
- Dependency on closed-set training data: Unlike YOLO or Faster R-CNN variants, OmDet doesn’t require category-specific annotations for deployment—just a text prompt.
- Inefficient multimodal integration: Earlier fusion strategies redundantly re-encode language embeddings for every image. OmDet-Turbo applies language caching, reusing text embeddings across frames and detections to cut redundant computation (see the sketch after this list).
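To make the caching idea concrete, here is a minimal sketch of the pattern, using a generic CLIP text encoder from Hugging Face Transformers as a stand-in for OmDet-Turbo's internal text branch. The cache structure and function names are illustrative, not the project's actual implementation:

```python
# Minimal sketch of language caching. Assumption: a generic CLIP text
# encoder stands in for OmDet-Turbo's text branch; names are illustrative.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

_text_cache: dict[tuple[str, ...], torch.Tensor] = {}

def get_text_embeddings(prompts: list[str]) -> torch.Tensor:
    """Encode prompts once; later frames with the same prompts hit the cache."""
    key = tuple(prompts)
    if key not in _text_cache:
        inputs = tokenizer(prompts, padding=True, return_tensors="pt")
        with torch.no_grad():
            _text_cache[key] = model.get_text_features(**inputs)
    return _text_cache[key]

prompts = ["red plastic cap", "metal bracket type-B"]
emb_frame_1 = get_text_embeddings(prompts)  # text encoder runs here
emb_frame_2 = get_text_embeddings(prompts)  # cache hit: no encoder call
assert emb_frame_2 is emb_frame_1
```

In a video loop the prompt list rarely changes between frames, so the text encoder runs once and every subsequent frame pays only for the visual path.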
These innovations are not theoretical—they’re engineered for real GPU workloads and optimized for production via TensorRT and ONNX export.
Getting Started: Simple Integration Paths
OmDet is designed for ease of adoption:
- Local Inference: Download the pretrained model and CLIP checkpoints, place them in a `resources` folder, and run `run_demo.py`. Outputs are saved as annotated images in `./outputs`.
- API Deployment: Launch a REST server with `run_wsgi.py`. The endpoint `/inf_predict` accepts image-text pairs, and interactive docs are available at `/docs`.
- Production Optimization: Export to ONNX using `export.py` (with configurable input sizes and optional post-processing), enabling integration into edge or cloud inference pipelines.
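For the ONNX path, a minimal loading sketch with `onnxruntime` might look like the following. The file name is an assumption, and since input names and shapes depend on how `export.py` was configured, the sketch inspects the graph rather than hard-coding them:

```python
# Minimal sketch of loading an exported model with onnxruntime.
# Assumptions: the output path "omdet_turbo.onnx" and float32 inputs;
# actual input names/shapes depend on the export.py configuration.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "omdet_turbo.onnx",  # assumed path to the exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Build a dummy feed from the graph's declared inputs instead of
# hard-coding names; symbolic dimensions are resolved to 1.
feed = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feed[inp.name] = np.zeros(shape, dtype=np.float32)

outputs = session.run(None, feed)
print([o.shape for o in outputs])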
Since September 2024, OmDet-Turbo has also been natively supported in Hugging Face Transformers (v4.45.0 and later), simplifying model loading and interoperability with existing workflows.
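With Transformers installed, zero-shot detection follows the library's standard pattern. The snippet below mirrors the v4.45 documentation example; argument names may differ in later releases, so check the docs for your version:

```python
# Zero-shot detection with the Transformers v4.45 API for OmDet-Turbo.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
classes = ["cat", "remote"]  # any free-form text prompts work here
inputs = processor(image, text=classes, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes to thresholded, NMS-filtered detections.
results = processor.post_process_grounded_object_detection(
    outputs,
    classes=classes,
    target_sizes=[image.size[::-1]],
    score_threshold=0.3,
    nms_threshold=0.3,
)[0]
for score, class_name, box in zip(results["scores"], results["classes"], results["boxes"]):
    print(f"{class_name}: {score.item():.2f} at {[round(c, 1) for c in box.tolist()]}")
```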
Limitations and Practical Considerations
While OmDet-Turbo is powerful, prospective users should consider:
- Hardware requirements: The 100+ FPS claim assumes an A100 GPU with TensorRT. Performance scales with available compute—lower-end GPUs will yield reduced throughput.
- Model variants: Currently, only Tiny and Base sizes are publicly released. Larger variants may offer higher accuracy but at reduced speed.
- ONNX export constraints: By default, exported ONNX models use fixed input resolutions. Dynamic shapes require manually modifying the export script (see the sketch after this list).
- Language dependency: OmDet relies on CLIP for text understanding, so prompt quality and semantic alignment with visual concepts directly impact detection quality.
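For teams that need variable input sizes, the relevant change is passing `dynamic_axes` to `torch.onnx.export`, roughly as sketched below. The toy model merely stands in for OmDet-Turbo, and the tensor and axis names are assumptions, not the script's actual values:

```python
# Minimal, self-contained sketch of exporting with dynamic spatial axes,
# the kind of change export.py would need for variable input resolutions.
# The Conv2d is a stand-in network; tensor/axis names are assumptions.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()
dummy = torch.zeros(1, 3, 640, 640)

torch.onnx.export(
    model,
    (dummy,),
    "model_dynamic.onnx",
    input_names=["image"],
    output_names=["features"],
    dynamic_axes={
        "image": {0: "batch", 2: "height", 3: "width"},
        "features": {0: "batch", 2: "height", 3: "width"},
    },
)
```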
These factors don’t diminish OmDet’s value—they simply help teams assess fit for their specific latency, accuracy, and infrastructure constraints.
Summary
OmDet—particularly the OmDet-Turbo variant—successfully bridges the gap between cutting-edge open-vocabulary detection and real-time industrial deployment. By introducing the Efficient Fusion Head and leveraging language caching, it achieves unprecedented speed without compromising zero-shot accuracy. For technical decision-makers evaluating object detection solutions that must adapt to novel categories on the fly while running at video rates, OmDet offers a compelling, production-ready answer. With support in Hugging Face Transformers, ONNX export, and a straightforward API, it lowers the barrier to integrating state-of-the-art OVD into real-world systems.