Personalized image generation has long struggled with a fundamental trade-off: how to maintain strong identity fidelity while enabling flexible, high-quality…
OmniDocBench: A Real-World, Fine-Grained Benchmark for Fair and Comprehensive PDF Document Parsing Evaluation
Evaluating document parsing systems has long been a frustrating exercise in inconsistency. Many existing benchmarks focus narrowly on clean academic…
detrex: A Unified, Modular Benchmark for Detection Transformers—Accelerate Object Detection, Segmentation, and Pose Estimation Research
If you’re evaluating object detection frameworks for a new computer vision project, you’ve likely encountered the rise of DETR (Detection…
RepViT: Real-Time Mobile Vision with Pure CNN Speed and ViT-Level Accuracy
In the world of on-device computer vision, the tension between speed and accuracy has long defined what’s possible. Engineers building…
UniDepthV2: Zero-Shot Monocular Metric Depth Estimation That Works Across Real-World Domains
Monocular metric depth estimation (MMDE)—the task of predicting real-world depth values from a single RGB image—is foundational for 3D perception…
REINFORCE++: A Critic-Free RLHF Algorithm for Faster, More Robust LLM Alignment
Aligning large language models (LLMs) with human preferences is essential for building safe, helpful, and reliable AI systems. Reinforcement Learning…
Mini-Monkey: Fixing Fragmented Vision in Lightweight Multimodal Models with Smart Multi-Scale Cropping
When it comes to deploying multimodal large language models (MLLMs) in real-world applications—especially on cost-sensitive or edge devices—lightweight models are…
ViTPose: High-Accuracy, Scalable Pose Estimation Without Complex Custom Designs
Human and animal pose estimation has long relied on hand-crafted convolutional architectures, intricate post-processing, or task-specific modules. ViTPose changes that…
PP-HumanSeg: Real-Time, Connectivity-Aware Human Portrait Segmentation for Video Conferencing and Edge Applications
In the era of remote collaboration, virtual meetings have become the norm—making clean, real-time human portrait segmentation essential for professional…
StrongSORT: A High-Performance, Plug-and-Play Multi-Object Tracker for Real-World Video Applications
Multi-object tracking (MOT) is a cornerstone of modern computer vision systems—powering everything from autonomous vehicles to retail analytics and security…