PaperCodex

Mini-Monkey: Fixing Fragmented Vision in Lightweight Multimodal Models with Smart Multi-Scale Cropping 1923

When it comes to deploying multimodal large language models (MLLMs) in real-world applications—especially on cost-sensitive or edge devices—lightweight models are…

12/22/2025 Document Understanding, Multimodal Reasoning, Optical Character Recognition (OCR)

ViTPose: High-Accuracy, Scalable Pose Estimation Without Complex Custom Designs 1859

Human and animal pose estimation has long relied on hand-crafted convolutional architectures, intricate post-processing, or task-specific modules. ViTPose changes that…

12/22/2025 Animal Pose Estimation, Human Pose Estimation, Whole-Body Pose Estimation

PP-HumanSeg: Real-Time, Connectivity-Aware Human Portrait Segmentation for Video Conferencing and Edge Applications 9242

In the era of remote collaboration, virtual meetings have become the norm—making clean, real-time human portrait segmentation essential for professional…

12/22/2025 Human Portrait Segmentation, Real-time Semantic Segmentation, Video Conferencing Background Replacement

StrongSORT: A High-Performance, Plug-and-Play Multi-Object Tracker for Real-World Video Applications 3832

Multi-object tracking (MOT) is a cornerstone of modern computer vision systems—powering everything from autonomous vehicles to retail analytics and security…

12/22/2025 MOT, Multi-Object Tracking, Video Object Tracking

D-FINE: Real-Time Object Detection with DETR-Level Accuracy and No Inference Overhead 2756

Object detection has long faced a fundamental trade-off: high accuracy or real-time speed—but rarely both. Enter D-FINE, a breakthrough real-time…

12/22/2025 DETR-based Models, Object Detection, Real-Time Inference

ClearerVoice-Studio: A Practical, All-in-One Toolkit for Real-World Speech Enhancement, Separation, and Speaker Extraction 3717

In today’s audio-rich digital landscape—spanning call centers, video conferencing, voice assistants, and multimedia content—clean, high-quality speech isn’t a luxury; it’s…

12/20/2025 Speaker Extraction, Speech Enhancement, Speech Separation

OpenSTL: A Standardized, Reproducible Benchmark for Spatio-Temporal Forecasting Across Video, Weather, and Traffic Domains 1030

Spatio-temporal predictive learning aims to forecast future states—like video frames, weather maps, or traffic patterns—based solely on past observations, typically…

12/20/2025 Spatio-temporal Forecasting, Time-series Forecasting, Video Prediction

AirSLAM: Robust Visual SLAM for Real-World Lighting Changes – Point-Line Fusion, Real-Time Speed, and Embedded Deployment 1101

Imagine deploying an autonomous robot in a warehouse that shifts from bright daylight to dim artificial lighting—or a drone navigating…

12/19/2025 Illumination-Robust Localization, Point-Line Feature Fusion, Visual SLAM

DEIM: Slash DETR Training Time by 50% Without Sacrificing Accuracy for Real-Time Object Detection 1348

Real-time object detection has become a cornerstone of modern computer vision applications—from autonomous vehicles and robotics to industrial inspection and…

12/19/2025 DETR Acceleration, Real-time Object Detection, Transformer-based Detection

Instruction Pre-Training: Boost Language Model Performance from Day One with Supervised Multitask Pre-Training 4150

Traditional language model (LM) development follows a two-stage process: unsupervised pre-training on massive raw text corpora, followed by instruction tuning…

12/19/2025 Instruction Tuning, Language Model Pre-training, Multitask Learning