YOLOv6 is a high-performance, single-stage object detection framework developed by Meituan with a strong emphasis on real-world industrial applications. Unlike…
MME: The First Comprehensive Benchmark to Objectively Evaluate Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have captured the imagination of researchers and developers alike—promising capabilities like generating poetry from images,…
OpenAGI: Build Smarter AI Agents by Combining LLMs with Domain Experts
In today’s AI landscape, building systems that handle real-world complexity often means stitching together language models, specialized tools, APIs, and…
Agent-E: Reliable, Hierarchical Web Automation Powered by Proven Agentic Design Principles
In today’s fast-paced digital landscape, automating browser-based workflows—from filling forms to comparing products—has become essential for both individuals and enterprises…
BEVFusion: Unified Bird’s-Eye View Fusion for Accurate, Efficient Multi-Sensor Perception in Autonomous Driving
Building reliable perception systems for autonomous driving demands more than just collecting data from cameras and LiDARs—it requires intelligently fusing…
Magic Clothing: Generate Photorealistic Outfits with Exact Garment Control and Text Guidance
Magic Clothing is a cutting-edge solution for a long-standing challenge in AI-powered visual content creation: how to generate realistic human…
ESPnet-ST: Open-Source Toolkit for Offline, Simultaneous, and Speech-to-Speech Translation
In an increasingly multilingual and interconnected world, spoken language translation (SLT) has moved beyond academic curiosity to become a critical…
Vocos: High-Quality, Real-Time Neural Vocoder Using Fourier Spectra for Efficient Audio Synthesis
If you’re building or evaluating text-to-speech (TTS), voice cloning, or generative audio systems, the choice of neural vocoder can make…
VideoMamba: Efficient Long- and Short-Term Video Understanding Without the Compute Overhead
Video understanding has long been bottlenecked by two competing demands: capturing fine-grained local motion while simultaneously modeling long-range temporal dependencies…
MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference
MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…