Building intelligent systems that can understand and act in the real physical world remains one of the toughest challenges in…
UniVLA: Enable Robots to Generalize Across Embodiments with Minimal Data and Compute
Imagine deploying a single robot policy that works across different hardware—robotic arms, mobile bases, or even human-inspired setups—without retraining from…
Radial Attention: Generate 4× Longer Videos 3.7× Faster with O(n log n) Sparse Attention
Generating high-quality, long-form videos with diffusion models remains one of the most computationally demanding tasks in generative AI. Standard attention…
DiffuCoder: Generate Better Code with Iterative, Non-Autoregressive Diffusion Models
If you’re evaluating next-generation code generation tools, you’ve likely worked with autoregressive (AR) large language models—systems that build code one…
Seg-Zero: Interpretable, Zero-Shot Image Segmentation with Reasoning Chains and Reinforcement Learning
Image segmentation has long been a cornerstone of computer vision—yet traditional approaches often behave like black boxes, especially when faced…
Vision-R1: Boost Multimodal Reasoning in Visual Math and Complex Problem Solving Without Human Annotations
If you’re evaluating multimodal AI systems for tasks that demand deep reasoning—such as solving visual math problems, interpreting charts, or…
Fin-R1: A 7B Financial Reasoning LLM That Outperforms Larger Models on Complex Finance Tasks
Fin-R1 is a purpose-built reasoning large language model (LLM) designed specifically for the financial domain. Despite having only 7 billion…
MM-Eureka: High-Accuracy Multimodal Reasoning for STEM Education and Technical QA
In the rapidly evolving field of multimodal AI, most models still struggle to combine visual understanding with precise, step-by-step logical…
LBM: One-Step, Multi-Task Image Translation with State-of-the-Art Speed and Simplicity
Image-to-image translation is a foundational capability in computer vision, enabling applications from photo editing to 3D scene understanding. Yet many…
SpatialTrackerV2: Real-Time 3D Point Tracking from Monocular Video—Fast, Accurate, and End-to-End
If you’ve ever tried to track 3D points in a monocular video—say, for robotics perception, AR/VR content creation, or sports…