PaperCodex

Cosmos-Reason1: Enable Physical AI Agents to Reason Like Humans Using Physics, Space, and Time 746

Building intelligent systems that can understand and act in the real physical world remains one of the toughest challenges in…

01/09/2026 Embodied AI, Physical Reasoning, Vision-language Modeling

UniVLA: Enable Robots to Generalize Across Embodiments with Minimal Data and Compute 771

Imagine deploying a single robot policy that works across different hardware—robotic arms, mobile bases, or even human-inspired setups—without retraining from…

01/09/2026 Cross-embodiment Generalization, Robotic Manipulation, Vision-language-action

Radial Attention: Generate 4× Longer Videos 3.7× Faster with O(n log n) Sparse Attention 519

Generating high-quality, long-form videos with diffusion models remains one of the most computationally demanding tasks in generative AI. Standard attention…

01/09/2026 Long-context Modeling, Sparse Attention, Video Generation

DiffuCoder: Generate Better Code with Iterative, Non-Autoregressive Diffusion Models 745

If you’re evaluating next-generation code generation tools, you’ve likely worked with autoregressive (AR) large language models—systems that build code one…

01/09/2026 Code Generation, Diffusion Language Models, Reinforcement Learning For Code

Seg-Zero: Interpretable, Zero-Shot Image Segmentation with Reasoning Chains and Reinforcement Learning 527

Image segmentation has long been a cornerstone of computer vision—yet traditional approaches often behave like black boxes, especially when faced…

01/09/2026 Interpretable Vision Models, Visual Reasoning, Zero-shot Segmentation

Vision-R1: Boost Multimodal Reasoning in Visual Math and Complex Problem Solving Without Human Annotations 710

If you’re evaluating multimodal AI systems for tasks that demand deep reasoning—such as solving visual math problems, interpreting charts, or…

01/09/2026 Interleaving Reasoning, Multimodal Reasoning, Visual Math Problem Solving

Fin-R1: A 7B Financial Reasoning LLM That Outperforms Larger Models on Complex Finance Tasks 688

Fin-R1 is a purpose-built reasoning large language model (LLM) designed specifically for the financial domain. Despite having only 7 billion…

01/09/2026 Financial Reasoning, Quantitative Finance, Regulatory Compliance

MM-Eureka: High-Accuracy Multimodal Reasoning for STEM Education and Technical QA 737

In the rapidly evolving field of multimodal AI, most models still struggle to combine visual understanding with precise, step-by-step logical…

01/09/2026 Multimodal Reasoning, Rule-based Reinforcement Learning, STEM Question Answering

LBM: One-Step, Multi-Task Image Translation with State-of-the-Art Speed and Simplicity 728

Image-to-image translation is a foundational capability in computer vision, enabling applications from photo editing to 3D scene understanding. Yet many…

01/09/2026 Depth Estimation, Image-to-image Translation, Object Relighting

SpatialTrackerV2: Real-Time 3D Point Tracking from Monocular Video—Fast, Accurate, and End-to-End 798

If you’ve ever tried to track 3D points in a monocular video—say, for robotics perception, AR/VR content creation, or sports…

01/09/2026 3D Point Tracking, Dynamic Scene Reconstruction, Monocular Depth Estimation