Awesome Multimodal Reasoning Papers and Source Codes

OpenEMMA: Open-Source End-to-End Autonomous Driving with Multimodal Reasoning and Transparent Planning 873

Autonomous driving research has long been bottlenecked by the need for massive datasets, expensive compute infrastructure, and proprietary end-to-end frameworks.…

01/13/2026End-to-End Autonomous Driving, Multimodal Reasoning, Vision-language Models

R1-Onevision: Solve Complex Visual Reasoning Problems with Step-by-Step Multimodal AI 569

In today’s AI landscape, most multimodal models can describe what’s in an image—but few can reason through it. If your…

01/09/2026Multimodal Reasoning, Scientific Diagram Understanding, Visual Question Answering

TinyLVLM-eHub: Fast, Lightweight Evaluation for Large Vision-Language Models Without Heavy Compute 539

As Large Vision-Language Models (LVLMs) grow increasingly capable—and increasingly complex—evaluating their multimodal reasoning, perception, and reliability has become a significant…

01/09/2026Model Evaluation, Multimodal Reasoning, Visual Question Answering

$Vision-R1: Boost Multimodal Reasoning in Visual Math and Complex Problem Solving Without Human Annotations$

Vision-R1: Boost Multimodal Reasoning in Visual Math and Complex Problem Solving Without Human Annotations 710

If you’re evaluating multimodal AI systems for tasks that demand deep reasoning—such as solving visual math problems, interpreting charts, or…

01/09/2026Interleaving Reasoning, Multimodal Reasoning, Visual Math Problem Solving

MM-Eureka: High-Accuracy Multimodal Reasoning for STEM Education and Technical QA 737

In the rapidly evolving field of multimodal AI, most models still struggle to combine visual understanding with precise, step-by-step logical…

01/09/2026Multimodal Reasoning, Rule-based Reinforcement Learning, STEM Question Answering

Mini-Gemini: Close the Gap with GPT-4V and Gemini Using Open, High-Performance Vision-Language Models 3323

In today’s AI landscape, multimodal systems that understand both images and language are no longer a luxury—they’re a necessity. Yet,…

12/31/2025Document Understanding, Multimodal Reasoning, vision-language modeling

Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning 6422

In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…

12/27/2025Multimodal Reasoning, vision-language modeling, Visual Question Answering