PaperCodex

Multimodal Reasoning

OpenEMMA: Open-Source End-to-End Autonomous Driving with Multimodal Reasoning and Transparent Planning

Autonomous driving research has long been bottlenecked by the need for massive datasets, expensive compute infrastructure, and proprietary end-to-end frameworks.…

01/13/2026 | End-to-End Autonomous Driving, Multimodal Reasoning, Vision-language Models

R1-Onevision: Solve Complex Visual Reasoning Problems with Step-by-Step Multimodal AI

In today’s AI landscape, most multimodal models can describe what’s in an image—but few can reason through it. If your…

01/09/2026 | Multimodal Reasoning, Scientific Diagram Understanding, Visual Question Answering

TinyLVLM-eHub: Fast, Lightweight Evaluation for Large Vision-Language Models Without Heavy Compute

As Large Vision-Language Models (LVLMs) grow increasingly capable—and increasingly complex—evaluating their multimodal reasoning, perception, and reliability has become a significant…

01/09/2026 | Model Evaluation, Multimodal Reasoning, Visual Question Answering

Vision-R1: Boost Multimodal Reasoning in Visual Math and Complex Problem Solving Without Human Annotations

If you’re evaluating multimodal AI systems for tasks that demand deep reasoning—such as solving visual math problems, interpreting charts, or…

01/09/2026 | Interleaving Reasoning, Multimodal Reasoning, Visual Math Problem Solving

MM-Eureka: High-Accuracy Multimodal Reasoning for STEM Education and Technical QA

In the rapidly evolving field of multimodal AI, most models still struggle to combine visual understanding with precise, step-by-step logical…

01/09/2026 | Multimodal Reasoning, Rule-based Reinforcement Learning, STEM Question Answering

Mini-Gemini: Close the Gap with GPT-4V and Gemini Using Open, High-Performance Vision-Language Models

In today’s AI landscape, multimodal systems that understand both images and language are no longer a luxury—they’re a necessity. Yet,…

12/31/2025 | Document Understanding, Multimodal Reasoning, vision-language modeling

Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning

In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…

12/27/2025 | Multimodal Reasoning, vision-language modeling, Visual Question Answering

Bunny: High-Performance Multimodal AI Without the Heavy Compute Burden

Multimodal Large Language Models (MLLMs) are transforming how machines understand and reason about visual content. Yet, their adoption remains out…

12/27/2025 | Efficient Inference, Multimodal Reasoning, vision-language modeling

RPG-DiffusionMaster: Generate Complex, Compositional Images from Text—No Retraining Needed

Text-to-image generation has made remarkable strides, yet even state-of-the-art models like DALL·E 3 or Stable Diffusion XL (SDXL) often stumble…

12/27/2025 | Compositional Image Synthesis, Multimodal Reasoning, Text-to-Image Generation

Qwen3-Omni: One Unified Model for Text, Image, Audio, and Video—Without Compromise

Imagine a single AI model that natively understands and generates responses across text, images, audio, and video—all in real time,…

12/27/2025 | Audio Captioning, Multimodal Reasoning, Real-time Speech Synthesis

Copyright © 2026 PaperCodex.