Awesome Multimodal Reinforcement Learning Papers and Source Code

Video-R1: Boost Video Reasoning in MLLMs with Efficient RL—Outperforming GPT-4o on Spatial Tasks

Video understanding has long been a bottleneck for multimodal large language models (MLLMs). While models can recognize objects or scenes…

01/09/2026 | Multimodal Reinforcement Learning, Temporal Modeling, Video Reasoning

DeepEyes: Enable Vision-Language Models to “Think with Images” and Solve Complex Visual Reasoning Tasks

Most modern Vision-Language Models (VLMs) treat images as static inputs—processed once, then reasoned about using purely text-based logic. But humans…

01/09/2026 | Multimodal Reinforcement Learning, Vision-Language Modeling, Visual Reasoning

SoundMind: Boost Audio-Language Models with Reinforcement-Learned Logical Reasoning

Most large language models (LLMs) today excel at reasoning over text—but what happens when the input includes sounds? Can an…

12/19/2025 | Audio-Language Reasoning, Logical Reasoning in AI, Multimodal Reinforcement Learning