Video understanding has long been a bottleneck for multimodal large language models (MLLMs). While models can recognize objects or scenes…
Multimodal Reinforcement Learning
DeepEyes: Enable Vision-Language Models to “Think with Images” and Solve Complex Visual Reasoning Tasks
Most modern Vision-Language Models (VLMs) treat images as static inputs—processed once, then reasoned about using purely text-based logic. But humans…
SoundMind: Boost Audio-Language Models with Reinforcement-Learned Logical Reasoning
Most large language models (LLMs) today excel at reasoning over text—but what happens when the input includes sounds? Can an…