Awesome Multimodal Reasoning Papers and Source Codes | Page 2 of 4

Kimi-VL: High-Performance Vision-Language Reasoning with Only 2.8B Active Parameters 1122

For teams building real-world AI applications that combine vision and language—whether it’s parsing scanned documents, analyzing instructional videos, or creating…

12/27/2025AI Agent Automation, Multimodal Reasoning, vision-language modeling

GLM-V: Open-Source Vision-Language Models for Real-World Multimodal Reasoning, GUI Agents, and Long-Context Document Understanding 1899

If your team is building AI applications that need to see, reason, and act—like desktop assistants that interpret screenshots, UI…

12/27/2025Multimodal Agents, Multimodal Reasoning, vision-language modeling

HunyuanImage-3.0: The Largest Open-Source Multimodal Image Generator with Native Reasoning and MoE Architecture 2562

HunyuanImage-3.0 is a groundbreaking open-source image generation model developed by Tencent. Unlike traditional diffusion-based approaches, it builds a native multimodal…

12/26/2025Mixture-of-Experts (MoE), Multimodal Reasoning, Text-to-Image Generation

mPLUG-Owl: Modular Multimodal AI for Real-World Vision-Language Tasks 2537

In today’s AI-driven product landscape, the ability to understand both images and text isn’t just a research novelty—it’s a practical…

12/26/2025Instruction-following Multimodal Models, Multimodal Reasoning, Vision-language Understanding

MobileVLM: High-Performance Vision-Language AI That Runs Fast and Privately on Mobile Devices 1314

MobileVLM is a purpose-built vision-language model (VLM) engineered from the ground up for on-device deployment on smartphones and edge hardware.…

12/26/2025Multimodal Reasoning, On-Device AI, Visual Question Answering

Qwen2-VL: Process Any-Resolution Images and Videos with Human-Like Visual Understanding 17241

Vision-language models (VLMs) are increasingly essential for tasks that require joint understanding of images, videos, and text—ranging from document parsing…

12/26/2025Document Understanding, Multimodal Reasoning, vision-language modeling

LISA: Segment Anything by Understanding What You Really Mean 2523

Imagine asking a computer vision system to “segment the object that makes the woman stand higher” or “show me the…

12/26/2025Multimodal Reasoning, Reasoning Segmentation, Visual Question Answering

MME: The First Comprehensive Benchmark to Objectively Evaluate Multimodal Large Language Models 17004

Multimodal Large Language Models (MLLMs) have captured the imagination of researchers and developers alike—promising capabilities like generating poetry from images,…

12/26/2025Multimodal Evaluation, Multimodal Reasoning, vision-language modeling

MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference 2282

MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…

12/26/2025Multimodal Reasoning, Object Hallucination Reduction, Visual Question Answering

LLaVA-CoT: Step-by-Step Visual Reasoning for Reliable, Explainable Multimodal AI 2108

Most vision-language models (VLMs) today can describe what’s in an image—but they often falter when asked to reason about it.…

12/22/2025Explainable AI, Multimodal Reasoning, Visual Question Answering