If your team is building AI applications that need to see, reason, and act—like desktop assistants that interpret screenshots, UI…
Vision-Language Modeling
Sa2VA: Unified Vision-Language Model for Accurate Referring Video Object Segmentation from Natural Language
Sa2VA advances multimodal AI by integrating the strengths of SAM2, Meta's state-of-the-art video object segmentation…
Qwen2-VL: Process Any-Resolution Images and Videos with Human-Like Visual Understanding
Vision-language models (VLMs) are increasingly essential for tasks that require joint understanding of images, videos, and text—ranging from document parsing…
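As an illustration of how a model like this is typically called, the sketch below runs single-image Q&A with Qwen2-VL through the Hugging Face transformers integration. The model ID, image path, and prompt are placeholders, and the exact API can vary with the transformers version; treat it as a minimal sketch rather than the project's official quickstart.

```python
# Minimal sketch: single-image Q&A with Qwen2-VL via Hugging Face transformers.
# Assumes transformers >= 4.45 and the qwen-vl-utils helper package; the model ID,
# image path, and prompt below are illustrative placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/document_page.png"},
        {"type": "text", "text": "Summarize the key fields in this document."},
    ],
}]

# Build the chat prompt and extract the vision inputs at their native resolution.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the answer.
answer_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```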
MME: The First Comprehensive Benchmark to Objectively Evaluate Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have captured the imagination of researchers and developers alike—promising capabilities like generating poetry from images,…
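For readers unfamiliar with how MME turns yes/no answers into a score, the sketch below reproduces the scoring convention commonly attributed to the benchmark: each subtask is scored as accuracy plus "accuracy+" (the share of images where both paired questions are answered correctly), each as a percentage, so a subtask tops out at 200 points. The function and data here are illustrative, not MME's official tooling.

```python
# Illustrative scoring helper for an MME-style subtask (not MME's official code).
# Each image carries two yes/no questions; the subtask score is
# accuracy (per question) + accuracy+ (per image, both questions correct),
# both expressed as percentages, for a maximum of 200 per subtask.
from collections import defaultdict

def score_subtask(records):
    """records: list of (image_id, predicted_answer, ground_truth) tuples."""
    per_image = defaultdict(list)
    for image_id, pred, gold in records:
        per_image[image_id].append(pred.strip().lower() == gold.strip().lower())

    flat = [ok for results in per_image.values() for ok in results]
    accuracy = 100.0 * sum(flat) / len(flat)
    accuracy_plus = 100.0 * sum(all(r) for r in per_image.values()) / len(per_image)
    return accuracy + accuracy_plus

# Toy example: two images, one answered perfectly, one half right.
records = [
    ("img_1", "yes", "yes"), ("img_1", "no", "no"),
    ("img_2", "yes", "yes"), ("img_2", "yes", "no"),
]
print(score_subtask(records))  # 75.0 accuracy + 50.0 accuracy+ = 125.0
```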
InternLM-XComposer: Generate Rich Text-Image Content and Understand High-Res Visuals with Open, Commercially Free AI
For technical decision makers evaluating multimodal AI, choosing between closed-source APIs and open alternatives often means trading off control,…
InternGPT: Solve Vision-Centric Tasks with Clicks, Scribbles, and ChatGPT-Level Reasoning
In today’s AI landscape, large language models (LLMs) like ChatGPT have transformed how we interact with software—through natural language. But…
SPHINX-X: Build Scalable Multimodal AI Faster with Unified Training, Diverse Data, and Flexible Model Sizes
SPHINX-X is a next-generation family of Multimodal Large Language Models (MLLMs) designed to streamline the development, training, and deployment of…
Caption Anything: Interactive, Multimodal Image Captioning Controlled by You
Traditional image captioning systems produce static, one-size-fits-all descriptions—often generic, inflexible, and disconnected from actual user intent. What if you could…
FastVLM: High-Resolution Vision-Language Inference with 85× Faster Time-to-First-Token and Minimal Compute Overhead
Vision Language Models (VLMs) are increasingly central to real-world applications—from mobile assistants that read documents to AI systems that interpret…
Mini-InternVL: Achieve 90% of Multimodal Performance with Just 5% of Model Size for Edge and Consumer Deployments
In an era where multimodal large language models (MLLMs) are rapidly advancing, a critical barrier remains: most high-performing vision-language models…