Awesome Multimodal Understanding Papers and Source Codes

Chat-UniVi: One Unified Model for Image and Video Understanding—No More Separate Systems Needed 939

In today’s AI landscape, multimodal systems that understand both images and videos are increasingly essential—but most solutions force you to…

01/13/2026 | Multimodal Understanding, Video Question Answering, Visual Reasoning

Qwen-Audio: Unified Audio-Language Understanding for Speech, Music, and Environmental Sounds Without Task-Specific Tuning 1848

Audio is one of the richest yet most fragmented modalities in artificial intelligence. Traditional systems often require separate models for…

12/27/2025 | Audio-Language Modeling, Multimodal Understanding, Universal Audio Recognition

Mini-Omni2: Unified Vision, Speech, and Text Interaction Without External ASR/TTS Pipelines 1847

In today’s open-source AI landscape, building truly multimodal applications often means stitching together separate models for vision, speech recognition (ASR),…

12/26/2025 | End-to-end Voice Assistant, Multimodal Understanding, Speech-to-speech Interaction

Video-LLaVA: One Unified Model for Both Image and Video Understanding—No More Modality Silos 3417

If you’re evaluating vision-language models for a project that involves both images and videos, you’ve probably faced a frustrating trade-off:…

12/26/2025 | Multimodal Understanding, Video-language Modeling, Visual Question Answering

InternLM-XComposer: Generate Rich Text-Image Content and Understand High-Res Visuals with Open, Commercially Free AI 2909

For technical decision makers evaluating multimodal AI, choosing between closed-source APIs and open alternatives often means trading off control,…

12/22/2025 | Multimodal Understanding, Text-image Composition, Vision-language Modeling

Multimodal Understanding

SPHINX-X: Build Scalable Multimodal AI Faster with Unified Training, Diverse Data, and Flexible Model Sizes 2794

Show-o: One Unified Transformer for Multimodal Understanding and Generation Across Text, Images, and Videos 1809

Dolphin: Lightweight, Accurate Document Image Parsing for Real-World Mixed-Content Pages 7904

MinerU: High-Precision Open-Source Document Parsing for Real-World PDFs, Tables, and Formulas 50296