Awesome Video-language Modeling Papers and Source Code

VLog: Generate Concise, Structured Video Narrations Using Event-Based Vocabulary Instead of Generic Tokens 578

Understanding what happens in videos—especially those capturing everyday human activities—is a core challenge in AI. Most existing video-language models generate…

01/09/2026 | Event-based Video Understanding, Video Narration, Video-language Modeling

Video-LLaVA: One Unified Model for Both Image and Video Understanding—No More Modality Silos 3417

If you’re evaluating vision-language models for a project that involves both images and videos, you’ve probably faced a frustrating trade-off:…

12/26/2025 | Multimodal Understanding, Video-language Modeling, Visual Question Answering