PaperCodex

Video-language Modeling

VLog: Generate Concise, Structured Video Narrations Using Event-Based Vocabulary Instead of Generic Tokens

Understanding what happens in videos—especially those capturing everyday human activities—is a core challenge in AI. Most existing video-language models generate…

01/09/2026 | Event-based Video Understanding, Video Narration, Video-language Modeling

Video-LLaVA: One Unified Model for Both Image and Video Understanding—No More Modality Silos

If you’re evaluating vision-language models for a project that involves both images and videos, you’ve probably faced a frustrating trade-off:…

12/26/2025 | Multimodal Understanding, Video-language Modeling, Visual Question Answering

Copyright © 2026 PaperCodex.