Large Language Models (LLMs) and multimodal systems increasingly demand high-quality, human-authored supervision data—especially for tasks requiring reasoning, visual understanding, and…
Visual Question Answering
Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning
In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…
Video-LLaVA: One Unified Model for Both Image and Video Understanding—No More Modality Silos
If you’re evaluating vision-language models for a project that involves both images and videos, you’ve probably faced a frustrating trade-off:…
MobileVLM: High-Performance Vision-Language AI That Runs Fast and Privately on Mobile Devices
MobileVLM is a purpose-built vision-language model (VLM) engineered from the ground up for on-device deployment on smartphones and edge hardware…
LISA: Segment Anything by Understanding What You *Really* Mean
Imagine asking a computer vision system to “segment the object that makes the woman stand higher” or “show me the…
MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference
MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…
LLaVA-CoT: Step-by-Step Visual Reasoning for Reliable, Explainable Multimodal AI
Most vision-language models (VLMs) today can describe what’s in an image—but they often falter when asked to reason about it…
Mulberry: Step-by-Step Multimodal Reasoning with o1-Like Reflection for Trustworthy AI Decisions
Traditional multimodal large language models (MLLMs) often produce answers without revealing how they got there—especially when dealing with complex questions…
DeepSeek-VL2: High-Performance Vision-Language Understanding with Efficient Mixture-of-Experts Architecture
DeepSeek-VL2 is an open-source, advanced vision-language model (VLM) built on a Mixture-of-Experts (MoE) architecture, engineered for robust multimodal understanding across…
HealthGPT: Unified Medical Vision-Language Understanding and Generation in a Single Model
HealthGPT is a cutting-edge Medical Large Vision-Language Model (Med-LVLM) designed to tackle a long-standing challenge in AI for healthcare: the…