
PaperCodex


Visual Question Answering

FlipVQA-Miner: Automatically Extract High-Quality Visual QA Pairs from Textbooks for Reliable LLM Training

Large Language Models (LLMs) and multimodal systems increasingly demand high-quality, human-authored supervision data—especially for tasks requiring reasoning, visual understanding, and…

01/04/2026 · Educational Data Mining, Instruction Tuning, Visual Question Answering

Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning

In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…

12/27/2025 · Multimodal Reasoning, Vision-Language Modeling, Visual Question Answering

Video-LLaVA: One Unified Model for Both Image and Video Understanding—No More Modality Silos

If you’re evaluating vision-language models for a project that involves both images and videos, you’ve probably faced a frustrating trade-off:…

12/26/2025 · Multimodal Understanding, Video-Language Modeling, Visual Question Answering

MobileVLM: High-Performance Vision-Language AI That Runs Fast and Privately on Mobile Devices

MobileVLM is a purpose-built vision-language model (VLM) engineered from the ground up for on-device deployment on smartphones and edge hardware.…

12/26/2025 · Multimodal Reasoning, On-Device AI, Visual Question Answering

LISA: Segment Anything by Understanding What You *Really* Mean

Imagine asking a computer vision system to “segment the object that makes the woman stand higher” or “show me the…

12/26/2025 · Multimodal Reasoning, Reasoning Segmentation, Visual Question Answering

MoE-LLaVA: High-Performance Vision-Language Understanding with Sparse, Efficient Inference

MoE-LLaVA (Mixture of Experts for Large Vision-Language Models) redefines efficiency in multimodal AI by delivering performance that rivals much larger…

12/26/2025 · Multimodal Reasoning, Object Hallucination Reduction, Visual Question Answering

LLaVA-CoT: Step-by-Step Visual Reasoning for Reliable, Explainable Multimodal AI

Most vision-language models (VLMs) today can describe what’s in an image—but they often falter when asked to reason about it.…

12/22/2025 · Explainable AI, Multimodal Reasoning, Visual Question Answering

Mulberry: Step-by-Step Multimodal Reasoning with o1-Like Reflection for Trustworthy AI Decisions

Traditional multimodal large language models (MLLMs) often produce answers without revealing how they got there—especially when dealing with complex questions…

12/22/2025 · Interpretable AI, Multimodal Reasoning, Visual Question Answering

DeepSeek-VL2: High-Performance Vision-Language Understanding with Efficient Mixture-of-Experts Architecture

DeepSeek-VL2 is an open-source, advanced vision-language model (VLM) built on a Mixture-of-Experts (MoE) architecture, engineered for robust multimodal understanding across…

12/18/2025 · Document Understanding, Visual Grounding, Visual Question Answering

HealthGPT: Unified Medical Vision-Language Understanding and Generation in a Single Model

HealthGPT is a cutting-edge Medical Large Vision-Language Model (Med-LVLM) designed to tackle a long-standing challenge in AI for healthcare: the…

12/17/2025 · Medical Image Generation, Medical Vision-Language Modeling, Visual Question Answering