
PaperCodex


vision-language modeling

Emu3.5: A Native Multimodal World Model for Unified Vision-Language Generation and Reasoning


Imagine a single AI model that doesn’t just “see” or “read”—but seamlessly blends images and text in both input and…

01/04/2026 · Multimodal Generation, vision-language modeling, World Modeling

AnomalyGPT: Industrial Anomaly Detection Without Manual Thresholds or Labeled Anomalies


In industrial quality control, detecting defects—like cracks in concrete, scratches on metal, or deformities in packaged goods—is critical. Yet traditional…

01/04/2026 · Few-shot Learning, Industrial Anomaly Detection, vision-language modeling

Mini-Gemini: Close the Gap with GPT-4V and Gemini Using Open, High-Performance Vision-Language Models


In today’s AI landscape, multimodal systems that understand both images and language are no longer a luxury—they’re a necessity. Yet,…

12/31/2025 · Document Understanding, Multimodal Reasoning, vision-language modeling

Qwen-VL: Open-Source Vision-Language AI for Text Reading, Object Grounding, and Multimodal Reasoning


In the rapidly evolving landscape of multimodal artificial intelligence, developers and technical decision-makers need models that go beyond basic image…

12/27/2025 · Multimodal Reasoning, vision-language modeling, Visual Question Answering

Bunny: High-Performance Multimodal AI Without the Heavy Compute Burden


Multimodal Large Language Models (MLLMs) are transforming how machines understand and reason about visual content. Yet, their adoption remains out…

12/27/2025 · Efficient Inference, Multimodal Reasoning, vision-language modeling

Kimi-VL: High-Performance Vision-Language Reasoning with Only 2.8B Active Parameters


For teams building real-world AI applications that combine vision and language—whether it’s parsing scanned documents, analyzing instructional videos, or creating…

12/27/2025 · AI Agent Automation, Multimodal Reasoning, vision-language modeling

GLM-V: Open-Source Vision-Language Models for Real-World Multimodal Reasoning, GUI Agents, and Long-Context Document Understanding


If your team is building AI applications that need to see, reason, and act—like desktop assistants that interpret screenshots, UI…

12/27/2025 · Multimodal Agents, Multimodal Reasoning, vision-language modeling

Sa2VA: Unified Vision-Language Model for Accurate Referring Video Object Segmentation from Natural Language


Sa2VA represents a significant leap forward in multimodal AI by seamlessly integrating the strengths of SAM2—Meta’s state-of-the-art video object segmentation…

12/27/2025 · Multimodal Grounding, Referring Video Object Segmentation, vision-language modeling

Qwen2-VL: Process Any-Resolution Images and Videos with Human-Like Visual Understanding


Vision-language models (VLMs) are increasingly essential for tasks that require joint understanding of images, videos, and text—ranging from document parsing…

12/26/2025 · Document Understanding, Multimodal Reasoning, vision-language modeling

MME: The First Comprehensive Benchmark to Objectively Evaluate Multimodal Large Language Models


Multimodal Large Language Models (MLLMs) have captured the imagination of researchers and developers alike—promising capabilities like generating poetry from images,…

12/26/2025 · Multimodal Evaluation, Multimodal Reasoning, vision-language modeling

Copyright © 2026 PaperCodex.
