PaperCodex

Multimodal Reasoning

Mini-Monkey: Fixing Fragmented Vision in Lightweight Multimodal Models with Smart Multi-Scale Cropping

When it comes to deploying multimodal large language models (MLLMs) in real-world applications—especially on cost-sensitive or edge devices—lightweight models are…

12/22/2025 | Document Understanding, Multimodal Reasoning, Optical Character Recognition (OCR)

InternGPT: Solve Vision-Centric Tasks with Clicks, Scribbles, and ChatGPT-Level Reasoning

In today’s AI landscape, large language models (LLMs) like ChatGPT have transformed how we interact with software—through natural language. But…

12/19/2025 | Interactive Image Editing, Multimodal Reasoning, vision-language modeling

MiniCPM-V 4.5: GPT-4o-Level Vision Intelligence in an 8B Open-Source Model for Real-World Multimodal Tasks

Multimodal Large Language Models (MLLMs) promise to transform how machines understand images, videos, and text—but most top-performing models come with…

12/19/2025 | Efficient MLLM Deployment, Multimodal Reasoning, Vision-language Understanding

Mini-InternVL: Achieve 90% of Multimodal Performance with Just 5% of Model Size for Edge and Consumer Deployments

In an era where multimodal large language models (MLLMs) are rapidly advancing, a critical barrier remains: most high-performing vision-language models…

12/18/2025 | Edge AI, Multimodal Reasoning, vision-language modeling

VLM-R1: Boost Visual Reasoning and Generalization with R1-Style Reinforcement Learning for Vision-Language Models

If you’re working on vision-language tasks that require precise reasoning—like identifying objects based on natural language descriptions, detecting UI defects…

12/18/2025 | Multimodal Reasoning, Open-Vocabulary Detection, Referring Expression Comprehension

Ovis: Align Vision and Language Embeddings for Superior Multimodal Reasoning Without Proprietary Lock-in

Multimodal Large Language Models (MLLMs) are increasingly vital for tasks that bridge vision and language—yet many struggle to truly fuse…

12/17/2025 | Multimodal Fine-tuning, Multimodal Reasoning, Vision-language Alignment

UFO: Automate Multi-App Windows Workflows with Natural Language and Zero Human Intervention

Imagine telling your computer what you want it to do—like “Summarize this PDF, email the summary to my manager, and…

12/17/2025 | Cross-application Task Execution, GUI Automation, Multimodal Reasoning

VideoRAG: Unlock Long-Form Video Understanding with Retrieval-Augmented Generation for AI-Powered Insights

Imagine being able to ask questions like “What did the professor say about quantum entanglement in Lecture 3?” or “Show…

12/17/2025 | Multimodal Reasoning, Retrieval-Augmented Generation, Video Understanding

MMaDA: One Unified Model for Text Reasoning, Multimodal Understanding, and Image Generation

Imagine running a single model that can answer complex reasoning questions, understand images and text together, and generate high-quality images…

12/17/2025 | Diffusion Language Models, Multimodal Reasoning, Text-to-Image Generation

UltraRAG: Build Adaptive, Multimodal RAG Systems Without Writing Complex Code

Retrieval-Augmented Generation (RAG) has become a cornerstone technique for grounding large language models (LLMs) in real-world knowledge. However, building effective…

12/16/2025 | Adaptive Knowledge Integration, Multimodal Reasoning, Retrieval-Augmented Generation

Copyright © 2026 PaperCodex.