When it comes to deploying multimodal large language models (MLLMs) in real-world applications—especially on cost-sensitive or edge devices—lightweight models are…
Multimodal Reasoning
InternGPT: Solve Vision-Centric Tasks with Clicks, Scribbles, and ChatGPT-Level Reasoning 3221
In today’s AI landscape, large language models (LLMs) like ChatGPT have transformed how we interact with software—through natural language. But…
MiniCPM-V 4.5: GPT-4o-Level Vision Intelligence in an 8B Open-Source Model for Real-World Multimodal Tasks 22368
Multimodal Large Language Models (MLLMs) promise to transform how machines understand images, videos, and text—but most top-performing models come with…
Mini-InternVL: Achieve 90% of Multimodal Performance with Just 5% of Model Size for Edge and Consumer Deployments 9328
In an era where multimodal large language models (MLLMs) are rapidly advancing, a critical barrier remains: most high-performing vision-language models…
VLM-R1: Boost Visual Reasoning and Generalization with R1-Style Reinforcement Learning for Vision-Language Models 5743
If you’re working on vision-language tasks that require precise reasoning—like identifying objects based on natural language descriptions, detecting UI defects…
Ovis: Align Vision and Language Embeddings for Superior Multimodal Reasoning Without Proprietary Lock-in 1373
Multimodal Large Language Models (MLLMs) are increasingly vital for tasks that bridge vision and language—yet many struggle to truly fuse…
UFO: Automate Multi-App Windows Workflows with Natural Language and Zero Human Intervention 7659
Imagine telling your computer what you want it to do—like “Summarize this PDF, email the summary to my manager, and…
VideoRAG: Unlock Long-Form Video Understanding with Retrieval-Augmented Generation for AI-Powered Insights 1356
Imagine being able to ask questions like “What did the professor say about quantum entanglement in Lecture 3?” or “Show…
MMaDA: One Unified Model for Text Reasoning, Multimodal Understanding, and Image Generation 1518
Imagine running a single model that can answer complex reasoning questions, understand images and text together, and generate high-quality images…
UltraRAG: Build Adaptive, Multimodal RAG Systems Without Writing Complex Code 2325
Retrieval-Augmented Generation (RAG) has become a cornerstone technique for grounding large language models (LLMs) in real-world knowledge. However, building effective…