MME: The First Comprehensive Benchmark to Objectively Evaluate Multimodal Large Language Models

Paper: MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Code: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

Multimodal Large Language Models (MLLMs) have captured the imagination of researchers and developers alike—promising capabilities like generating poetry from images, answering complex visual questions, or even reasoning about scenes in real-world applications. But behind the impressive demos often lies a troubling question: How do you actually know which model performs best?

Until recently, there was no standardized, comprehensive way to fairly compare MLLMs across a wide range of visual and linguistic abilities. Most evaluations relied on cherry-picked examples or repurposed datasets that risk data leakage, making comparisons unreliable.

Enter MME (Multimodal Large Language Model Evaluation)—the first comprehensive benchmark explicitly designed to measure both perception and cognition in MLLMs. Developed by a team of researchers and actively maintained on GitHub, MME provides a transparent, repeatable, and bias-resistant evaluation framework that cuts through marketing claims and reveals real-world performance differences between models like GPT-4V, Gemini, LLaVA, Qwen-VL, and dozens of others.

If you’re a technical decision-maker—whether in AI product development, academic research, or enterprise engineering—you need a trustworthy way to assess, select, and improve multimodal models. MME gives you exactly that.

Why Traditional MLLM Evaluations Fall Short

Many existing benchmarks suffer from three key limitations:

  1. Data leakage: Reusing images or questions from public datasets (e.g., VQA, COCO) means models may have already seen the test data during training, inflating scores artificially.
  2. Prompt engineering bias: Performance often depends more on how cleverly you phrase your prompt than on the model’s actual capability, making comparisons unfair.
  3. Narrow scope: Many tests focus only on object recognition or simple QA, ignoring higher-order skills like numerical reasoning, code generation, or text translation from images.

MME directly addresses each of these issues—making it uniquely suited for rigorous, real-world model assessment.

Core Features That Make MME Uniquely Reliable

1. 14 Subtasks Spanning Perception and Cognition

MME evaluates models across 14 distinct subtasks, grouped into two categories:

  • Perception: Tests low-level and mid-level visual understanding:
    • Existence (Is an object present?)
    • Count (How many objects?)
    • Position (Where is the object?)
    • Color, Scene, Landmark, Artwork, OCR, Poster recognition, and Celebrity identification
  • Cognition: Assesses high-level reasoning and language skills:
    • Commonsense reasoning
    • Numerical calculation
    • Text translation (e.g., translating text in an image)
    • Code reasoning (e.g., interpreting or generating code from diagrams)

This dual-layer design ensures models are tested not just on what they see, but on how they think about what they see.
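
For quick reference, the taxonomy above maps to a small data structure. The sketch below uses illustrative Python names; the folder names in the released dataset may differ:

```python
# The 14 MME subtasks, grouped by category.
# Names follow the list above; the dataset's directory names may differ slightly.
MME_SUBTASKS = {
    "perception": [
        "existence", "count", "position", "color",
        "scene", "landmark", "artwork", "OCR",
        "posters", "celebrity",
    ],
    "cognition": [
        "commonsense_reasoning", "numerical_calculation",
        "text_translation", "code_reasoning",
    ],
}

assert sum(len(v) for v in MME_SUBTASKS.values()) == 14
```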

2. Human-Written Instructions to Prevent Data Leakage

All instruction-answer pairs in MME are manually designed from scratch—not sourced from existing datasets. This eliminates the risk of models having encountered the evaluation data during training, ensuring clean, unbiased results.

3. Standardized, Concise Prompts for Fair Comparisons

Every model receives the exact same instruction format: a concise question followed by the fixed request “Please answer yes or no.” (e.g., “Is there a car in this image? Please answer yes or no.”). Keeping prompts this short and uniform removes prompt engineering as a variable and makes answers trivial to score automatically, enabling true apples-to-apples comparisons.
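
Each image is paired with two such instructions, one whose ground-truth answer is “yes” and one whose answer is “no”. A representative pair might look like the following (the wording and file name are illustrative, not quoted from the dataset):

```python
# Illustrative MME-style instruction pair for one image (Existence subtask).
# File name and questions are made up for illustration; the dataset ships its own.
instruction_pair = [
    {
        "image": "existence/0001.jpg",
        "question": "Is there a motorcycle in this image? Please answer yes or no.",
        "answer": "Yes",
    },
    {
        "image": "existence/0001.jpg",
        "question": "Is there a giraffe in this image? Please answer yes or no.",
        "answer": "No",
    },
]
```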

4. Quantitative Scoring and Public Leaderboards

MME assigns a numeric score (0–200) for each subtask, with a maximum total of 2,800 (2,000 for perception + 800 for cognition). The benchmark includes a public leaderboard with results for 50+ MLLMs, updated regularly as new models are added. This transparency allows anyone to audit, reproduce, and contextualize performance claims.

For example, as of mid-2024:

  • Qwen-VL-Max leads in overall perception (1,790/2,000)
  • MiniCPM-V 2.6 excels in cognition (697.86/800), particularly in OCR and numerical tasks
  • GPT-4V, despite its reputation, scores 0 on the Celebrity subtask—because it refuses to identify individuals, highlighting how MME reveals behavioral limitations beyond raw accuracy
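
Under the hood, each subtask's 0–200 score combines two measures from the MME paper: per-question accuracy and a stricter per-image “accuracy+” that only credits an image when both of its yes/no questions are answered correctly. A minimal scoring sketch, assuming a hypothetical list-of-dicts result format (the official repo provides its own evaluation script), might look like this:

```python
from collections import defaultdict

def subtask_score(results):
    """results: list of dicts with keys 'image', 'gt' ('yes'/'no'), and 'pred'."""
    per_image = defaultdict(list)
    for r in results:
        # Count a prediction as correct if it starts with the ground-truth word.
        correct = r["pred"].strip().lower().startswith(r["gt"].strip().lower())
        per_image[r["image"]].append(correct)

    n_questions = sum(len(v) for v in per_image.values())
    accuracy = 100.0 * sum(c for v in per_image.values() for c in v) / n_questions
    # accuracy+ credits an image only when BOTH of its questions are correct.
    accuracy_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return accuracy + accuracy_plus  # 0 to 200 per subtask


demo = [
    {"image": "a.jpg", "gt": "yes", "pred": "Yes, there is."},
    {"image": "a.jpg", "gt": "no",  "pred": "Yes"},   # one of two questions wrong
    {"image": "b.jpg", "gt": "yes", "pred": "Yes"},
    {"image": "b.jpg", "gt": "no",  "pred": "No"},
]
print(subtask_score(demo))  # 75.0 + 50.0 = 125.0
```

The strict accuracy+ term is what penalizes models that blindly answer “yes”: degenerate answering collapses the per-image component even when per-question accuracy looks acceptable.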

Practical Use Cases for Technical Decision-Makers

MME isn’t just an academic exercise—it solves real problems in industry and research:

  • Model selection for product teams: Choosing between LLaVA-NeXT and Qwen-VL for a visual assistant? MME shows Qwen-VL outperforms in OCR and landmark recognition, while LLaVA may be better in commonsense reasoning.
  • Research validation: Before claiming a new architecture improves MLLM performance, test it on MME to prove gains are genuine and not artifacts of prompt tuning.
  • Debugging model weaknesses: If your model fails in “Position” tasks, you know to focus on spatial reasoning datasets during fine-tuning.
  • Benchmarking against SOTA: The leaderboard lets you instantly see how your in-house model stacks up against GPT-4V, Gemini Pro, InternVL, and others—without running them yourself.

How to Get Started with MME

Using MME in your workflow is straightforward:

  1. Access the resources: Visit the official GitHub repository at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
  2. Download the dataset: It includes curated images and human-written instruction-answer pairs across all 14 subtasks
  3. Run inference: Feed your MLLM the standardized prompts along with corresponding images
  4. Score responses: Use the provided evaluation script to auto-calculate scores per subtask
  5. Compare: Benchmark your results against the public leaderboard

No training is required—only inference and evaluation. This makes MME ideal for quick validation cycles.
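
As a concrete starting point, the inference step (item 3 above) can be a simple loop over each subtask's questions that records raw answers for the scoring step. The sketch below assumes a hypothetical tab-separated question file and a generic `model.generate` wrapper; adapt both to the files you download and to your model's actual API:

```python
from pathlib import Path

# Hypothetical layout: <subtask_dir>/questions.txt contains lines of
# "<image_name>\t<question>\t<ground_truth>", with images under <subtask_dir>/images/.
# `model.generate(image=..., prompt=...)` stands in for your MLLM's inference call.
def run_subtask(model, subtask_dir: Path, out_file: Path) -> None:
    rows = []
    for line in (subtask_dir / "questions.txt").read_text().splitlines():
        image_name, question, answer = line.split("\t")
        prediction = model.generate(image=subtask_dir / "images" / image_name,
                                    prompt=question)
        # Keep image, question, ground truth, and prediction together so the
        # evaluation script can compute accuracy and accuracy+ per subtask.
        rows.append("\t".join([image_name, question, answer, prediction]))
    out_file.write_text("\n".join(rows) + "\n")
```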

Important Limitations to Consider

While powerful, MME has boundaries:

  • Static images only: The core MME benchmark focuses on single images. For video understanding, use Video-MME (a newer extension supporting videos up to 1 hour).
  • English-only instructions: All prompts are in English, so performance on multilingual tasks isn’t assessed.
  • Fixed dataset size: The current dataset, while high-quality, isn’t exhaustive. It’s best used as a diagnostic tool, not a replacement for application-specific testing.
  • No real-time interaction: MME evaluates static responses, not conversational or iterative reasoning.

That said, the MME team has already expanded the ecosystem with MME-RealWorld (for high-resolution, real-world scenarios) and Video-MME, showing a commitment to evolving with the field.

Summary

MME solves a critical problem in the multimodal AI landscape: how to fairly, reliably, and comprehensively evaluate MLLMs beyond flashy demos. By combining human-curated data, standardized prompts, and a balanced mix of perception and cognition tasks, it offers engineers, researchers, and product leaders an unprecedented window into model capabilities—and limitations.

If you’re choosing a foundation model for a vision-language application, validating a research claim, or debugging a multimodal pipeline, MME gives you the evidence you need to make confident, data-driven decisions. In a field crowded with hype, it’s a rare source of clarity.

Start exploring MME today—and stop guessing which MLLM truly delivers.