MinerU: High-Precision Open-Source Document Parsing for Real-World PDFs, Tables, and Formulas

Paper & Code

MinerU: An Open-Source Solution for Precise Document Content Extraction

2024 • opendatalab/MinerU

★50296

Converting real-world documents—especially PDFs containing mixed content like equations, tables, multi-column layouts, and scanned text—into clean, structured, machine-readable formats remains a persistent headache for engineers, researchers, and product teams. Commercial OCR and parsing tools often come with usage limits, cost barriers, or black-box behavior. Meanwhile, many open-source alternatives struggle with accuracy, robustness, or scalability across diverse document types.

Enter MinerU: an open-source, actively maintained document understanding toolkit designed not just for academic benchmarks, but for real-world engineering use. Built on top of the robust PDF-Extract-Kit foundation and continuously refined through thousands of community-reported edge cases, MinerU delivers high-precision extraction of text, tables, mathematical formulas, and layout structures—while offering unmatched flexibility in deployment, hardware support, and output control.

Unlike naive PDF-to-text converters, MinerU intelligently handles document complexity: it distinguishes headers from body text, reconstructs reading order across columns and pages, merges fragmented tables, and renders LaTeX from handwritten or typeset formulas—all while supporting over 100 languages.

Why Existing Tools Fall Short

Many document parsing pipelines fail when faced with:

Borderless or rotated tables that break traditional rule-based detectors.
Mixed Chinese-English formulas that confuse generic OCR engines.
Scanned PDFs with skewed or low-resolution content, where layout and text overlap.
Multi-column academic papers where reading order must be inferred, not assumed.
Inconsistent outputs due to missing postprocessing logic (e.g., orphaned captions, split paragraphs).

Generic OCR libraries like Tesseract or even PaddleOCR alone cannot resolve these challenges—they extract raw text but ignore semantic structure. End-to-end vision-language models (VLMs) often hallucinate or omit details. MinerU bridges this gap by combining modular, specialized models with carefully tuned preprocessing and postprocessing rules, ensuring both fidelity and usability.

Core Capabilities That Solve Real Problems

Accurate Table Parsing—Even in Tough Cases

MinerU supports wired, borderless, semi-structured, and rotated tables (0°, 90°, 270°). It can merge tables split across pages and correctly associate captions and footnotes—even when multiple tables appear on a single page. Recent updates (v2.6.2+) significantly improved parsing of long, complex financial or scientific tables, with output rendered cleanly in HTML for direct web integration.

Reliable Formula Recognition in LaTeX

Mathematical expressions—especially long, nested, or bilingual (Chinese-English) formulas—are accurately converted to LaTeX using the UniMERNet-based formula recognition engine. MinerU 2.5 further boosts performance on hybrid notation, making it ideal for digitizing academic literature or STEM textbooks. Experimental Chinese formula support can be enabled via environment variable when needed.

Semantic Structure Preservation

MinerU doesn’t just extract text—it reconstructs document semantics. It identifies:

Headings (with hierarchical classification)
Paragraphs and lists (including cross-column continuation)
Images with their descriptive captions
Footnotes and references

The output respects human-readable reading order, using layout-aware sorting (via layoutreader) rather than naive top-to-bottom scanning. This is critical for multi-column journals, technical manuals, or legal documents.

Multilingual OCR with Major Accuracy Gains

Leveraging PPOCRv5, MinerU supports 109+ languages, including Latin, Cyrillic, Arabic, Devanagari, Thai, and Greek. Recent releases improved Latin-script accuracy by 11% and boosted Cyrillic, Arabic, Telugu, and Tamil recognition by over 40% compared to prior models. Handwritten text is also supported, with automatic layout adaptation for handwritten zones.

Smart Document Mode Detection

MinerU automatically detects whether a PDF is digital (text-based) or scanned (image-based). If the latter, it seamlessly enables OCR without manual intervention. Users can also override this via CLI or API for fine-grained control.

Flexible Output Formats

Choose from:

Markdown (semantic, multimodal-friendly)
JSON (content_list.json) with bounding boxes (bbox normalized to 0–1000)
Intermediate layout JSON (middle.json) for custom downstream logic
Visualized PDFs showing detected spans and regions

This enables seamless integration into RAG pipelines, knowledge bases, or data labeling workflows.

Performance and Deployment Flexibility

MinerU offers two complementary backends, letting users trade off speed, accuracy, and hardware constraints:

Pipeline Mode (~82 OmniDocBench Score)

CPU-friendly, runs with as little as 6GB GPU VRAM (or pure CPU)
Uses optimized ONNX models for layout, OCR, table, and formula tasks
Ideal for edge devices, batch processing, or environments where determinism matters

VLM Mode (~90+ OmniDocBench Score with MinerU2.5)

Powered by the 1.2B-parameter MinerU2.5 model, which outperforms 72B+ VLMs like GPT-4o and Qwen2.5-VL on document parsing
Supports acceleration via vLLM, LMDeploy, MLX (Apple Silicon), or HTTP clients
Best for high-accuracy scenarios: academic publishing, regulatory compliance, or scientific data extraction

Recent optimizations have made MinerU dramatically faster:

OCR speed increased by 200–300%
Formula parsing up to 1400% faster in batch mode
Reduced CPU contention in high-concurrency API deployments via thread tuning

Deployment options include:

Command line: mineru -p input.pdf -o output/
FastAPI server: with auto-generated docs (toggle via MINERU_API_ENABLE_FASTAPI_DOCS)
Gradio WebUI: for quick local testing
Docker: for reproducible cloud or on-prem setups

Environment variables allow fine control over timeouts (MINERU_PDF_RENDER_TIMEOUT), concurrency (MINERU_API_MAX_CONCURRENT_REQUESTS), and feature toggles (e.g., MINERU_TABLE_MERGE_ENABLE=0 to disable table merging).

Practical Use Cases

MinerU shines in scenarios where structure, accuracy, and automation matter:

Academic digitization: Extract equations, references, and multi-column layouts from arXiv papers or journal PDFs without losing semantics.
Financial & legal document processing: Parse complex tables in earnings reports or regulatory filings, even when rotated or split across pages.
Enterprise knowledge management: Convert internal PDF archives into structured Markdown or JSON for RAG-powered search.
SaaS product integration: Replace costly commercial APIs with an open, auditable, and customizable parsing engine.
Multilingual research: Handle documents mixing English text with Chinese formulas or Arabic footnotes seamlessly.

Known Limitations

While powerful, MinerU isn’t a magic bullet. Be aware of these constraints:

Photographed documents (e.g., smartphone captures of whiteboards) are not well supported.
Highly stylized layouts like comics, children’s textbooks, or art portfolios may parse poorly.
Vertical text has limited (experimental) support.
Rare list formats or code blocks may not be recognized reliably.
Extremely complex tables can occasionally misalign rows/columns.
VLM backend requires Linux/WSL2 for best vLLM performance; Windows native support is limited to LMDeploy.

These are clearly documented in the project’s “Known Issues” and continuously improved through community feedback.

Getting Started Is Effortless

Install with one command:
```
uv pip install -U "mineru[core]"  
```
Download models automatically:
```
mineru models download  
```
Parse a PDF:
```
mineru -p document.pdf -o ./output  
```
(Optional) Launch a local WebUI:
```
mineru webui  
```

No JSON config editing. No manual model wrangling. Just results.

Summary

MinerU solves the messy reality of document understanding with an open, precise, and engineer-friendly toolkit. It combines specialized models, robust postprocessing, and deployment flexibility to deliver consistent, high-quality extraction across scientific papers, financial reports, legal documents, and multilingual content. Whether you’re building a research pipeline or a commercial SaaS product, MinerU offers a transparent, performant, and actively maintained alternative to closed-source black boxes—without sacrificing accuracy or control.