Parsing complex document images—those containing intertwined text paragraphs, tables, mathematical formulas, figures, and code—is a persistent challenge in applied AI. Traditional approaches either stitch together multiple specialized models (increasing integration complexity and latency) or rely on large autoregressive vision-language models that degrade layout fidelity and struggle with efficiency. Enter Dolphin, a 0.3B-parameter open-source multimodal model from ByteDance that rethinks document parsing through a clean, two-stage “analyze-then-parse” workflow. Designed for developers, researchers, and product teams, Dolphin delivers high accuracy, natural reading-order structure, and parallel element decoding—all within a lightweight architecture that runs efficiently on modest GPU hardware.
How Dolphin Solves the Document Parsing Puzzle
The Two-Stage Workflow: Structure First, Content Second
Dolphin’s core innovation lies in its staged approach. Unlike end-to-end generators that blur layout and content, Dolphin decouples analysis from generation:
- Stage 1 (Layout Analysis): Dolphin scans the input document image and outputs a sequence of heterogeneous layout elements—text blocks, tables, formulas, figures—in their natural reading order. This sequence acts as a “structural blueprint” of the page.
- Stage 2 (Parallel Content Parsing): Each element from Stage 1 is treated as a heterogeneous anchor. Dolphin couples these anchors with task-specific prompts (e.g., “parse this as a LaTeX formula” or “convert this to a Markdown table”) and parses all elements in parallel.
This design eliminates the trade-off between structural integrity and parsing speed. Layout is preserved faithfully, while content generation benefits from parallelism—making Dolphin both accurate and efficient.
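The workflow is easiest to see in code. The sketch below is illustrative only: analyze_layout and parse_element are hypothetical stand-ins for Dolphin's two decoding passes, not names from the actual repo, and the prompt strings are representative rather than canonical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for Dolphin's two decoding passes; the real
# calls live in the repo's demo scripts. Prompt wording is illustrative.
def analyze_layout(page_image):
    """Stage 1: return (element_type, bbox) pairs in reading order."""
    return [("text", (0, 0, 800, 200)), ("table", (0, 220, 800, 500))]

def parse_element(page_image, element_type, bbox):
    """Stage 2: decode one cropped element with a type-specific prompt."""
    prompts = {
        "text": "Read text in the image.",
        "table": "Parse the table in the image.",
        "formula": "Read formula in the image.",
    }
    return f"<{element_type} parsed with prompt {prompts[element_type]!r}>"

def parse_document(page_image):
    elements = analyze_layout(page_image)  # the structural blueprint
    # Dolphin batches all element crops through one decoder pass; threads
    # here merely illustrate that Stage 2 runs concurrently, not serially.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(parse_element, page_image, t, b) for t, b in elements]
        return [f.result() for f in futures]  # results stay in reading order

print(parse_document(page_image=None))
```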
Why This Matters in Practice
For real-world applications, maintaining reading order and element boundaries is non-negotiable. A financial report misordered by a parser could lead to incorrect data interpretation; a research paper with scrambled formulas loses scientific value. Dolphin’s anchor-based prompting ensures each element type is handled with the right “instruction,” avoiding the one-size-fits-all pitfalls of monolithic models.
Key Features That Deliver Real-World Value
Unified Architecture, Multiple Parsing Granularities
Dolphin supports three inference modes out of the box:
- Page-level parsing: Outputs a full structured representation in both JSON and Markdown.
- Layout parsing: Extracts bounding boxes and types of all elements in reading order.
- Element-level parsing: Processes individual crops (e.g., a table image or formula snippet) with type-specific decoding.
This flexibility lets you use Dolphin as a complete document processor or as a specialized element extractor—without switching models.
Built for Speed and Integration
Despite its 0.3B size, Dolphin punches above its weight:
- Parallel element decoding drastically reduces latency compared to sequential autoregressive models.
- Official support for vLLM and TensorRT-LLM enables accelerated inference with minimal setup.
- The model is compatible with Hugging Face Transformers, easing integration into existing pipelines (see the sketch below).
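To illustrate the Transformers route, here is a minimal loading sketch. It assumes the checkpoint follows the Donut-style VisionEncoderDecoderModel interface and that the prompt string is representative; consult the Hugging Face model card for the exact entry points.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Assumption: the checkpoint follows the Donut-style
# VisionEncoderDecoderModel interface; check the model card for the
# exact entry points and prompt strings.
model_dir = "./hf_model"  # local dir from the download step in Getting Started
processor = AutoProcessor.from_pretrained(model_dir)
model = VisionEncoderDecoderModel.from_pretrained(model_dir)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval().to(device)

image = Image.open("page.png").convert("RGB")
prompt = "Parse the reading order of this document."  # illustrative Stage-1 prompt
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
prompt_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

with torch.inference_mode():
    output_ids = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_new_tokens=1024,
    )
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```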
Recent benchmarks confirm that Dolphin-1.5 (the latest version) achieves state-of-the-art results on OmniDocBench and custom datasets such as Fox-Page and Dolphin-Page, with error metrics (e.g., edit distance) significantly lower than those of its predecessor.
Multi-Page PDF Ready
As of June 2025, Dolphin natively handles multi-page PDFs, making it suitable for enterprise-scale document processing—not just single-page scans.
Ideal Use Cases
Dolphin excels in scenarios where structured, layout-aware digital conversion of printed documents is needed:
- Enterprise automation: Digitizing invoices, contracts, or regulatory filings into queryable formats.
- Academic and research workflows: Converting PDF papers into editable, structured Markdown with intact formulas and tables.
- RAG preprocessing: Generating clean, semantically segmented input for retrieval-augmented generation systems (a sketch follows this list).
- Accessibility tools: Reconstructing logical reading flow for screen readers or document summarization.
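For the RAG case, the page-level JSON output maps naturally onto chunking. Below is a minimal sketch that assumes each parsed element arrives as a dict with label and text fields; the actual schema is defined by the repo's sample outputs, so adjust the keys accordingly.

```python
import json
from pathlib import Path

def to_rag_chunks(dolphin_json_path, max_chars=1200):
    # Assumption: page-level JSON is a reading-ordered list of elements,
    # each with a "label" (text/table/formula/...) and a "text" payload.
    # Check the repo's sample outputs for the exact schema and keys.
    elements = json.loads(Path(dolphin_json_path).read_text())
    chunks, current = [], ""
    for elem in elements:
        piece = elem.get("text", "")
        if elem.get("label") in ("table", "formula"):
            # Keep tables and formulas as standalone chunks so their
            # structure survives retrieval intact.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        elif len(current) + len(piece) > max_chars:
            if current:
                chunks.append(current)
            current = piece
        else:
            current = (current + "\n" + piece).strip()
    if current:
        chunks.append(current)
    return chunks
```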
It’s optimized for clean, printed documents—think textbooks, journal articles, or business reports—not handwritten notes or severely degraded scans.
Getting Started in Minutes
Using Dolphin requires minimal setup:
- Clone the repository and install dependencies:
  ```bash
  git clone https://github.com/ByteDance/Dolphin.git
  cd Dolphin
  pip install -r requirements.txt
  ```
- Download the Dolphin-1.5 model from Hugging Face:
  ```bash
  huggingface-cli download ByteDance/Dolphin-1.5 --local-dir ./hf_model
  ```
- Run inference—choose your granularity:
  - Page-level:

    ```bash
    python demo_page.py --model_path ./hf_model --input_path doc.pdf
    ```

  - Layout-only:

    ```bash
    python demo_layout.py --model_path ./hf_model --input_path page.png
    ```

  - Element-specific:

    ```bash
    python demo_element.py --model_path ./hf_model --element_type table --input_path table_crop.png
    ```
Batch processing (e.g., --max_batch_size 8) is supported for higher throughput.
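For bulk conversion, one option is to drive the demo script from Python. A minimal sketch, assuming demo_page.py takes one file per --input_path and writes to a hypothetical --save_dir output flag (check the script's --help; it may also accept a directory directly):

```python
import subprocess
from pathlib import Path

# Minimal batch driver. Assumptions: demo_page.py takes one file per
# --input_path, and --save_dir is the output flag (check --help; the
# script may also accept a directory directly, making this loop moot).
for pdf in sorted(Path("./inbox").glob("*.pdf")):
    subprocess.run(
        [
            "python", "demo_page.py",
            "--model_path", "./hf_model",
            "--input_path", str(pdf),
            "--save_dir", "./results",
            "--max_batch_size", "8",
        ],
        check=True,
    )
```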
Limitations to Consider
Dolphin is not a universal OCR tool. It assumes machine-printed input with standard layouts. Performance may degrade on:
- Handwritten content
- Low-quality or heavily skewed scans
- Documents with unconventional design (e.g., posters, comics)
Additionally, while efficient, Dolphin still benefits from GPU acceleration—CPU-only inference is not recommended for production.
The team actively solicits failure cases via GitHub issues to improve robustness, reflecting a commitment to community-driven refinement.
Summary
Dolphin redefines document image parsing by combining layout-aware analysis, parallel content generation, and lightweight efficiency in a single model. For teams tired of juggling fragile pipelines of expert models or sacrificing structure for speed, Dolphin offers a compelling alternative: accurate, fast, and open. Whether you’re building a document intelligence platform or preprocessing data for downstream LLMs, Dolphin provides a robust, production-ready foundation.