Parsing complex document images—those containing intertwined text paragraphs, tables, mathematical formulas, figures, and code—is a persistent challenge in applied AI. Traditional approaches either stitch together multiple specialized models (increasing integration complexity and latency) or rely on large autoregressive vision-language models that degrade layout fidelity and struggle with efficiency. Enter Dolphin, a 0.3B-parameter open-source multimodal model from ByteDance that rethinks document parsing through a clean, two-stage “analyze-then-parse” workflow. Designed for developers, researchers, and product teams, Dolphin delivers high accuracy, natural reading-order structure, and parallel element decoding—all within a lightweight architecture that runs efficiently on modest GPU hardware.
How Dolphin Solves the Document Parsing Puzzle
The Two-Stage Workflow: Structure First, Content Second
Dolphin’s core innovation lies in its staged approach. Unlike end-to-end generators that blur layout and content, Dolphin decouples analysis from generation:
- Stage 1 (Layout Analysis): Dolphin scans the input document image and outputs a sequence of heterogeneous layout elements—text blocks, tables, formulas, figures—in their natural reading order. This sequence acts as a “structural blueprint” of the page.
- Stage 2 (Parallel Content Parsing): Each element from Stage 1 is treated as a heterogeneous anchor. Dolphin couples these anchors with task-specific prompts (e.g., “parse this as a LaTeX formula” or “convert this to a Markdown table”) and parses all elements in parallel.
This design eliminates the trade-off between structural integrity and parsing speed. Layout is preserved faithfully, while content generation benefits from parallelism—making Dolphin both accurate and efficient.
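The workflow is easiest to see in code. The sketch below is illustrative only: analyze_layout and parse_element are hypothetical stand-ins for Dolphin's two decoding passes, not names from the actual repo, and the prompt strings are representative rather than canonical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for Dolphin's two decoding passes; the real
# calls live in the repo's demo scripts. Prompt wording is illustrative.
def analyze_layout(page_image):
    """Stage 1: return (element_type, bbox) pairs in reading order."""
    return [("text", (0, 0, 800, 200)), ("table", (0, 220, 800, 500))]

def parse_element(page_image, element_type, bbox):
    """Stage 2: decode one cropped element with a type-specific prompt."""
    prompts = {
        "text": "Read text in the image.",
        "table": "Parse the table in the image.",
        "formula": "Read formula in the image.",
    }
    return f"<{element_type} parsed with prompt {prompts[element_type]!r}>"

def parse_document(page_image):
    elements = analyze_layout(page_image)  # the structural blueprint
    # Dolphin batches all element crops through one decoder pass; threads
    # here merely illustrate that Stage 2 runs concurrently, not serially.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(parse_element, page_image, t, b) for t, b in elements]
        return [f.result() for f in futures]  # results stay in reading order

print(parse_document(page_image=None))
```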
Why This Matters in Practice
For real-world applications, maintaining reading order and element boundaries is non-negotiable. A financial report misordered by a parser could lead to incorrect data interpretation; a research paper with scrambled formulas loses scientific value. Dolphin’s anchor-based prompting ensures each element type is handled with the right “instruction,” avoiding the one-size-fits-all pitfalls of monolithic models.
Key Features That Deliver Real-World Value
Unified Architecture, Multiple Parsing Granularities
Dolphin supports three inference modes out of the box:
- Page-level parsing: Outputs a full structured representation in both JSON and Markdown.
- Layout parsing: Extracts bounding boxes and types of all elements in reading order.
- Element-level parsing: Processes individual crops (e.g., a table image or formula snippet) with type-specific decoding.
This flexibility lets you use Dolphin as a complete document processor or as a specialized element extractor—without switching models.
Built for Speed and Integration
Despite its 0.3B size, Dolphin punches above its weight:
- Parallel element decoding drastically reduces latency compared to sequential autoregressive models.
- Official support for vLLM and TensorRT-LLM enables accelerated inference with minimal setup.
- The model is compatible with Hugging Face Transformers, easing integration into existing pipelines (see the sketch below).
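To illustrate the Transformers route, here is a minimal loading sketch. It assumes the checkpoint follows the Donut-style VisionEncoderDecoderModel interface and that the prompt string is representative; consult the Hugging Face model card for the exact entry points.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

# Assumption: the checkpoint follows the Donut-style
# VisionEncoderDecoderModel interface; check the model card for the
# exact entry points and prompt strings.
model_dir = "./hf_model"  # local dir from the download step in Getting Started
processor = AutoProcessor.from_pretrained(model_dir)
model = VisionEncoderDecoderModel.from_pretrained(model_dir)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.eval().to(device)

image = Image.open("page.png").convert("RGB")
prompt = "Parse the reading order of this document."  # illustrative Stage-1 prompt
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
prompt_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

with torch.inference_mode():
    output_ids = model.generate(
        pixel_values=pixel_values,
        decoder_input_ids=prompt_ids,
        max_new_tokens=1024,
    )
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```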
Recent benchmarks confirm that Dolphin-1.5 (the latest version) achieves state-of-the-art results on OmniDocBench and custom datasets such as Fox-Page and Dolphin-Page, with error metrics (e.g., edit distance) significantly lower than those of its predecessor.
Multi-Page PDF Ready
As of June 2025, Dolphin natively handles multi-page PDFs, making it suitable for enterprise-scale document processing—not just single-page scans.
Ideal Use Cases
Dolphin excels in scenarios where structured, layout-aware digital conversion of printed documents is needed:
- Enterprise automation: Digitizing invoices, contracts, or regulatory filings into queryable formats.
- Academic and research workflows: Converting PDF papers into editable, structured Markdown with intact formulas and tables.
- RAG preprocessing: Generating clean, semantically segmented input for retrieval-augmented generation systems (a sketch follows this list).
- Accessibility tools: Reconstructing logical reading flow for screen readers or document summarization.
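For the RAG case, the page-level JSON output maps naturally onto chunking. Below is a minimal sketch that assumes each parsed element arrives as a dict with label and text fields; the actual schema is defined by the repo's sample outputs, so adjust the keys accordingly.

```python
import json
from pathlib import Path

def to_rag_chunks(dolphin_json_path, max_chars=1200):
    # Assumption: page-level JSON is a reading-ordered list of elements,
    # each with a "label" (text/table/formula/...) and a "text" payload.
    # Check the repo's sample outputs for the exact schema and keys.
    elements = json.loads(Path(dolphin_json_path).read_text())
    chunks, current = [], ""
    for elem in elements:
        piece = elem.get("text", "")
        if elem.get("label") in ("table", "formula"):
            # Keep tables and formulas as standalone chunks so their
            # structure survives retrieval intact.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        elif len(current) + len(piece) > max_chars:
            if current:
                chunks.append(current)
            current = piece
        else:
            current = (current + "\n" + piece).strip()
    if current:
        chunks.append(current)
    return chunks
```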
It’s optimized for clean, printed documents—think textbooks, journal articles, or business reports—not handwritten notes or severely degraded scans.
Getting Started in Minutes
Using Dolphin requires minimal setup:
- Clone the repository and install dependencies:
  ```bash
  git clone https://github.com/ByteDance/Dolphin.git
  cd Dolphin
  pip install -r requirements.txt
  ```
- Download the Dolphin-1.5 model from Hugging Face:
  ```bash
  huggingface-cli download ByteDance/Dolphin-1.5 --local-dir ./hf_model
  ```
- Run inference—choose your granularity:
  - Page-level:

    ```bash
    python demo_page.py --model_path ./hf_model --input_path doc.pdf
    ```

  - Layout-only:

    ```bash
    python demo_layout.py --model_path ./hf_model --input_path page.png
    ```

  - Element-specific:

    ```bash
    python demo_element.py --model_path ./hf_model --element_type table --input_path table_crop.png
    ```
Batch processing (e.g., --max_batch_size 8) is supported for higher throughput.
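For bulk conversion, one option is to drive the demo script from Python. A minimal sketch, assuming demo_page.py takes one file per --input_path and writes to a hypothetical --save_dir output flag (check the script's --help; it may also accept a directory directly):

```python
import subprocess
from pathlib import Path

# Minimal batch driver. Assumptions: demo_page.py takes one file per
# --input_path, and --save_dir is the output flag (check --help; the
# script may also accept a directory directly, making this loop moot).
for pdf in sorted(Path("./inbox").glob("*.pdf")):
    subprocess.run(
        [
            "python", "demo_page.py",
            "--model_path", "./hf_model",
            "--input_path", str(pdf),
            "--save_dir", "./results",
            "--max_batch_size", "8",
        ],
        check=True,
    )
```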
Limitations to Consider
Dolphin is not a universal OCR tool. It assumes machine-printed input with standard layouts. Performance may degrade on:
- Handwritten content
- Low-quality or heavily skewed scans
- Documents with unconventional design (e.g., posters, comics)
Additionally, while efficient, Dolphin still benefits from GPU acceleration—CPU-only inference is not recommended for production.
The team actively solicits failure cases via GitHub issues to improve robustness, reflecting a commitment to community-driven refinement.
Summary
Dolphin redefines document image parsing by combining layout-aware analysis, parallel content generation, and lightweight efficiency in a single model. For teams tired of juggling fragile pipelines of expert models or sacrificing structure for speed, Dolphin offers a compelling alternative: accurate, fast, and open. Whether you’re building a document intelligence platform or preprocessing data for downstream LLMs, Dolphin provides a robust, production-ready foundation.