Decompile-Bench: The First Million-Scale Real-World Benchmark for Training and Evaluating LLM-Powered Binary Decompilers

Paper: Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation (2025)
Code: https://github.com/albertan017/LLM4Decompile

Decompiling machine code back into human-readable source remains one of the most challenging and valuable tasks in software engineering, cybersecurity, and program analysis. While large language models (LLMs) have shown promising results in decompilation, their progress has been bottlenecked by a critical shortage of large-scale, high-quality, real-world binary-source code pairs. Synthetic datasets, competition-style benchmarks, and partial mappings (e.g., based only on variable names) fail to reflect the complexity of real compiled binaries, especially under aggressive compiler optimizations like inlining, dead code elimination, or register allocation.

Enter Decompile-Bench: the first open-source, million-scale benchmark specifically designed for real-world binary decompilation. Built from 450GB of binaries compiled from permissively licensed GitHub projects, Decompile-Bench delivers two million accurately aligned binary-source function pairs for training and 70,000 pairs for evaluation. This dataset directly addresses the data gap that has hindered the development and fair comparison of LLM-based decompilers—making it an essential resource for researchers, security engineers, and AI-for-code practitioners.

Why Real-World Binary-Source Pairs Matter

Traditional decompilation benchmarks often rely on synthetic code or contest problems (e.g., adapted from programming challenges), which rarely capture the chaotic reality of industrial codebases: third-party libraries, complex control flow, macros, and user-defined types. Worse, function inlining—a common compiler optimization—breaks the one-to-one mapping between source functions and binary symbols, rendering many datasets misaligned or unusable.

Decompile-Bench solves this by carefully reconstructing true binary-source correspondences from real open-source C projects. Every pair has been validated to ensure functional equivalence, enabling models trained on this data to learn decompilation patterns that generalize beyond toy examples. As a result, fine-tuning on Decompile-Bench yields a 20% absolute improvement in re-executability rate—a key metric that measures whether decompiled code passes original test cases—compared to models trained on prior benchmarks.
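To make the metric concrete, here is a minimal sketch of what a per-function re-executability check could look like, assuming the decompiled C and the original test assertions are available as strings; the helper name and harness layout are illustrative, not the benchmark's exact evaluation code.

    import os
    import subprocess
    import tempfile

    def is_reexecutable(decompiled_c: str, test_harness_c: str) -> bool:
        """Compile the decompiled function together with the original test
        assertions and report whether the resulting binary runs cleanly."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "candidate.c")
            exe = os.path.join(tmp, "candidate")
            with open(src, "w") as f:
                f.write(decompiled_c + "\n" + test_harness_c)
            # A compilation failure already counts as non-re-executable.
            if subprocess.run(["gcc", src, "-o", exe]).returncode != 0:
                return False
            try:
                # The harness exits non-zero when any assertion fails.
                return subprocess.run([exe], timeout=10).returncode == 0
            except subprocess.TimeoutExpired:
                return False

Compilation failures and runtime errors both count against the rate, which is why re-executability is a stricter signal than surface similarity between decompiled and original source.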

Key Features That Set Decompile-Bench Apart

Scale and Authenticity

With 2 million training pairs distilled from 100 million raw candidates, Decompile-Bench is the largest decompilation dataset to date. All binaries were compiled from real GitHub repositories using standard GCC toolchains, preserving realistic compilation artifacts and optimization behaviors.
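For intuition, a single aligned pair can be pictured roughly as follows; the field names are hypothetical and chosen only to illustrate what such a record contains, so check the released dataset for the actual schema.

    # Hypothetical illustration of one aligned pair; field names are invented
    # for illustration and may not match the released dataset's schema.
    pair = {
        "repo": "github.com/example/project",   # originating open-source project (placeholder)
        "function": "parse_header",
        "optimization": "O2",                   # GCC level the binary was built with
        "source": "int parse_header(const char *buf) { /* ... */ }",
        "assembly": "<parse_header>:\npush %rbp\nmov %rsp,%rbp\n...",
    }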

Multi-Optimization Coverage

The dataset includes binaries compiled at all four GCC optimization levels (O0, O1, O2, O3), reflecting how code transforms under different compiler settings. This is crucial because decompilation difficulty varies dramatically with optimization: for example, O3 aggressively inlines functions and reorders or eliminates code, while O0 largely preserves the source structure. Supporting all levels ensures robust evaluation across real-world scenarios.
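As a quick illustration of how much the optimization level matters, the sketch below compiles the same file at each level and compares the size of the resulting disassembly; the file path is a placeholder.

    import subprocess

    # Compile the same C file at every GCC optimization level and compare how
    # much the emitted code changes ("example.c" is a placeholder path).
    for opt in ("O0", "O1", "O2", "O3"):
        obj = f"example_{opt}.o"
        subprocess.run(["gcc", f"-{opt}", "-c", "example.c", "-o", obj], check=True)
        disasm = subprocess.run(["objdump", "-d", obj],
                                capture_output=True, text=True, check=True).stdout
        print(opt, len(disasm.splitlines()), "lines of disassembly")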

Evaluation Designed to Prevent Data Leakage

The evaluation subset, Decompile-Bench-Eval, combines two carefully curated sources:

  • HumanEval-Decompile: 164 functions using only standard C libraries, ideal for controlled testing.
  • ExeBench: 2,621 functions from post-2025 GitHub repositories, ensuring no overlap with training data.

This dual-track design validates both basic correctness and real-world generalization—addressing a common pitfall in AI-for-code benchmarks where test data leaks into training sets.
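One simple way to enforce such a temporal split is to filter evaluation candidates by repository creation date; the sketch below assumes each record carries a creation timestamp (the field name is hypothetical).

    from datetime import date

    CUTOFF = date(2025, 1, 1)  # repositories created on or after this date are eval-only

    def select_eval_candidates(records):
        """Keep only functions whose repository postdates the training cutoff.
        The 'repo_created_at' field name is hypothetical."""
        return [r for r in records
                if date.fromisoformat(r["repo_created_at"]) >= CUTOFF]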

Tight Integration with LLM4Decompile Models

Decompile-Bench isn’t just a dataset—it’s part of a full ecosystem. The companion LLM4Decompile series (1.3B to 22B parameters) demonstrates state-of-the-art results when trained on this benchmark. For example, the llm4decompile-9b-v2 model achieves a 64.9% re-executability rate, meaning nearly two-thirds of its decompiled functions pass all original test assertions.

Ideal Use Cases for Technical Decision-Makers

Training Specialized Decompilation Models

If you’re developing an LLM for reverse engineering, Decompile-Bench provides the only publicly available, large-scale source of truth for binary-to-C translation. Fine-tuning on this data dramatically boosts functional correctness—critical for applications like vulnerability analysis or legacy system recovery.
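A minimal sketch of how fine-tuning data might be prepared, assuming the pairs are published on Hugging Face and using the assembly-to-C instruction format from the LLM4Decompile README; the dataset path and field names are assumptions, so check the repository for the exact identifiers.

    from datasets import load_dataset

    # Assumed dataset path and field names; consult the LLM4Decompile
    # repository for the exact Hugging Face identifiers.
    ds = load_dataset("LLM4Binary/decompile-bench", split="train")

    def to_training_text(example):
        # Instruction format used by the LLM4Decompile models: assembly in, C out.
        prompt = ("# This is the assembly code:\n"
                  f"{example['assembly']}\n"
                  "# What is the source code?\n")
        return {"text": prompt + example["source"]}

    train_data = ds.map(to_training_text)
    # train_data can then be fed to any causal-LM fine-tuning loop, e.g. the
    # Hugging Face Trainer or trl's SFTTrainer.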

Benchmarking Decompiler Performance

Security teams evaluating commercial or open-source decompilers can use Decompile-Bench-Eval as a standardized testbed. Unlike ad-hoc evaluations, its re-executability metric objectively measures whether decompiled code actually works, not just whether it looks plausible.

Building AI-Augmented Reverse Engineering Tools

Tool developers integrating LLMs into Ghidra, IDA Pro, or Binary Ninja can leverage Decompile-Bench to train models that refine Ghidra’s pseudo-C output (as done in the V2 series). This hybrid approach—combining symbolic disassembly with neural refinement—delivers higher accuracy at lower compute cost.
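A minimal sketch of that refinement step, assuming the v2 models accept Ghidra's exported pseudo-C wrapped in the same instruction template as the assembly-based models; the file path is hypothetical and the exact prompt should be taken from the repository README.

    from transformers import pipeline

    # Assumed prompt template (mirrors the v1.5 assembly format); the exact v2
    # template should be taken from the LLM4Decompile README.
    refiner = pipeline("text-generation",
                       model="LLM4Binary/llm4decompile-9b-v2",
                       device_map="auto", torch_dtype="auto")

    ghidra_pseudo_c = open("func0_ghidra.c").read()  # pseudo-C exported from Ghidra (hypothetical path)
    prompt = (f"# This is the assembly code:\n{ghidra_pseudo_c}\n"
              "# What is the source code?\n")
    refined = refiner(prompt, max_new_tokens=512, return_full_text=False)[0]["generated_text"]
    print(refined)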

How to Get Started

Getting started with Decompile-Bench is straightforward:

  1. Install the environment via Conda or Docker:

    git clone https://github.com/albertan017/LLM4Decompile.git  
    cd LLM4Decompile  
    conda create -n 'llm4decompile' python=3.9 -y  
    conda activate llm4decompile  
    pip install -r requirements.txt  
    
  2. Preprocess your binary: Compile C code with GCC (O0–O3), then disassemble with objdump to extract the assembly for a target function. The toolkit includes scripts to clean the assembly by removing addresses and comments (an end-to-end sketch covering steps 2 and 3 follows this list), formatting it as:

    <func_name>:  
    lea (%rdi,%rsi,1),%eax  
    retq  
    
  3. Run inference with a pre-trained LLM4Decompile model (e.g., LLM4Binary/llm4decompile-6.7b-v1.5): feed the formatted assembly, and the model outputs human-readable C.
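The following sketch ties steps 2 and 3 together, assuming a single-function C file and a GPU with enough memory; the cleaning regex is a simplified stand-in for the toolkit's preprocessing scripts rather than their exact implementation.

    import re
    import subprocess
    from transformers import AutoTokenizer, AutoModelForCausalLM

    SRC, FUNC = "func0.c", "func0"   # hypothetical single-function C file and target symbol

    # Step 2: compile at a chosen optimization level and disassemble with objdump.
    subprocess.run(["gcc", "-O0", "-c", SRC, "-o", "func0.o"], check=True)
    disasm = subprocess.run(["objdump", "-d", "func0.o"],
                            capture_output=True, text=True, check=True).stdout

    # Keep only the target function's block and strip addresses and byte
    # encodings, leaving '<func_name>:' followed by bare instructions.
    block = disasm.split(f"<{FUNC}>:")[1].split("\n\n")[0]
    cleaned = [re.sub(r"^\s*[0-9a-f]+:\s*(?:[0-9a-f]{2}\s)+\s*", "", line)
               for line in block.splitlines() if ":" in line]
    asm = f"<{FUNC}>:\n" + "\n".join(l.strip() for l in cleaned if l.strip())

    # Step 3: run inference with a pre-trained model (prompt format from the repo README).
    model_id = "LLM4Binary/llm4decompile-6.7b-v1.5"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).cuda()
    prompt = f"# This is the assembly code:\n{asm}\n# What is the source code?\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))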

For quick prototyping, a Colab notebook and Docker image are available, and the 100k-sample subset (decompile-ghidra-100k) enables full training runs for under $20 on a single A100 GPU.

Limitations and Considerations

While Decompile-Bench represents a major leap forward, adopters should note:

  • Platform and language scope: The benchmark currently covers Linux x86_64 binaries and C source only; architectures such as ARM and languages such as C++ are not yet supported.
  • Function boundary dependency: Accurate decompilation assumes correct extraction of function-level assembly via tools like objdump. Heavily inlined or obfuscated functions may lack clean boundaries.
  • Optimization-level awareness: Most models perform best when the optimization level (O0–O3) is known at inference time. The llm4decompile-6.7b-uo variant mitigates this by training without optimization-level labels, though at a slight performance cost (~21.9% re-executability).

Summary

Decompile-Bench fills a critical void in the decompilation landscape by providing the first large-scale, real-world benchmark of aligned binary-source function pairs. Its scale, authenticity, and rigorous evaluation framework empower developers to train, test, and deploy LLM-based decompilers with confidence. For anyone working in reverse engineering, software maintenance, or AI-for-code, Decompile-Bench isn’t just another dataset—it’s the foundation for the next generation of intelligent decompilation tools.