Magika: AI-Powered File Type Detection with 99% Accuracy and Millisecond Speed

Paper & Code

Magika: AI-Powered Content-Type Detection

2024 • google/magika

Identifying what kind of data is inside a file seems simple until you’re dealing with corrupted headers, obfuscated malware, or ambiguous text formats. Traditional tools like the Unix file command rely on magic numbers and hand-written rules, and many applications simply trust the file extension; both signals can be spoofed, missing, or misleading. Enter Magika: a lightweight, AI-driven file type detector developed by Google that delivers ~99% accuracy across more than 200 content types, processing each file in milliseconds, even on a single CPU.

Magika is already in production at massive scale: it’s used by Gmail to triage email attachments, integrated into VirusTotal for malware analysis, and deployed across Google’s Safe Browsing and Drive systems to route files to the right security scanners. If your work involves handling untrusted, heterogeneous, or unlabeled file data—whether in security, DevOps, reverse engineering, or data pipelines—Magika offers a modern, reliable alternative to fragile, rule-based detection methods.

Why Traditional File Detection Falls Short

Classic file-type identification tools suffer from three key limitations:

Fragility: They depend on file signatures at fixed byte offsets. If those bytes are missing or altered (as in packed or encrypted malware), detection fails.
Poor text handling: Formats like Python, JavaScript, or INI files share no consistent binary structure, so tools like the file command often label them simply as “plain text.”
Scalability issues: Rules must be manually curated and updated, making it hard to support new or niche formats without constant maintenance.
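
The first two limitations can be illustrated with a toy signature matcher. This is a minimal sketch, not how the file command is actually implemented: the three-entry SIGNATURES table, the detect_by_magic helper, and the UTF-8 fallback are all assumptions made for illustration.

```python
# Toy signature table: (offset, magic bytes, label). Real tools ship
# thousands of hand-maintained rules; these three are illustrative only.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"%PDF-", "pdf"),
    (0, b"\x7fELF", "elf"),
]


def detect_by_magic(data: bytes) -> str:
    """Return the first label whose magic bytes match at its fixed offset."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    # Crude fallback: anything that decodes as UTF-8 is called "plain text".
    try:
        data.decode("utf-8")
        return "plain text"
    except UnicodeDecodeError:
        return "unknown"


png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
print(detect_by_magic(png))                        # png
print(detect_by_magic(b"\xff" + png[1:]))          # unknown (one header byte altered)
print(detect_by_magic(b"import os\nprint(1)\n"))   # plain text (Python source)
```

Changing a single header byte is enough to defeat the match, and source code falls through to a generic text label because no fixed-offset signature can describe it.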

Magika solves these problems by replacing heuristic rules with a compact deep learning model trained on ~100 million real-world files. It doesn’t rely on a header signature alone: it reads a limited sample of bytes from fixed regions of the file, such as its beginning and end, and infers the type from that content, even when metadata is absent or deceptive.

Core Technical Advantages

Speed and Efficiency You Can Deploy Anywhere

Magika’s model weighs only about 1–2 MB and runs inference in ~5 milliseconds per file on a single CPU core. Crucially, inference time is nearly constant regardless of file size, because Magika feeds the model only a small, fixed-size sample of the file’s bytes rather than its full content. This makes it ideal for high-throughput environments, such as scanning email attachments or processing terabytes of user-uploaded content.
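
To see why a bounded sample keeps the cost independent of file size, consider the sketch below. It is only an illustration of the idea: the sample_window helper, the 1024-byte block size, and the head-plus-tail layout are assumptions, and Magika’s real feature extraction differs in its details.

```python
import io


def sample_window(stream, size: int, block: int = 1024) -> bytes:
    """Read at most `block` bytes from the start and `block` bytes from the end.

    `stream` must be seekable and `size` is its total length in bytes.
    The amount of data read is capped at 2 * block, whatever `size` is.
    """
    stream.seek(0)
    head = stream.read(min(block, size))
    # Start the tail after the head so short files are not read twice.
    tail_start = max(size - block, len(head))
    stream.seek(tail_start)
    tail = stream.read(size - tail_start)
    return head + tail


small = io.BytesIO(b"a" * 10)
large = io.BytesIO(b"b" * 5_000_000)
print(len(sample_window(small, 10)))          # 10
print(len(sample_window(large, 5_000_000)))   # 2048
```

A 10-byte file and a 5 MB file both yield an input of at most 2048 bytes, so whatever model consumes the sample does the same amount of work for each.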

High Accuracy Across Binary and Textual Formats

While traditional tools excel with common binaries (e.g., PNG, PDF), they stumble on code and configuration files. Magika closes this gap: in internal evaluations across 200+ content types—including Python, Bash, Dockerfile, Excel, ELF binaries, and more—it achieves an average F1 score of ~99%. This is especially transformative for textual formats, where context and syntax matter more than magic bytes.

Confidence-Aware Predictions

Magika doesn’t just output a label—it tells you how sure it is. It supports three prediction modes:

high-confidence: Only returns labels when the model’s score exceeds a strict per-type threshold. Ideal for security-critical contexts where false positives are unacceptable.
medium-confidence: Balances precision and coverage for general-purpose use.
best-guess: Always returns the top prediction, even for ambiguous files.

When confidence is too low, Magika falls back to generic labels like “Generic text document” or “Unknown binary data,” preventing misleading classifications.
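
The interaction between the three modes and the generic fallback labels can be sketched as follows. The threshold values, the apply_mode function, and the is_text flag are hypothetical; only the mode names and the two fallback labels come from the description above.

```python
# Hypothetical per-type thresholds; real values would ship with the model.
THRESHOLDS = {"python": 0.90, "javascript": 0.85}
DEFAULT_THRESHOLD = 0.50  # assumed cut-off for medium-confidence mode
GENERIC = {True: "Generic text document", False: "Unknown binary data"}


def apply_mode(label: str, score: float, is_text: bool, mode: str) -> str:
    """Turn a raw (label, score) prediction into the label that is reported."""
    if mode == "best-guess":
        return label  # always trust the top prediction
    if mode == "high-confidence":
        threshold = THRESHOLDS.get(label, DEFAULT_THRESHOLD)
    elif mode == "medium-confidence":
        threshold = DEFAULT_THRESHOLD
    else:
        raise ValueError(f"unknown prediction mode: {mode}")
    # Below the threshold, fall back to a generic text or binary label.
    return label if score >= threshold else GENERIC[is_text]


print(apply_mode("python", 0.7, True, "high-confidence"))    # Generic text document
print(apply_mode("python", 0.7, True, "medium-confidence"))  # python
print(apply_mode("python", 0.3, True, "best-guess"))         # python
```

The same raw prediction produces a specific label in one mode and a generic one in another, which is the trade-off between precision and coverage that the modes expose.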

Practical Use Cases

Security and Threat Intelligence

In malware analysis, correctly identifying a file’s true type is the first step to choosing the right sandbox or disassembler. Magika helps VirusTotal and abuse.ch rapidly classify unknown samples, reducing false negatives from obfuscated payloads.

Content Policy and Compliance

Email and cloud storage platforms must enforce policies based on file type (e.g., blocking executables in attachments). Magika enables Gmail to accurately detect disguised scripts or archives, improving user safety without overblocking legitimate documents.

Developer Tooling and Data Pipelines

Automated build systems, dataset curators, or reverse engineers often process heterogeneous files. With Magika’s CLI (magika -r ./project_dir) or Python API, you can quickly catalog file types, validate input formats, or route files to format-specific processors—all without writing custom parsers.
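
Routing by detected type usually reduces to a lookup table keyed on the label. The sketch below assumes the label string has already been obtained from a detector; the handler functions, the HANDLERS table, and the specific label strings are illustrative placeholders rather than part of Magika.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def lint_python(path: str) -> str:
    return f"lint:{path}"


def scan_executable(path: str) -> str:
    return f"sandbox:{path}"


def quarantine(path: str) -> str:
    # Default for labels with no dedicated processor.
    return f"quarantine:{path}"


# Map content-type labels to format-specific processors.
HANDLERS: Dict[str, Handler] = {
    "python": lint_python,
    "elf": scan_executable,
}


def route(path: str, label: str) -> str:
    """Dispatch a file to the handler registered for its detected label."""
    return HANDLERS.get(label, quarantine)(path)


print(route("tool.py", "python"))   # lint:tool.py
print(route("a.out", "elf"))        # sandbox:a.out
print(route("blob.bin", "zip"))     # quarantine:blob.bin
```

Keeping detection and dispatch separate means the table can grow as new content types are supported, without touching the detection step.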

Getting Started in Seconds

Magika is designed for immediate adoption:

Install the CLI:

```
pipx install magika
# or
curl -LsSf https://securityresearch.google/magika/install.sh | sh
```

Scan a directory recursively:
```
magika -r ./my_data/
```

Use in Python:

```
from magika import Magika

m = Magika()  # loads the bundled model once; reuse the instance across files
result = m.identify_path("script.py")
print(result.output.label)  # e.g., "python"
```

Try it in the browser: A client-side web demo runs Magika locally—no installation required.

Bindings are also available for JavaScript/TypeScript (via npm) and Go (in progress), enabling integration into web apps or cloud services.

Limitations to Keep in Mind

While powerful, Magika isn’t a silver bullet:

It supports 200+ content types, but not every obscure format. Check the GitHub repo for the latest list.
It identifies, but does not validate or parse files. A file labeled “PDF” isn’t guaranteed to be a valid PDF—just that its content strongly resembles one.
The model is periodically updated; staying current via pip install --upgrade magika ensures access to new formats and improved accuracy.
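
Because identification is not validation, a pipeline that needs well-formed input should still run a format-specific check after detection. A minimal sketch for JSON, using only the Python standard library; the is_valid_json helper is a hypothetical name, and the label is assumed to have come from an earlier detection step.

```python
import json


def is_valid_json(data: bytes) -> bool:
    """Return True only if the bytes parse as a complete JSON document."""
    try:
        json.loads(data)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False
    return True


# Both inputs look like JSON at a glance; only the first one parses.
print(is_valid_json(b'{"name": "magika", "types": 200}'))  # True
print(is_valid_json(b'{"name": "magika", "types": 200'))   # False
```

A truncated document may still resemble JSON closely enough to be labeled as such, so the parse step is what actually guarantees usable input.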

Summary

Magika redefines file type detection by combining deep learning with real-world practicality. With near-perfect accuracy, millisecond latency, minimal resource usage, and production-proven reliability, it solves a quiet but pervasive problem in computing: knowing what you’re actually dealing with. For technical decision-makers in security, DevOps, data engineering, or research, Magika offers a low-friction, open-source path to smarter, safer file handling—without requiring machine learning expertise.