Omnilingual ASR: Open-Source Speech Recognition for 1,600+ Languages—Including 500 Never Before Supported

Paper & Code: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages (2025) · facebookresearch/omnilingual-asr

For decades, automatic speech recognition (ASR) has flourished in high-resource languages like English, Spanish, or Mandarin. But for the vast majority of the world’s 7,000+ languages—particularly those spoken by small or marginalized communities—speech technology has remained out of reach. Building ASR systems for these “long-tail” languages has traditionally required massive labeled datasets, specialized engineering teams, and significant compute, placing them beyond the reach of local researchers, educators, or grassroots organizations.

Omnilingual ASR changes this equation. Developed by Meta’s AI research team and released as open-source under the Apache 2.0 license, it is the first large-scale multilingual ASR system explicitly designed for extensibility and inclusivity. Supporting over 1,600 languages—including more than 500 never before covered by any ASR system—it enables new languages to be added with just a handful of audio-text examples, no large dataset or machine learning expertise required.

Built on a foundation of self-supervised learning scaled to 7 billion parameters and a decoder inspired by large language models (LLMs), Omnilingual ASR achieves strong zero-shot generalization. This means it can transcribe speech in languages it has never seen during training, dramatically lowering the barrier to entry for under-resourced linguistic communities.

Why Omnilingual ASR Matters

Most commercial and open-source ASR systems today cover fewer than 100 languages. This leaves thousands of linguistic communities without access to voice assistants, transcription tools, or speech-enabled educational software. The gap isn’t just technical—it’s ethical. When speech technology excludes entire populations, it reinforces digital inequity.

Omnilingual ASR directly confronts this problem. By combining a massive and linguistically diverse training corpus with architectural innovations, it delivers usable transcription quality even for languages with minimal digital presence. Critically, the project incorporates community-sourced recordings gathered through compensated local partnerships, ensuring that expansion isn’t extractive but collaborative.

For project leads, researchers, and engineers working in global development, linguistics, education, or digital inclusion, this system offers a rare opportunity: the chance to deploy speech recognition in contexts where it was previously impossible.

Key Technical Capabilities

Broad Language Coverage with Standardized Identifiers

Omnilingual ASR supports 1,600+ languages, each identified using the {language_code}_{script} format (e.g., eng_Latn for English in Latin script, cmn_Hans for Simplified Chinese). This standardization simplifies integration into multilingual pipelines and ensures clarity across scripts and dialects.

You can programmatically check support:

from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs  
print(f"Total supported languages: {len(supported_langs)}")  

Zero-Shot Language Addition

One of the most powerful features is the ability to add a new language with only a handful of paired audio-text samples (as few as 5–10). Thanks to its LLM-inspired decoder and robust speech representations learned during pre-training, the model generalizes effectively to unseen languages—no retraining required.

This capability is especially valuable for NGOs documenting endangered languages or startups building voice interfaces for regional markets without existing ASR infrastructure.
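The repository documents the exact conditioning interface; conceptually, the input for a new language is nothing more than a handful of paired recordings and transcriptions. A minimal sketch of how such a few-shot set might be organized (the file paths and field names are illustrative, not the library's schema):

# Illustrative few-shot set for a previously unsupported language.
# 5-10 such pairs are typically enough, per the project's claims.
few_shot_examples = [
    {"audio": "/data/newlang/sample_01.wav", "text": "transcription of sample one"},
    {"audio": "/data/newlang/sample_02.wav", "text": "transcription of sample two"},
    {"audio": "/data/newlang/sample_03.wav", "text": "transcription of sample three"},
]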

Flexible Model Family for Diverse Hardware

Omnilingual ASR ships as a family of models, balancing accuracy and efficiency:

  • 300M variants: Suitable for mobile or edge devices (~2–5 GB VRAM)
  • 1B–3B variants: Ideal for standard cloud inference
  • 7B variants: Deliver state-of-the-art accuracy (character error rate below 10% for 78% of supported languages) but require ~17 GB VRAM

Two architectural flavors are available:

  • CTC models: Simpler, faster, but less adaptable to low-resource languages
  • LLM-based models: Support optional language conditioning, better zero-shot performance, and more coherent output

Additionally, the newly released “Unlimited” models (omniASR_LLM_Unlimited_*_v2) can process audio of arbitrary length—critical for interviews, lectures, or oral histories—though fine-tuning support for these variants is not yet available.
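As a rough rule of thumb, the choice of model card can be driven by the available GPU memory. The helper below is a hypothetical sketch: only omniASR_LLM_7B_v2 appears verbatim in the documentation cited here, and the smaller card names are illustrative placeholders.

def pick_model_card(vram_gb: float) -> str:
    """Hypothetical helper mapping a GPU memory budget to a model card.
    Card names other than omniASR_LLM_7B_v2 are illustrative placeholders."""
    if vram_gb >= 17:
        return "omniASR_LLM_7B_v2"    # highest accuracy, ~17 GB VRAM
    if vram_gb >= 8:
        return "omniASR_LLM_3B_v2"    # placeholder name for a mid-size variant
    return "omniASR_LLM_300M_v2"      # placeholder name for an edge-friendly variant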

Getting Started Quickly

Installation is straightforward:

pip install omnilingual-asr  

Basic inference takes just a few lines:

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline  

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_v2")  
audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]  
langs = ["eng_Latn", "fra_Latn"]  
transcriptions = pipeline.transcribe(audio_files, lang=langs, batch_size=2)  
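The pipeline returns one transcription per input file, so results can be paired back with their sources directly. A small follow-up to the snippet above, assuming transcriptions is a list of strings in input order:

for path, text in zip(audio_files, transcriptions):
    print(f"{path}: {text}")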

For evaluation, you can directly use the facebook/omnilingual-asr-corpus dataset from Hugging Face:

pip install "omnilingual-asr[data]"  

This dataset, released under CC-BY-4.0, includes community-contributed speech across hundreds of languages and is ideal for benchmarking or prototyping.
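With the [data] extra installed, the corpus can be pulled through the Hugging Face datasets library. A minimal sketch, assuming the corpus exposes per-language configurations named with the same {language_code}_{script} convention (the configuration name below is illustrative; check the dataset card for the exact configurations and splits):

from datasets import load_dataset

# Stream one language subset of the corpus (config name is illustrative;
# see the dataset card on Hugging Face for the available configurations).
ds = load_dataset("facebook/omnilingual-asr-corpus", "eng_Latn", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())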

Practical Use Cases

  • Community-led language preservation: Local groups can transcribe oral histories or educational content in their native tongue, even if no ASR existed before.
  • Global research and fieldwork: Linguists or anthropologists can transcribe interviews in real time, accelerating analysis without manual transcription.
  • Multilingual customer service: Enterprises can index or analyze voice interactions across diverse regional languages without building custom models per language.
  • Rapid prototyping: Startups can validate voice product ideas in new markets using minimal data, reducing time-to-market and R&D costs.

Limitations and Considerations

While powerful, Omnilingual ASR has practical constraints:

  • Audio length: Standard CTC and LLM models accept only audio under 40 seconds. Use the “Unlimited” variants for longer recordings (see the sketch after this list), but note that fine-tuning is currently unsupported for these.
  • Hardware demands: The 7B LLM-ASR model requires ~17 GB of GPU memory, making smaller variants (300M or 1B) more practical for edge deployment.
  • Zero-shot isn’t perfect: While the system works out-of-the-box for new languages, transcription quality improves significantly with even modest community validation or example refinement.
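For long recordings, one practical pattern is to check the duration first and route the file to an “Unlimited” pipeline only when it exceeds the 40-second cap. A sketch under two assumptions: soundfile is available for reading durations, and omniASR_LLM_Unlimited_7B_v2 is a valid concrete name matching the omniASR_LLM_Unlimited_*_v2 pattern.

import soundfile as sf
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

def transcribe_any_length(path: str, lang: str) -> str:
    duration = sf.info(path).duration  # length of the recording in seconds
    # Standard models are limited to ~40 s; fall back to an "Unlimited" variant otherwise.
    card = "omniASR_LLM_7B_v2" if duration <= 40 else "omniASR_LLM_Unlimited_7B_v2"
    pipeline = ASRInferencePipeline(model_card=card)
    return pipeline.transcribe([path], lang=[lang], batch_size=1)[0]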

Why Open Source and Ethics Are Central

Omnilingual ASR is released under the Apache 2.0 license—code, models, and training data are openly available. This transparency empowers researchers, local developers, and community organizations to inspect, modify, and redistribute the technology without legal barriers.

Meta’s approach also emphasizes ethical co-creation: recordings were gathered through compensated partnerships with local speakers, and the open release invites communities to take ownership of their linguistic digital future. In a field often dominated by closed, commercial systems, this commitment to openness and collaboration is both rare and transformative.

Summary

Omnilingual ASR is more than a technical achievement—it’s a step toward equitable speech technology. By supporting 1,600+ languages, enabling zero-shot language addition, and offering scalable models for diverse hardware, it empowers a new generation of builders to create voice-enabled applications for the world’s linguistic majority. Whether you’re a researcher, product developer, or community advocate, this open-source toolkit lowers the barrier to entry and invites participation in shaping a more inclusive AI future.