In today’s world of AI-powered search and retrieval, speed, simplicity, and low resource usage are non-negotiable—especially during prototyping, research, or deployment in constrained environments. Yet many developers still rely on heavyweight search libraries or full-fledged engines like Elasticsearch just to perform basic BM25 ranking. Enter BM25S: a pure-Python, dependency-light implementation of the classic BM25 algorithm that delivers up to 500x faster query throughput than popular alternatives—all while using only NumPy and SciPy.
BM25S isn’t just fast; it’s built for real-world usability. It precomputes and stores BM25 scores as sparse matrices during indexing, shifting the computational burden away from query time. This “eager scoring” strategy enables lightning-fast retrieval, even on large corpora like Natural Questions (2M+ docs) or MSMARCO (8M+ docs), without requiring Java, PyTorch, or complex infrastructure.
If you’ve ever waited seconds for a single query to return results from rank-bm25, or balked at the disk footprint of Elasticsearch for a simple offline retrieval task, BM25S offers a compelling alternative.
Why BM25S Stands Out
Blazing-Fast Performance, Even Against Industry Giants
BM25S redefines what’s possible with pure Python in lexical search. Benchmarks on BEIR datasets show it consistently outperforms other Python libraries by orders of magnitude:
- 573 queries/second on ArguAna (vs. 2 for rank-bm25)
- 1,196 queries/second on NFCorpus (vs. 224 for rank-bm25)
- Competitive with Elasticsearch (often 5–60x faster) despite being written in Python
Even highly optimized Java-based systems struggle to match BM25S’s throughput in single-threaded settings. For teams prioritizing latency in batch retrieval or offline evaluation, this speed translates directly into faster iteration cycles.
Minimal Dependencies, Maximum Portability
BM25S requires only two core dependencies: NumPy and SciPy. There’s no JVM, no CUDA, no PyTorch—just lightweight, installable-in-seconds Python code. Optional extras like PyStemmer or jax (for accelerated top-k selection) are truly optional and don’t bloat your environment.
Compare disk usage after installation:
- rank-bm25: ~99 MB
- BM25S: ~479 MB
- bm25_pt (PyTorch-based): ~5.3 GB
- Pyserini (Java-dependent): ~7 GB
This makes BM25S ideal for educational settings, Docker containers, serverless functions, or any environment where minimizing dependencies is critical.
A Practical, Production-Ready Workflow
From Corpus to Ranked Results in Minutes
Using BM25S is intentionally straightforward. Here’s the typical flow:
- Tokenize your documents (with optional stemming and stopword removal)
- Index the corpus—this is where eager scoring happens
- Query with tokenized input and instantly retrieve top-k results
import bm25s

# Toy corpus; in practice this is your full document collection
corpus = ["a cat purrs", "a dog barks", ...]

# Tokenize with English stopword removal, then index (eager scoring happens here)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Tokenize the query and retrieve the top 2 documents with their scores
results, scores = retriever.retrieve(bm25s.tokenize("does the fish purr?"), k=2)
The entire workflow takes only minutes, even for corpora with millions of documents, and once indexed, queries return in milliseconds.
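If you also pass the corpus to retrieve, the results hold the matching documents themselves rather than integer indices. A minimal sketch continuing the snippet above, assuming the corpus keyword argument of retrieve behaves as in recent bm25s releases:

# Return documents instead of indices by passing the corpus along
# (the corpus= keyword of retrieve() is assumed from current bm25s docs)
query_tokens = bm25s.tokenize("does the fish purr?")
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

# results and scores have shape (n_queries, k)
for rank in range(results.shape[1]):
    print(f"Rank {rank + 1} (score={scores[0, rank]:.2f}): {results[0, rank]}")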
Flexible BM25 Variants and Custom Tokenization
BM25S doesn’t assume one-size-fits-all. It supports five BM25 variants from Kamphuis et al. (2020), including:
- lucene (the default, matching Elasticsearch's exact scoring)
- atire
- robertson
- bm25+ and bm25l (both tunable via a delta parameter)
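Switching variants is just a constructor argument. A minimal sketch, assuming the method, k1, b, and delta keyword arguments of bm25s.BM25 behave as in recent releases:

import bm25s

# Hypothetical tuning: BM25+ with custom k1/b and a delta shift
# (parameter names assumed from current bm25s documentation)
corpus = ["a cat purrs", "a dog barks"]
retriever = bm25s.BM25(method="bm25+", k1=1.2, b=0.75, delta=1.0)
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

Because scores are precomputed at indexing time, changing the variant or its parameters means re-indexing; the retrieve call itself stays the same.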
Tokenization is equally flexible. You can:
- Use built-in multilingual stopwords
- Plug in custom stemmers (e.g., PyStemmer; see the sketch below)
- Define your own splitter or stopword list
- Use the Tokenizer class for fine-grained control over vocab building and memory usage
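For example, a PyStemmer stemmer can be applied at tokenization time so that documents and queries share the same normalization. A minimal sketch, assuming the stemmer keyword of bm25s.tokenize and the PyStemmer API work as documented:

import bm25s
import Stemmer  # provided by the optional PyStemmer package

# Build an English stemmer and use it for both corpus and queries
stemmer = Stemmer.Stemmer("english")
corpus = ["a cat purrs", "a dog barks"]
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Queries must be tokenized with the same stemmer and stopword list
query_tokens = bm25s.tokenize("purring cats", stopwords="en", stemmer=stemmer)
results, scores = retriever.retrieve(query_tokens, k=1)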
Memory and Storage Efficiency at Scale
For large corpora, BM25S offers memory-mapped (mmap) loading, which keeps the index on disk and loads only needed portions into RAM. On the NQ dataset (2M+ docs):
- In-memory loading: 4.45 GB RAM
- With mmap: just 0.70 GB RAM (using batched reloading)
This enables retrieval on machines with modest memory—critical for cloud cost optimization or edge deployment.
Disk footprint is equally lean. The indexed sparse matrices are compact, and the entire package installs cleanly without hidden Java runtimes or deep learning frameworks.
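In practice this means saving the index once, then reloading it memory-mapped wherever queries are served. A minimal sketch, assuming the save/load API and its mmap flag behave as in recent bm25s releases (the directory name is arbitrary):

import bm25s

corpus = ["a cat purrs", "a dog barks"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

# Persist the sparse index (and optionally the corpus) to disk
retriever.save("bm25s_index", corpus=corpus)

# Later, or in another process: memory-map the index so the sparse
# matrices stay on disk and only the touched pages are loaded into RAM
reloaded = bm25s.BM25.load("bm25s_index", mmap=True, load_corpus=True)
results, scores = reloaded.retrieve(bm25s.tokenize("does the fish purr?"), k=1)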
Share and Collaborate via Hugging Face Hub
BM25S integrates natively with Hugging Face Hub through the BM25HF class. You can:
- Save your index + corpus to a public or private HF repo
- Load prebuilt indices from the community
- Version your retrieval models like ML checkpoints
This fosters reproducibility—researchers can share not just datasets, but fully functional, fast retrieval systems.
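A minimal sketch of the round trip, assuming the save_to_hub and load_from_hub methods of BM25HF work as in recent bm25s releases (the repository name and token handling below are placeholders):

import os
import bm25s
from bm25s.hf import BM25HF

# Index a small corpus with the Hub-aware retriever
corpus = ["a cat purrs", "a dog barks"]
retriever = BM25HF()
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

# Push the index and corpus to a repo you control (token is a placeholder)
retriever.save_to_hub("your-username/bm25s-demo-index", corpus=corpus, token=os.environ.get("HF_TOKEN"))

# Anyone with access can reload it and query immediately
loaded = BM25HF.load_from_hub("your-username/bm25s-demo-index", load_corpus=True)
results, scores = loaded.retrieve(bm25s.tokenize("does the fish purr?"), k=1)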
Ideal Use Cases
BM25S excels in scenarios where:
- You need fast, offline keyword-based retrieval (e.g., RAG pipelines, document filtering)
- You’re prototyping a search feature and want to avoid Elasticsearch setup overhead
- Your environment restricts dependencies (e.g., air-gapped systems, minimal containers)
- You’re running large-scale retrieval benchmarks and need high QPS without distributed infrastructure
- You want a lightweight fallback when semantic search fails or is unavailable
It’s not meant to replace Elasticsearch in complex production systems requiring real-time indexing, faceting, or query parsing—but for static corpora and lexical matching, it’s often more than sufficient.
Limitations to Consider
BM25S is purpose-built for lexical (keyword-based) search. It does not support:
- Semantic or embedding-based retrieval
- Real-time document updates (indexing is offline-only)
- Advanced query features like boolean logic, phrase matching, or highlighting
If your use case demands these, pair BM25S with a vector retriever in a hybrid setup—or stick with a full search engine. But for pure BM25 ranking on fixed corpora, BM25S sets a new standard for speed and simplicity.
Summary
BM25S delivers on a simple promise: make classical BM25 retrieval fast, lightweight, and Python-native. By precomputing scores into sparse matrices and avoiding external dependencies, it achieves performance previously thought impossible in pure Python—outpacing even Java-based systems in query throughput while using a fraction of the memory.
Whether you’re a researcher evaluating retrieval models, an engineer building a lean RAG pipeline, or a student learning information retrieval, BM25S lowers the barrier to high-performance lexical search. With a five-minute install and a few lines of code, you can replace slow, bloated alternatives with a tool built for the modern Python stack.
Install it today with pip install bm25s—and experience BM25 at the speed it was meant to run.