Advanced · 10 min read

Hybrid search (BM25 + vector) for RAG systems

Pure vector search misses keywords. Pure keyword search misses semantics. Combine them — here's the recipe.

If you've shipped a RAG system that uses pure vector search, you've probably been surprised when a user types a clear keyword query like "refund policy" and the retrieval misses the document literally titled "Refund Policy." That's not a bug; that's vector search behaving as designed.

Hybrid search — combining BM25 (lexical / keyword) with dense vectors (semantic) — solves this. It's not a 2026 invention; it's been the gold standard in production RAG systems for two years. If you're building serious retrieval, you should use it.

Why pure vector search misses things

Dense embeddings are good at capturing meaning. They are bad at capturing exact identifiers:

  • Product SKU XB-447-Z. Embedding model has never seen it; cosine similarity to anything is meaningless.
  • Person name "Ahmadinejad." Rare token; embedding has high variance.
  • Acronym RAG versus the word rag. Most embedding pipelines lowercase and subword-tokenize the input, so the two often embed almost identically.
  • Recent news "DeepSeek V4 release" — if it's after the embedding model's training cutoff, it's basically random.

BM25 is the opposite. It excels at exact matches. "refund policy" finds documents containing those exact words. It misses paraphrases — "return process" won't match "refund policy" even though they mean the same thing.

Hybrid retrieval gets you both.

How BM25 works in 30 seconds

BM25 (Best Match 25) is a scoring function from the 1990s Okapi retrieval system, built on term frequency and inverse document frequency. For a query Q and document D:

  1. For each term in Q, count how often it appears in D (term frequency).
  2. Penalize common terms (the, a) with inverse document frequency.
  3. Normalize for document length (avoid bias toward long docs).
  4. Sum scores per term.

It's a single equation, runs in milliseconds on millions of docs with the right index, and has been the default relevance function in Lucene and Elasticsearch since 2016. Modern variants (BM25F, BM25+) tweak edge cases, but the core idea is unchanged.

For BM25 you don't need GPUs, embeddings, or training. Just an inverted index.
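
To make those four steps concrete, here's a minimal scorer in Python. It's a sketch of the standard Okapi formulation with the usual k1 and b defaults; a real engine computes this over an inverted index instead of scanning document term lists:

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Okapi BM25 score of one document for one query.

    doc_freq maps term -> number of corpus documents containing it;
    num_docs and avg_doc_len describe the corpus as a whole.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # Step 2: inverse document frequency (rare terms count more).
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # Steps 1 and 3: term frequency, saturated by k1 and
        # length-normalized by b.
        f = tf[term]
        weight = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        # Step 4: sum the per-term contributions.
        score += idf * weight
    return score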

The recipe: parallel + combine

The basic hybrid pattern:

  1. Send the user's query to both systems in parallel.
  2. Each returns its top K results with a score.
  3. Combine the rankings into a single list.
  4. Optionally rerank with a cross-encoder.

The combination step is where the choices live. Three common approaches:

RRF (Reciprocal Rank Fusion)

Simplest and surprisingly effective:

for each doc d in either result list:
  score(d) = 1/(k + rank_bm25(d)) + 1/(k + rank_vector(d))

Where k is a constant (60 is the standard). Documents ranking high in either list get high scores; documents ranking high in both get the highest. RRF doesn't care about absolute scores from each system, only relative ranks. This is robust because BM25 scores and cosine similarities aren't directly comparable.

This is the default. Use this unless you have a reason not to.
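
In Python, the entire fusion step is a few lines. A sketch, assuming each system returns an ordered list of document IDs:

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over any number of ranked ID lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fused = rrf_fuse([bm25_ids, vector_ids])[:10]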

Weighted score fusion

Normalize scores from both systems, then take a weighted sum:

final = alpha * normalized_bm25 + (1 - alpha) * normalized_vector

Gives more control. You can tune alpha for your domain (more weight on lexical for legal/medical, more on semantic for support chat). Requires you to actually do hyperparameter search, which is why most people just use RRF.
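
A sketch of the same idea in Python, using min-max normalization; both score dicts are assumed to map document ID to each system's raw score:

def minmax(scores):
    """Rescale raw scores to [0, 1] so the two systems are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_fuse(bm25_scores, vector_scores, alpha=0.5):
    b, v = minmax(bm25_scores), minmax(vector_scores)
    fused = {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * v.get(doc, 0.0)
             for doc in set(b) | set(v)}
    return sorted(fused, key=fused.get, reverse=True)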

Cross-encoder reranking

The expensive but most accurate option:

  1. Get top 50 from BM25 + top 50 from vector, dedup.
  2. Run all ~100 candidates through a cross-encoder (e.g. BAAI/bge-reranker-v2-m3, Cohere Rerank, Jina Reranker v2).
  3. Take the cross-encoder's top 10.

Cross-encoders compare query and document jointly with attention, so they catch nuances that bi-encoder embeddings miss. They're slower (run inference on every (query, doc) pair) but produce noticeably better top-K.

In 2026, this is what serious systems use: BM25 + vector candidates → reranker → final K.
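
With the sentence-transformers CrossEncoder class, the rerank step looks roughly like this. A sketch: candidates is assumed to be the deduplicated list of (doc_id, text) pairs from the two retrievers:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query, candidates, top_k=10):
    """candidates: deduplicated list of (doc_id, text) from both retrievers."""
    # Score every (query, doc) pair jointly with full attention.
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]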

Implementation in 2026

Three paths depending on your stack:

Path 1: Postgres with pgvector

Your data is already in Postgres. Use the built-in full-text search (tsvector + ts_rank) for BM25-ish lexical scoring and pgvector for vectors:

-- BM25-ish via tsvector + ts_rank
SELECT id, ts_rank(content_tsv, plainto_tsquery('refund policy')) AS bm_score
FROM docs
WHERE content_tsv @@ plainto_tsquery('refund policy')
ORDER BY bm_score DESC LIMIT 50;

-- Vector
SELECT id, 1 - (embedding <=> $1) AS vec_score
FROM docs
ORDER BY embedding <=> $1 LIMIT 50;

-- Combine in app code with RRF

For production-grade BM25 in Postgres, use ParadeDB's pg_search extension, which implements true BM25 scoring in SQL. It's the cleanest single-database hybrid setup in 2026.
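
The app-side glue is small. A sketch with psycopg, reusing the rrf_fuse helper from earlier; it assumes the docs table and columns from the queries above, and everything else is illustrative:

import psycopg  # psycopg 3

def hybrid_search(conn, query_text, query_embedding, limit=10):
    with conn.cursor() as cur:
        # Lexical leg (same query as above).
        cur.execute(
            """SELECT id FROM docs
               WHERE content_tsv @@ plainto_tsquery(%s)
               ORDER BY ts_rank(content_tsv, plainto_tsquery(%s)) DESC
               LIMIT 50""",
            (query_text, query_text))
        bm25_ids = [r[0] for r in cur.fetchall()]
        # Vector leg; the embedding list is passed as a pgvector literal.
        cur.execute(
            "SELECT id FROM docs ORDER BY embedding <=> %s::vector LIMIT 50",
            (str(query_embedding),))
        vector_ids = [r[0] for r in cur.fetchall()]
    # Fuse the two rankings with the rrf_fuse helper defined earlier.
    return rrf_fuse([bm25_ids, vector_ids])[:limit]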

Path 2: Elasticsearch / OpenSearch

Elasticsearch has supported approximate kNN search on dense_vector fields since 8.0 and full BM25 forever. Run two queries (text + kNN) and combine them with the rrf retriever:

{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": { "query": { "match": { "content": "refund policy" } } } },
        { "knn": { "field": "embedding", "query_vector": [...], "k": 50, "num_candidates": 200 } }
      ]
    }
  }
}

Native RRF was added to ES in 8.8. This is the cleanest setup if you already use Elasticsearch.

Path 3: Vector DB + separate BM25 service

Using Pinecone, Qdrant, Weaviate, or LanceDB? Most have built-in BM25 / sparse vector support now:

  • Qdrant. Native sparse vectors with BM25-style scoring, added in 2024.
  • Weaviate. Native hybrid query mode, with an alpha parameter for tuning.
  • Pinecone. Sparse-dense hybrid using SPLADE-style sparse vectors.
  • LanceDB. Full-text search + vector with built-in fusion.

If you're not on Postgres or Elastic, pick one of these.
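
For example, Qdrant's Query API does the fusion server-side. A sketch: the named vectors "bm25" and "dense" are assumed to have been configured at collection creation, and the encoder outputs are placeholders:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# sparse_indices / sparse_values come from your sparse encoder,
# dense_embedding from your dense encoder.
results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="bm25", limit=50),
        models.Prefetch(query=dense_embedding, using="dense", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # server-side RRF
    limit=10,
)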

Tuning advice

Once hybrid is wired up:

  • Run your eval set (you have one, right? See the RAG evaluation post). Compare hybrid recall@K vs pure vector. Hybrid usually beats pure vector by 5-15% recall on real queries, more on keyword-heavy queries.
  • Don't over-engineer alpha. RRF with k=60 is a solid baseline. Only tune alpha if your domain is heavily skewed (legal docs need more lexical, casual chat needs more semantic).
  • Add a reranker once retrieval looks decent. Reranking compounds gains: hybrid + reranker is typically 20-30% better than pure vector alone.
  • Watch latency. Two retrievals + rerank = 200-400ms. If users feel it, parallelize the BM25 and vector calls (they're independent; see the sketch below), and consider serving the reranker on a GPU.
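
A sketch of that parallel fan-out with asyncio; search_bm25 and search_vector are placeholders for your two async retrieval calls:

import asyncio

async def retrieve(query):
    # The two legs are independent, so fire them concurrently:
    # total latency becomes max(bm25, vector) instead of their sum.
    bm25_ids, vector_ids = await asyncio.gather(
        search_bm25(query, limit=50),    # placeholder: your lexical call
        search_vector(query, limit=50),  # placeholder: your vector call
    )
    return rrf_fuse([bm25_ids, vector_ids])[:10]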

When NOT to bother

  • Tiny corpus (<1000 docs). Pure vector is fine. The recall gap is noise.
  • Your queries are all natural language paraphrases. Pure semantic actually wins here. BM25 doesn't help if users never type keywords.
  • You haven't built an eval set. Do that first. Without it you can't tell if hybrid actually helped.

A practical default

If you're starting fresh and want one decision: use Qdrant + Cohere Rerank in 2026. Qdrant gives you native hybrid search and is fast and easy to operate. Cohere Rerank is the highest-quality reranker available as a hosted API, priced per search and cheap enough to be a rounding error for most workloads. Total infrastructure: one Qdrant cluster, one API key, a few lines of glue code. This setup beats 95% of bespoke retrieval pipelines.

If you can't use external APIs, swap Cohere for BAAI/bge-reranker-v2-m3 self-hosted on a single GPU.
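
The glue really is short. A sketch with the Cohere Python SDK; the model name and response shape are assumptions, so check the current SDK docs:

import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank_with_cohere(query, docs, top_n=10):
    """docs: list of candidate strings, e.g. from the Qdrant query above."""
    resp = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=docs, top_n=top_n)
    return [docs[r.index] for r in resp.results]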

Further reading

  • Pretrained Transformers for Text Ranking: BERT and Beyond — the dense retrieval canon.
  • BEIR benchmark paper — cross-domain retrieval evaluation, validates hybrid superiority.
  • Vespa / Pinecone / Qdrant blog posts on production hybrid setups.
  • Look up: SPLADE, ColBERT, BM25 vs SPLADE, query expansion.

Last updated: 2026-04-29
