Why This Decision Matters More Than You Think

Search is deceptively complex. You can stand up Elasticsearch in an afternoon and have something that works. But whether it works well — whether it surfaces the right document when a user asks “how do I reset my subscription?” instead of typing “subscription reset steps” — is an entirely different engineering problem.

The two dominant retrieval paradigms today are BM25 (term-based statistical ranking) and Vector Search (semantic embedding-based retrieval). Both are mature. Both are production-proven. And increasingly, you need to understand not just how each works, but why one fails where the other succeeds — and how to combine them effectively.

This is especially relevant if you are building any system that sits on top of an LLM: RAG pipelines, enterprise knowledge bases, semantic document stores, or AI-assisted search. Getting retrieval wrong means the model gets bad context, and bad context means bad answers regardless of how capable the model is.


1. BM25 — The Probabilistic Workhorse

BM25 (Best Match 25) is a term-frequency ranking function that has been the backbone of search engines since the mid-1990s. It is still the default ranking algorithm in Elasticsearch and OpenSearch today.

How It Works

BM25 scores a document against a query based on three factors:

Term Frequency (TF): How often does the query term appear in the document? Relevance increases with frequency, but with diminishing returns — a document mentioning “authentication” 20 times is not 20x more relevant than one mentioning it once.

Inverse Document Frequency (IDF): How rare is the term across the entire corpus? Terms that appear in nearly every document (“the”, “is”, “a”) carry little discriminating weight. Terms that appear in few documents (“OAuth2”, “mTLS”) carry high weight.

Document Length Normalization: Longer documents have more opportunity to contain any given term. BM25 penalizes long documents to prevent them from dominating results purely by volume.

The score formula:

Score(D, Q) = Σ IDF(qᵢ) · [ f(qᵢ, D) · (k1 + 1) ] / [ f(qᵢ, D) + k1 · (1 - b + b · |D|/avgdl) ]

Where k1 controls term frequency saturation (typically 1.2–2.0) and b controls length normalization (typically 0.75). These are tunable — and your corpus characteristics should drive that tuning.

Where BM25 Excels

  • Exact and near-exact keyword matching — product SKUs, error codes, legal citations, medical terminology, CLI flags
  • High-precision queries — when the user knows the exact terminology of the domain
  • Transparent, debuggable ranking — you can explain exactly why a document ranked where it did
  • Low infrastructure cost — no GPU, no embedding model, no vector index; runs on commodity hardware
  • Latency — sub-millisecond at scale on inverted indexes; Elasticsearch handles billions of documents efficiently

Where BM25 Breaks Down

  • Vocabulary mismatch — the user says “cancel membership”, the document says “terminate subscription”. BM25 scores this as zero overlap.
  • Semantic intent — “I can’t log in” vs. “authentication failure” — same problem, zero term overlap
  • Multi-language queries — a query in French will not match an equivalent document in English without explicit translation
  • Conceptual similarity — “affordable laptop” will not match a document titled “budget-friendly notebook computer”

BM25 is fundamentally a bag-of-words model. It has no understanding of meaning, context, or relationships between concepts. It is a very well-engineered frequency counter.


2. Vector Search — Semantic Retrieval via Embeddings

Vector search transforms text into dense numerical vectors in a high-dimensional space (typically 384 to 3072 dimensions, depending on the model), where semantically similar content is geometrically close. Retrieval becomes a nearest-neighbor search in that space.

How It Works

  1. At index time: every document (or chunk) is passed through an embedding model — a transformer-based neural network — that produces a fixed-length vector representation.
  2. At query time: the query is embedded using the same model, producing a query vector.
  3. Retrieval: the system finds the top-K document vectors closest to the query vector, typically via cosine similarity or dot product.
similarity(A, B) = (A · B) / (||A|| · ||B||)

Exact nearest-neighbor search is O(n) and impractical at scale. Production systems use Approximate Nearest Neighbor (ANN) algorithms:

  • HNSW (Hierarchical Navigable Small World) — graph-based, high recall, high memory usage. Default in most vector DBs.
  • IVF (Inverted File Index) — cluster-based, lower memory, slightly lower recall. Common in Faiss.
  • PQ (Product Quantization) — compresses vectors for memory-constrained environments, at the cost of recall.

The Embedding Model Is Everything

The quality of your vector search is entirely determined by your embedding model. Key considerations:

Factor What to Evaluate
Domain fit Was the model trained on data similar to your corpus? General-purpose models underperform on specialized domains (legal, medical, code).
Dimensionality Higher dimensions = more expressive but slower and more memory-intensive.
Max token length Most models cap at 512 or 8192 tokens. Chunking strategy must align with this.
Multilingual support Models like multilingual-e5 handle cross-language retrieval natively.
Instruction-tuned Models like e5-mistral or text-embedding-3-large accept task-specific prefixes to improve retrieval quality.

Where Vector Search Excels

  • Semantic equivalence — “cancel membership” correctly matches “terminate subscription”
  • Intent-based queries — natural language questions map to relevant answers even without term overlap
  • Cross-lingual retrieval — multilingual embedding models handle queries and documents in different languages
  • Fuzzy conceptual matching — “fast database for reads” surfaces documents about read replicas, caching, and CQRS
  • RAG pipelines — LLMs generate natural language questions; vector search is the correct retrieval layer

Where Vector Search Breaks Down

  • Exact term matching — searching for NullPointerException or ORDER-2847-XZ — an embedding model may dilute the specificity of these tokens
  • Keyword-heavy technical queries — CLI flags, config keys, API parameter names
  • Hallucinated semantic proximity — unrelated documents can land near your query vector if the model does not understand your domain
  • Infrastructure cost — embedding inference requires compute (GPU for large models); vector indexes consume significant RAM (HNSW keeps the full index in memory)
  • Staleness — re-embedding is required when documents change; incremental updates to HNSW indexes are expensive

3. Head-to-Head: When Each Wins

Scenario BM25 Vector Search Why
Search for error code ERR_SSL_PROTOCOL_ERROR Exact token match; embedding dilutes specificity
“How do I improve query performance?” Matches “optimization”, “indexing”, “slow queries” conceptually
Legal document search by statute number Precise citation matching
Customer support: “my card keeps getting declined” Maps to payment failure, fraud detection, card validation content
Product catalog search by SKU Exact alphanumeric match
RAG knowledge base for an LLM assistant LLM-generated queries are natural language, not keyword queries
Multi-language corpus, single-language query Multilingual embedding models handle this natively
Faceted/filtered search with field boosting ⚠️ BM25 has native field-level scoring; vector filters are post-hoc

4. Hybrid Search — The Production Reality

In most real-world systems, neither BM25 nor vector search alone is sufficient. The answer is hybrid retrieval: running both pipelines in parallel and combining their scores.

Reciprocal Rank Fusion (RRF)

RRF is the standard technique for merging ranked lists from multiple retrievers without requiring score normalization:

RRF_score(d) = Σ 1 / (k + rank(d))

Where k is a smoothing constant (typically 60) and rank(d) is the document’s rank in each individual result list. Documents that rank well in both lists receive a substantial boost. Documents strong in only one list still surface, but below the intersection.

RRF is rank-based, not score-based — which means you do not need to normalize cosine similarity ([-1,1]) against BM25 scores ([0, ∞]) before merging. This makes it operationally clean and robust.

Hybrid Architecture Pattern

Query
  │
  ├──► BM25 Retriever (Elasticsearch / OpenSearch)
  │         └── Top-K candidates
  │
  └──► Vector Retriever (Pinecone / Weaviate / pgvector / Qdrant)
            └── Top-K candidates
                    │
                    ▼
            RRF Merge (or weighted score fusion)
                    │
                    ▼
            Re-ranker (optional — cross-encoder for precision)
                    │
                    ▼
            Final Top-N Results → LLM Context / Response

The Re-ranker Layer

After initial retrieval (the recall stage), a cross-encoder re-ranker can significantly improve precision. Unlike bi-encoders (which encode query and document independently), cross-encoders process the query-document pair jointly, producing a much more accurate relevance score — at higher compute cost.

This is the pattern:

  • Retrieval (high recall, lower precision): BM25 + vector search, top 50–100 candidates
  • Re-ranking (high precision, higher cost): cross-encoder on the top candidates, select top 5–10
  • Generation: pass final candidates as context to the LLM

Models like cross-encoder/ms-marco-MiniLM-L-6-v2 or Cohere’s Rerank API work well here.


5. Tooling Landscape

Pure Vector Stores

Tool Best For Notes
Pinecone Managed, production-scale vector search Fully managed, excellent performance, no BM25 native
Qdrant Self-hosted or cloud, high-performance Built-in payload filtering, sparse vector support (hybrid-ready)
Weaviate Hybrid search out of the box BM25 + vector natively; supports GraphQL
Chroma Local dev, prototyping Not production-grade at scale
Milvus High-volume, cloud-native Kubernetes-native, supports multiple ANN indexes

Hybrid-Native (BM25 + Vector)

Tool Notes
Elasticsearch 8.x ELSER (learned sparse retrieval) + dense vector; native hybrid via knn + query
OpenSearch Neural search plugin; hybrid query support
Weaviate Native BM25 + vector, RRF built-in
Azure AI Search Hybrid retrieval with semantic re-ranking; strong Azure ecosystem integration

Relational + Vector

Tool Notes
pgvector (PostgreSQL) vector type with HNSW/IVF index; combine with full-text search in a single query
SQLite-vec Lightweight, embedded; good for local or edge deployments

Embedding Model Providers

Provider Models Best For
OpenAI text-embedding-3-small, text-embedding-3-large General-purpose, strong baseline
Cohere embed-v3 Multilingual, strong retrieval performance
Hugging Face e5-large-v2, bge-large-en-v1.5 Self-hosted, open weights
Voyage AI voyage-2, voyage-code-2 Domain-specific (code, legal, finance)

6. Chunking Strategy — The Underrated Dependency

Vector search quality is highly sensitive to how you split documents before embedding. This is frequently where RAG pipelines fail silently.

Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is simple but often cuts across sentence or paragraph boundaries, degrading embedding quality.

Semantic chunking splits on natural boundaries — paragraphs, sections, sentences — preserving coherent units of meaning. Higher quality, more complex to implement.

Hierarchical chunking maintains both parent (full section) and child (sentence-level) chunks. Retrieve at the child level for precision, return the parent chunk as context for the LLM. This is the pattern used in LlamaIndex’s ParentDocumentRetriever and LangChain’s equivalent.

The rule of thumb: embed at the granularity you want to retrieve, return at the granularity that gives the LLM sufficient context.


7. Decision Framework

Run through this with your team before committing to a retrieval architecture:

Question If Yes →
Do users search with exact terms, IDs, or codes? Include BM25
Do users ask natural language questions? Include Vector Search
Is the corpus domain-specific (legal, medical, code)? Use a domain-specific embedding model
Do you need sub-10ms query latency at scale? Lean toward BM25 or a well-indexed ANN (HNSW)
Are you building a RAG pipeline for an LLM? Hybrid retrieval + re-ranker
Does the corpus change frequently? Factor in re-embedding cost; consider incremental update support
Multi-language corpus or user base? Multilingual embedding model (Cohere embed-v3, multilingual-e5)
Do you need to explain why a result ranked? BM25 is transparent; vector similarity is a black box

8. Common Architecture Mistakes

Mistake 1: Using Vector Search Alone for RAG

Vector search is semantically powerful but recall is not guaranteed. BM25 catches exact-match cases that vector search misses — especially for technical queries. A RAG pipeline without BM25 in the retrieval stack will have systematic blind spots.

Mistake 2: Ignoring Chunk Boundaries

Embedding a 5000-token document as a single vector averages out the signal and destroys retrieval specificity. Chunk. Always. And overlap chunks to avoid losing context at boundaries.

Mistake 3: Using a General-Purpose Embedding Model on a Specialized Corpus

text-embedding-3-large is an excellent general-purpose model. It is a mediocre model for retrieving surgical procedure documentation or tax code annotations. Evaluate domain-specific models and fine-tune if your corpus has unique vocabulary.

Mistake 4: Skipping the Re-ranker

Initial retrieval optimizes for recall. The re-ranker optimizes for precision. Sending the top-20 BM25+vector results directly to an LLM without re-ranking is a missed opportunity to dramatically improve answer quality — especially when context window limits mean you can only pass top-5.

Mistake 5: Not Evaluating Retrieval Independently

Teams frequently evaluate the end-to-end RAG pipeline (retrieval + generation together) and attribute poor results to the LLM. Often, it is retrieval that is failing. Measure retrieval quality independently using metrics like Recall@K, MRR (Mean Reciprocal Rank), and NDCG (Normalized Discounted Cumulative Gain) before debugging the generation layer.


Closing: It’s Not Either/Or

BM25 and vector search are not competing technologies. They are complementary retrieval strategies with different failure modes. BM25 is fast, transparent, and exact. Vector search is semantic, flexible, and powerful for natural language. In most production systems — especially any system with an LLM in the loop — you want both.

The winning architecture for most teams:

  1. Hybrid retrieval — BM25 + vector search in parallel
  2. RRF merge — rank fusion without score normalization headaches
  3. Cross-encoder re-ranker — precision pass on the merged candidate set
  4. Evaluate retrieval independently — Recall@K and MRR before you blame the model

📌 Key Takeaway

If you are building any AI-powered search or RAG system and you have not explicitly decided on your retrieval strategy — BM25, vector, or hybrid — you have implicitly made a poor decision. Get retrieval right first. Everything downstream depends on it.


Search Architecture  •  Information Retrieval  •  March 2026

Further Reading: Robertson & Zaragoza — The Probabilistic Relevance Framework: BM25 and Beyond (2009), Karpukhin et al. — Dense Passage Retrieval (2020), Cormack et al. — Reciprocal Rank Fusion (2009), Lewis et al. — Retrieval-Augmented Generation for NLP (2020)