Hybrid Search and Re-ranking in Production RAG 2026: BM25, Dense, Cross-encoders, Fusion

Q: Why does dense-vector-only retrieval consistently fail on exact-term queries, and is the fix to use a better embedding model?

The failure mode is structural rather than configurational, and a better embedding model does not fix it. A bi-encoder embedding model maps a paragraph of text into a single fixed-size vector (1024-3072 dimensions are typical). Information has to be lost in that projection: the vector is trained to summarise what a 200-word paragraph is about, not to record which exact tokens it contains. The training objective is contrastive: pairs of semantically similar texts are pulled close together in the embedding space, and pairs of dissimilar texts are pushed apart. The training data consists overwhelmingly of paraphrase pairs, query-answer pairs, and translation pairs; almost none of it takes the form "the exact phrase X appears in this document and not in that document". The model therefore has no incentive to keep rare technical phrases distinct, because preserving every rare phrase would defeat the dimensionality reduction the bi-encoder exists to perform.

When the user queries for "dead-letter queue threshold", the phrase has been averaged into the paragraph-level meaning of every chunk that mentions it. The conceptual neighbourhood the embedding captures includes documents about retry policies, exponential backoff, error handling, and queue depth metrics: all reasonable conceptual matches, none of which contain the exact phrase. Switching from BGE to E5 to Voyage to OpenAI text-embedding-3-large moves recall by single-digit percentages on conceptual queries and barely moves it on exact-term queries, because the failure mode is shared across all bi-encoders.

The fix is to add a sparse retriever (BM25, SPLADE, uniCOIL) running in parallel and fuse the two rankings. The sparse retriever has lexical precision the dense retriever cannot have, the dense retriever has conceptual generalisation the sparse retriever cannot have, and the fused results cover the query distribution far better than either retriever alone.
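
To make the fusion step concrete, here is a minimal sketch using reciprocal rank fusion (RRF), one common way to merge a sparse and a dense ranking. The answer above does not prescribe a fusion method, so the choice of RRF is an assumption, as are the function name, the document IDs, and the two hit lists, which are made up for illustration. The constant k=60 is a widely used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the per-list scores are summed. Only ranks are used, so the
    incomparable raw scores of BM25 and cosine similarity never mix.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retriever outputs for the query "dead-letter queue threshold".
# The sparse retriever ranks the exact-phrase match first; the dense
# retriever ranks a conceptually related document first.
sparse_hits = ["dlq-config", "queue-metrics", "retry-policy"]
dense_hits = ["retry-policy", "dlq-config", "backoff-guide"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
# → ['dlq-config', 'retry-policy', 'queue-metrics', 'backoff-guide']
```

Documents that both retrievers return rise to the top, while a document found by only one retriever still survives into the fused list. In this toy case the exact-phrase match stays first even though the dense retriever ranked it second.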