Semantic search uses AI embeddings to find information based on meaning and intent rather than exact keyword matches. A semantic search system understands that 'heart attack treatment' and 'myocardial infarction therapy' are asking the same thing, and can return relevant results even when none of the query words appear in the document.
Keyword search vs semantic search
The fundamental difference: keyword search finds documents that contain your words; semantic search finds documents that mean what you mean.
| Dimension | Keyword / Lexical (BM25) | Semantic (Embedding-based) |
|---|---|---|
| How it works | Term frequency × inverse document frequency scoring | Encode query + docs to vectors; cosine similarity |
| Handles synonyms? | No — "heart attack" misses "myocardial infarction" | Yes — same region in embedding space |
| Handles paraphrase? | No — different words = no match | Yes — meaning preserved in embedding |
| Handles typos? | Partially (fuzzy matching add-ons) | Yes — nearby spelling = similar embedding |
| Speed | Very fast — inverted index lookup | Fast — ANN search, ~1–10ms |
| Interpretable? | Yes — exact term matches visible | Less so — similarity score only |
| When it fails | Vocabulary mismatch, paraphrase, concept queries | Very specific technical terms, very short docs |
| Best for | Exact product names, codes, legal terms | General Q&A, intent search, FAQ matching |
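The vocabulary-mismatch failure mode is easy to see concretely. A minimal sketch — the `term_overlap` helper and the example strings are illustrative, not from any library:

```python
# Toy illustration of the vocabulary-mismatch failure mode of keyword search.
# Lexical overlap is zero even though query and document mean the same thing.

def term_overlap(query: str, doc: str) -> int:
    """Count shared lowercase terms -- a crude stand-in for lexical matching."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

query = "myocardial infarction therapy"
docs = [
    "heart attack treatment options",      # same meaning, no shared terms
    "myocardial infarction risk factors",  # shared terms, different intent
]

overlaps = [term_overlap(query, d) for d in docs]
print(overlaps)  # the semantically closest doc scores 0 on term overlap
```

A semantic search system embeds both phrases into nearby vectors, so the first document would rank highly despite the zero lexical overlap.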
Hybrid search always wins
Production search stacks — Elasticsearch, Weaviate, pgvector-based systems, and web-scale engines like Google — consistently find that hybrid search (BM25 score + semantic similarity score, combined with a cross-encoder re-ranker) outperforms either alone. The intuition: BM25 catches exact technical terms that embeddings blur; semantic search catches synonyms and intent that BM25 misses. Use Reciprocal Rank Fusion (RRF) to merge the two ranked lists.
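RRF itself is a few lines. A sketch, assuming 1-based ranks and the conventional k=60 constant; `reciprocal_rank_fusion` and the toy doc ids are illustrative names:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank(d)).
    Ranks are 1-based; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]       # lexical ranking
semantic_top = ["d1", "d9", "d3"]   # embedding ranking
print(reciprocal_rank_fusion([bm25_top, semantic_top]))
```

Note that a document ranked near the top of both lists ("d1" here) beats a document that tops only one list — RRF rewards agreement without needing to calibrate the two scoring scales against each other.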
Bi-encoder vs cross-encoder architecture
The core tension in semantic search: accuracy vs speed. Bi-encoders are fast but less accurate; cross-encoders are slow but very accurate. The standard solution: use both in a two-stage pipeline.
| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| How it works | Encode query and doc separately → fixed vectors → cosine similarity | Concatenate query + doc → run full model → single relevance score |
| Token interaction | None — query and doc never attend to each other | Full cross-attention between every query and doc token |
| Indexing | Pre-compute doc embeddings offline; query embedding at runtime | Cannot pre-compute — must run at query time with each doc |
| Latency | ~1–10ms for ANN search over millions of docs | ~100–500ms per query for top-100 docs |
| Quality | Good — limited by single-vector bottleneck | Best — tokens attend directly to each other |
| Scale | Billions of docs — ANN index handles it | Only feasible for small candidate sets (~100–500 docs) |
| Example models | E5, BGE, Voyage, text-embedding-3-large | MS-MARCO cross-encoder, Cohere Rerank, bge-reranker |
The retrieve-then-rerank pipeline
Industry standard: (1) Bi-encoder retrieves top-100 candidates from millions of docs in ~10ms. (2) Cross-encoder re-ranks those 100 to produce final top-10 in ~200ms. Total latency ~210ms — the quality of a cross-encoder at the scale of a bi-encoder. Cohere Rerank, Jina Reranker, and BGE Reranker are popular choices for the reranking step.
Two-stage retrieval pipeline with SentenceTransformers
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder retrieval
bi_encoder = SentenceTransformer('intfloat/e5-large-v2')
docs = ["heart attack symptoms", "myocardial infarction treatment", "chest pain causes"]  # ... plus the rest of the corpus

# Pre-compute document embeddings (done once, cached)
doc_embeddings = bi_encoder.encode(["passage: " + d for d in docs], normalize_embeddings=True)

query = "query: what causes heart attacks"
query_emb = bi_encoder.encode([query], normalize_embeddings=True)

# Retrieve top-100 by cosine similarity
scores = (query_emb @ doc_embeddings.T)[0]
top100_indices = np.argsort(scores)[::-1][:100]
top100_docs = [docs[i] for i in top100_indices]

# Stage 2: Cross-encoder reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
pairs = [("what causes heart attacks", doc) for doc in top100_docs]
rerank_scores = cross_encoder.predict(pairs)

# Final top-10
top10 = sorted(zip(rerank_scores, top100_docs), key=lambda x: x[0], reverse=True)[:10]
for score, doc in top10:
    print(f"{score:.3f} {doc}")
```
Dense Passage Retrieval (DPR)
DPR (Karpukhin et al., Facebook AI, 2020) was the breakthrough paper showing learned dense embeddings can outperform BM25 for open-domain question answering — launching the modern semantic search era.
DPR contrastive loss: maximize similarity between (query, positive passage) pairs while minimizing similarity to negative passages. sim(q, p) = dot product of BERT encodings. In-batch negatives: other passages in the batch serve as easy negatives; BM25 hard negatives added for harder training signal.
| Aspect | Detail |
|---|---|
| Architecture | Two independent BERT encoders — one for queries, one for passages |
| Training data | Natural Questions + TriviaQA — Wikipedia passages, annotated with gold passages |
| Negative mining | In-batch negatives + BM25 hard negatives (passages that contain query terms but aren't the answer) |
| Result on NQ | Top-20 retrieval accuracy: DPR 79.4% vs BM25 59.1% — +20 points |
| Legacy | Training paradigm (contrastive, hard negatives) adopted by E5, BGE, GTE, Voyage, text-embedding-3 |
Hard negatives are the key
The biggest DPR insight: easy negatives (random passages) make the model lazy — it only needs to avoid obviously unrelated content. Hard negatives (passages that look relevant but aren't) force the encoder to develop a precise semantic understanding. Modern embedding models (E5, BGE, GTE) spend significant effort on hard negative mining strategies — using BM25, a weaker model, or mined adversarial examples to generate challenging training pairs.
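The training objective above can be sketched in NumPy. This is a toy stand-in — random vectors replace the BERT encoders, and `in_batch_contrastive_loss` is an illustrative name — but the structure (softmax over the positive against in-batch passages plus appended hard negatives) mirrors the DPR loss:

```python
import numpy as np

def in_batch_contrastive_loss(q, p):
    """DPR-style loss: each query's positive is its own passage; every other
    passage in the batch (including appended hard negatives) is a negative.
    q: (B, d) query embeddings; p: (B + H, d) passage embeddings, where the
    first B rows are the positives and the remaining H rows are hard negatives."""
    sim = q @ p.T                                   # (B, B + H) dot-product scores
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    B = q.shape[0]
    return -log_softmax[np.arange(B), np.arange(B)].mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                        # stand-in for query encoder output
p = q + 0.1 * rng.normal(size=(4, 8))              # positives near their queries
hard = rng.normal(size=(2, 8))                     # e.g. BM25-mined hard negatives
loss = in_batch_contrastive_loss(q, np.vstack([p, hard]))
print(round(float(loss), 4))
```

Appending harder negatives raises the loss until the encoder learns to separate them — which is exactly the pressure that forces precise semantic distinctions.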
Neural semantic search in production
A production semantic search system has three phases: offline indexing, online retrieval, and optional reranking. Each phase has distinct engineering tradeoffs.
End-to-end semantic search with pgvector (PostgreSQL)
```python
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

# ── INDEXING PHASE (run once / on updates) ─────────────────────────
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teach psycopg2 to adapt numpy arrays <-> pgvector
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id SERIAL PRIMARY KEY, content TEXT, embedding vector(1024))"
)
cur.execute(
    "CREATE INDEX IF NOT EXISTS doc_idx ON documents "
    "USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64)"
)
docs = ["Introduction to neural networks...", "Attention mechanism explained..."]
embeddings = model.encode(["passage: " + d for d in docs], normalize_embeddings=True)
for doc, emb in zip(docs, embeddings):
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                (doc, emb))
conn.commit()

# ── RETRIEVAL PHASE (per query) ─────────────────────────────────────
query = "query: how does self-attention work"
q_emb = model.encode([query], normalize_embeddings=True)[0]
cur.execute(
    "SELECT content, 1 - (embedding <=> %s) AS sim "
    "FROM documents ORDER BY embedding <=> %s LIMIT 10",
    (q_emb, q_emb)
)
for text, sim in cur.fetchall():
    print(f"{sim:.3f} {text[:80]}")
```

| Vector DB | Best for | Hosting | ANN algorithm |
|---|---|---|---|
| pgvector | Existing PostgreSQL infra; moderate scale (<50M vecs) | Self-hosted or Supabase | IVFFlat or HNSW |
| Pinecone | Managed, zero ops, horizontal scale | Fully managed SaaS | Proprietary (HNSW-based) |
| Weaviate | Hybrid search built-in; multi-modal | Self-hosted or managed | HNSW |
| Qdrant | High performance, Rust core, filtering-heavy workloads | Self-hosted or managed | HNSW |
| Chroma | Local dev, prototyping, embedded use | Self-hosted (embedded lib) | HNSW (via hnswlib) |
Chunking strategy matters most
The single biggest factor in semantic search quality is not the model — it's how you chunk documents. Too large: retrieved chunks contain mostly irrelevant content. Too small: a single concept is split across chunks, losing context. Rule of thumb: 256–512 tokens with 50-token overlap for general text. Use semantic chunking (split at paragraph/section boundaries) when document structure is available.
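A minimal sliding-window chunker illustrating the rule of thumb above. Whitespace tokens stand in for model tokens here; a real pipeline would count tokens with the embedding model's own tokenizer so chunk sizes match its context limits:

```python
def chunk_tokens(text, chunk_size=256, overlap=50):
    """Split text into overlapping windows of roughly chunk_size tokens.
    Consecutive windows share `overlap` tokens so a concept straddling a
    boundary still appears whole in at least one chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(600))
chunks = chunk_tokens(doc)
print([len(c.split()) for c in chunks])  # window sizes; last window is the remainder
```

For semantic chunking, replace the fixed-size window with splits at paragraph or heading boundaries and fall back to this window only for oversized sections.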
ColBERT: late interaction for efficiency
ColBERT (Khattab & Zaharia, Stanford, 2020) bridges the gap between bi-encoders (fast, less accurate) and cross-encoders (slow, most accurate) via late interaction — token-level similarity without full cross-attention.
ColBERT MaxSim scoring: for each query token embedding, find the most similar document token embedding (MaxSim). Sum these per-query-token max scores to get the final relevance score. Documents are pre-encoded to token vectors and stored in compressed form.
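MaxSim is a one-liner over token-embedding matrices. A NumPy sketch, with random unit vectors standing in for ColBERT's learned per-token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query token vector, take the max
    cosine similarity over all document token vectors, then sum these maxima.
    Both inputs are (num_tokens, dim) arrays of L2-normalised embeddings."""
    sim = query_tokens @ doc_tokens.T        # (Q, D) token-level similarities
    return sim.max(axis=1).sum()             # best doc-token match per query token

def normed(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = normed(rng.normal(size=(5, 16)))         # 5 query token vectors
relevant = normed(np.vstack([q + 0.05 * rng.normal(size=(5, 16)),   # query-like tokens
                             rng.normal(size=(20, 16))]))           # plus filler tokens
random_doc = normed(rng.normal(size=(25, 16)))
print(maxsim_score(q, relevant), maxsim_score(q, random_doc))
```

Because each query token is matched independently, a document only needs to contain good matches for the query's tokens somewhere — full cross-attention is never computed, which is why doc token vectors can be pre-indexed offline.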
| Architecture | Storage | Query latency | Quality | Scale |
|---|---|---|---|---|
| Bi-encoder (single vector) | 1 vector per doc | <5ms ANN | Good | Billions of docs |
| ColBERT (late interaction) | N token vectors per doc (~128 tokens) | 20–50ms | Near cross-encoder | 100M+ docs with PLAID |
| Cross-encoder (full attention) | None (compute at query time) | 100–500ms for top-100 | Best | Only reranking, not retrieval |
ColBERT v2 + PLAID
ColBERT v2 (2022) added residual compression — reducing storage by 6–10× with minimal quality loss. PLAID (2022) adds a fast candidate generation phase before full ColBERT scoring, enabling sub-100ms retrieval over 100M+ documents. RAGatouille (Python library) makes ColBERT v2 accessible without custom infrastructure — one-line indexing and search.
Practice questions
- What is the difference between semantic search and lexical search in terms of query handling? (Answer: Lexical search (BM25): tokenises query and document, computes term frequency and inverse document frequency scores. Handles: exact term matching, rare technical terms, product codes, proper nouns. Fails on: paraphrase, synonym, cross-language queries. Semantic search (bi-encoder): converts query and document to dense embeddings, retrieves by vector similarity. Handles: paraphrase ('cheap car' matches 'affordable vehicle'), intent ('how to fix' matches troubleshooting docs). Fails on: exact code/ID matching, very rare domain terms not in training data.)
- What is a bi-encoder vs cross-encoder for semantic search and when do you use each? (Answer: Bi-encoder: encode query and document INDEPENDENTLY → embeddings stored offline. Query at search time: encode query, ANN search. Very fast (O(1) per query). Good recall but not optimal precision. Cross-encoder: jointly encode (query, document) pair → single relevance score. Much more accurate (attends across both). Cannot pre-encode — must run for every (query, document) pair at query time. Too slow for first-stage retrieval. Architecture: bi-encoder for retrieval (recall), cross-encoder for re-ranking top-k results (precision).)
- What is BEIR (Benchmarking IR) and what has it revealed about semantic search generalisation? (Answer: BEIR (Thakur et al. 2021): 18 diverse information retrieval benchmarks (MSMARCO, TREC-COVID, NQ, ArguAna, etc.) covering different domains and query types. Key finding: models fine-tuned on one retrieval dataset (MSMARCO) significantly underperform on other domains — generalisation is poor. Dense retrieval often underperforms BM25 on out-of-domain data. Hybrid BM25 + dense retrieval consistently outperforms either alone. Conclusion: domain-specific fine-tuning or robust generalisation training is essential for production semantic search.)
- What is late interaction semantic search (ColBERT) and when is it preferred? (Answer: ColBERT: encode query and document separately into per-token embeddings (not pooled sentence vectors). Scoring: sum of max cosine similarities between each query token and its best-matching document token (MaxSim). More accurate than bi-encoder (richer interaction) while allowing offline document indexing (unlike cross-encoder). Storage cost: all document token embeddings stored (much larger than bi-encoder). Use when: bi-encoder recall is insufficient, cross-encoder is too slow, and storage cost is acceptable.)
- What is approximate nearest neighbour (ANN) indexing and which index types does FAISS offer? (Answer: ANN trades exact nearest-neighbour accuracy for massive speed gains. FAISS (Facebook AI Similarity Search) supports multiple index types: IndexFlatL2/IP (exact brute force — the baseline to benchmark against). IndexIVFFlat: inverted file index with cluster-based search — 10–100× faster, ~95% recall with nprobe=64. IndexHNSW: graph-based, excellent recall and speed balance, memory-intensive. IndexIVFPQ: inverted file + product quantisation for memory compression. Production recommendation: IVF_HNSW for high-recall/low-latency; IVFPQ for memory-constrained deployments.)
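The IVF idea behind inverted-file indexes can be sketched without FAISS: cluster the corpus, then search only the nprobe clusters nearest the query. A toy NumPy version — randomly sampled "centroids" stand in for trained k-means, and all function names are illustrative:

```python
import numpy as np

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid -> inverted lists.
    Assumes L2-normalised vectors so dot product equals cosine similarity."""
    assign = np.argmax(vectors @ centroids.T, axis=1)
    return {c: np.where(assign == c)[0] for c in range(len(centroids))}

def ivf_search(query, vectors, centroids, lists, nprobe=2, k=3):
    """Score only the vectors in the nprobe clusters closest to the query,
    instead of scanning the whole corpus -- the ANN speed/recall tradeoff."""
    probe = np.argsort(query @ centroids.T)[::-1][:nprobe]
    cand = np.concatenate([lists[int(c)] for c in probe])
    scores = vectors[cand] @ query
    return cand[np.argsort(scores)[::-1][:k]]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 32))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
centroids = vecs[rng.choice(1000, 16, replace=False)]   # crude stand-in for k-means
lists = build_ivf(vecs, centroids)
q = vecs[0]                                             # query = an indexed vector
print(ivf_search(q, vecs, centroids, lists))
```

With nprobe=2 of 16 clusters, only ~1/8 of the corpus is scored per query; raising nprobe trades latency back for recall, which is exactly the knob FAISS exposes on IndexIVFFlat.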