Semantic search uses AI embeddings to find information based on meaning and intent rather than exact keyword matches. A semantic search system understands that 'heart attack treatment' and 'myocardial infarction therapy' are asking the same thing, and can return relevant results even when none of the query words appear in the document.
Keyword search vs semantic search
The fundamental difference: keyword search finds documents that contain your words; semantic search finds documents that mean what you mean.
| Dimension | Keyword / Lexical (BM25) | Semantic (Embedding-based) |
|---|---|---|
| How it works | Term frequency × inverse document frequency scoring | Encode query + docs to vectors; cosine similarity |
| Handles synonyms? | No — "heart attack" misses "myocardial infarction" | Yes — same region in embedding space |
| Handles paraphrase? | No — different words = no match | Yes — meaning preserved in embedding |
| Handles typos? | Partially (fuzzy matching add-ons) | Yes — nearby spelling = similar embedding |
| Speed | Very fast — inverted index lookup | Fast — ANN search, ~1–10ms |
| Interpretable? | Yes — exact term matches visible | Less so — similarity score only |
| When it fails | Vocabulary mismatch, paraphrase, concept queries | Very specific technical terms, very short docs |
| Best for | Exact product names, codes, legal terms | General Q&A, intent search, FAQ matching |
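The vocabulary-mismatch failure mode is easy to see concretely. A minimal sketch — the `term_overlap` helper and the example strings are illustrative, not from any library:

```python
# Toy illustration of the vocabulary-mismatch failure mode of keyword search.
# Lexical overlap is zero even though query and document mean the same thing.

def term_overlap(query: str, doc: str) -> int:
    """Count shared lowercase terms -- a crude stand-in for lexical matching."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

query = "myocardial infarction therapy"
docs = [
    "heart attack treatment options",      # same meaning, no shared terms
    "myocardial infarction risk factors",  # shared terms, different intent
]

overlaps = [term_overlap(query, d) for d in docs]
print(overlaps)  # the semantically closest doc scores 0 on term overlap
```

A semantic search system embeds both phrases into nearby vectors, so the first document would rank highly despite the zero lexical overlap.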
Hybrid search always wins
Production search stacks — Elasticsearch, Weaviate, pgvector-based systems, and web-scale engines like Google — consistently find that hybrid search (BM25 score + semantic similarity score, combined with a cross-encoder re-ranker) outperforms either alone. The intuition: BM25 catches exact technical terms that embeddings blur; semantic search catches synonyms and intent that BM25 misses. Use Reciprocal Rank Fusion (RRF) to merge the two ranked lists.
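RRF itself is a few lines. A sketch, assuming 1-based ranks and the conventional k=60 constant; `reciprocal_rank_fusion` and the toy doc ids are illustrative names:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids: score(d) = sum over lists of 1/(k + rank(d)).
    Ranks are 1-based; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]       # lexical ranking
semantic_top = ["d1", "d9", "d3"]   # embedding ranking
print(reciprocal_rank_fusion([bm25_top, semantic_top]))
```

Note that a document ranked near the top of both lists ("d1" here) beats a document that tops only one list — RRF rewards agreement without needing to calibrate the two scoring scales against each other.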
Bi-encoder vs cross-encoder architecture
The core tension in semantic search: accuracy vs speed. Bi-encoders are fast but less accurate; cross-encoders are slow but very accurate. The standard solution: use both in a two-stage pipeline.
| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| How it works | Encode query and doc separately → fixed vectors → cosine similarity | Concatenate query + doc → run full model → single relevance score |
| Token interaction | None — query and doc never attend to each other | Full cross-attention between every query and doc token |
| Indexing | Pre-compute doc embeddings offline; query embedding at runtime | Cannot pre-compute — must run at query time with each doc |
| Latency | ~1–10ms for ANN search over millions of docs | ~100–500ms per query for top-100 docs |
| Quality | Good — limited by single-vector bottleneck | Best — tokens attend directly to each other |
| Scale | Billions of docs — ANN index handles it | Only feasible for small candidate sets (~100–500 docs) |
| Example models | E5, BGE, Voyage, text-embedding-3-large | MS-MARCO cross-encoder, Cohere Rerank, bge-reranker |
The retrieve-then-rerank pipeline
Industry standard: (1) Bi-encoder retrieves top-100 candidates from millions of docs in ~10ms. (2) Cross-encoder re-ranks those 100 to produce final top-10 in ~200ms. Total latency ~210ms — the quality of a cross-encoder at the scale of a bi-encoder. Cohere Rerank, Jina Reranker, and BGE Reranker are popular choices for the reranking step.
Two-stage retrieval pipeline with SentenceTransformers
```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder retrieval
bi_encoder = SentenceTransformer('intfloat/e5-large-v2')
docs = ["heart attack symptoms", "myocardial infarction treatment", "chest pain causes"]  # ... plus the rest of the corpus

# Pre-compute document embeddings (done once, cached)
doc_embeddings = bi_encoder.encode(["passage: " + d for d in docs], normalize_embeddings=True)

query = "query: what causes heart attacks"
query_emb = bi_encoder.encode([query], normalize_embeddings=True)

# Retrieve top-100 by cosine similarity
scores = (query_emb @ doc_embeddings.T)[0]
top100_indices = np.argsort(scores)[::-1][:100]
top100_docs = [docs[i] for i in top100_indices]

# Stage 2: Cross-encoder reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
pairs = [("what causes heart attacks", doc) for doc in top100_docs]
rerank_scores = cross_encoder.predict(pairs)

# Final top-10
top10 = sorted(zip(rerank_scores, top100_docs), key=lambda x: x[0], reverse=True)[:10]
for score, doc in top10:
    print(f"{score:.3f} {doc}")
```
Dense Passage Retrieval (DPR)
DPR (Karpukhin et al., Facebook AI, 2020) was the breakthrough paper showing learned dense embeddings can outperform BM25 for open-domain question answering — launching the modern semantic search era.
DPR contrastive loss: maximize similarity between (query, positive passage) pairs while minimizing similarity to negative passages. sim(q, p) = dot product of BERT encodings. In-batch negatives: other passages in the batch serve as easy negatives; BM25 hard negatives added for harder training signal.
| Aspect | Detail |
|---|---|
| Architecture | Two independent BERT encoders — one for queries, one for passages |
| Training data | Natural Questions + TriviaQA — Wikipedia passages, annotated with gold passages |
| Negative mining | In-batch negatives + BM25 hard negatives (passages that contain query terms but aren't the answer) |
| Result on NQ | Top-20 retrieval accuracy: DPR 79.4% vs BM25 59.1% — +20 points |
| Legacy | Training paradigm (contrastive, hard negatives) adopted by E5, BGE, GTE, Voyage, text-embedding-3 |
Hard negatives are the key
The biggest DPR insight: easy negatives (random passages) make the model lazy — it only needs to avoid obviously unrelated content. Hard negatives (passages that look relevant but aren't) force the encoder to develop a precise semantic understanding. Modern embedding models (E5, BGE, GTE) spend significant effort on hard negative mining strategies — using BM25, a weaker model, or mined adversarial examples to generate challenging training pairs.
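The training objective above can be sketched in NumPy. This is a toy stand-in — random vectors replace the BERT encoders, and `in_batch_contrastive_loss` is an illustrative name — but the structure (softmax over the positive against in-batch passages plus appended hard negatives) mirrors the DPR loss:

```python
import numpy as np

def in_batch_contrastive_loss(q, p):
    """DPR-style loss: each query's positive is its own passage; every other
    passage in the batch (including appended hard negatives) is a negative.
    q: (B, d) query embeddings; p: (B + H, d) passage embeddings, where the
    first B rows are the positives and the remaining H rows are hard negatives."""
    sim = q @ p.T                                   # (B, B + H) dot-product scores
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    B = q.shape[0]
    return -log_softmax[np.arange(B), np.arange(B)].mean()

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                        # stand-in for query encoder output
p = q + 0.1 * rng.normal(size=(4, 8))              # positives near their queries
hard = rng.normal(size=(2, 8))                     # e.g. BM25-mined hard negatives
loss = in_batch_contrastive_loss(q, np.vstack([p, hard]))
print(round(float(loss), 4))
```

Appending harder negatives raises the loss until the encoder learns to separate them — which is exactly the pressure that forces precise semantic distinctions.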
Neural semantic search in production
A production semantic search system has three phases: offline indexing, online retrieval, and optional reranking. Each phase has distinct engineering tradeoffs.
End-to-end semantic search with pgvector (PostgreSQL)
```python
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/e5-large-v2')

# ── INDEXING PHASE (run once / on updates) ─────────────────────────
conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teach psycopg2 to adapt numpy arrays <-> pgvector
cur.execute(
    "CREATE TABLE IF NOT EXISTS documents "
    "(id SERIAL PRIMARY KEY, content TEXT, embedding vector(1024))"
)
cur.execute(
    "CREATE INDEX IF NOT EXISTS doc_idx ON documents "
    "USING hnsw (embedding vector_cosine_ops) WITH (m=16, ef_construction=64)"
)
docs = ["Introduction to neural networks...", "Attention mechanism explained..."]
embeddings = model.encode(["passage: " + d for d in docs], normalize_embeddings=True)
for doc, emb in zip(docs, embeddings):
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                (doc, emb))
conn.commit()

# ── RETRIEVAL PHASE (per query) ─────────────────────────────────────
query = "query: how does self-attention work"
q_emb = model.encode([query], normalize_embeddings=True)[0]
cur.execute(
    "SELECT content, 1 - (embedding <=> %s) AS sim "
    "FROM documents ORDER BY embedding <=> %s LIMIT 10",
    (q_emb, q_emb)
)
for text, sim in cur.fetchall():
    print(f"{sim:.3f} {text[:80]}")
```

| Vector DB | Best for | Hosting | ANN algorithm |
|---|---|---|---|
| pgvector | Existing PostgreSQL infra; moderate scale (<50M vecs) | Self-hosted or Supabase | IVFFlat or HNSW |
| Pinecone | Managed, zero ops, horizontal scale | Fully managed SaaS | Proprietary (HNSW-based) |
| Weaviate | Hybrid search built-in; multi-modal | Self-hosted or managed | HNSW |
| Qdrant | High performance, Rust core, filtering-heavy workloads | Self-hosted or managed | HNSW |
| Chroma | Local dev, prototyping, embedded use | Self-hosted (embedded lib) | HNSW (via hnswlib) |
Chunking strategy matters most
The single biggest factor in semantic search quality is not the model — it's how you chunk documents. Too large: retrieved chunks contain mostly irrelevant content. Too small: a single concept is split across chunks, losing context. Rule of thumb: 256–512 tokens with 50-token overlap for general text. Use semantic chunking (split at paragraph/section boundaries) when document structure is available.
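A minimal sliding-window chunker illustrating the rule of thumb above. Whitespace tokens stand in for model tokens here; a real pipeline would count tokens with the embedding model's own tokenizer so chunk sizes match its context limits:

```python
def chunk_tokens(text, chunk_size=256, overlap=50):
    """Split text into overlapping windows of roughly chunk_size tokens.
    Consecutive windows share `overlap` tokens so a concept straddling a
    boundary still appears whole in at least one chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(600))
chunks = chunk_tokens(doc)
print([len(c.split()) for c in chunks])  # window sizes; last window is the remainder
```

For semantic chunking, replace the fixed-size window with splits at paragraph or heading boundaries and fall back to this window only for oversized sections.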
ColBERT: late interaction for efficiency
ColBERT (Khattab & Zaharia, Stanford, 2020) bridges the gap between bi-encoders (fast, less accurate) and cross-encoders (slow, most accurate) via late interaction — token-level similarity without full cross-attention.
ColBERT MaxSim scoring: for each query token embedding, find the most similar document token embedding (MaxSim). Sum these per-query-token max scores to get the final relevance score. Documents are pre-encoded to token vectors and stored in compressed form.
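MaxSim is a one-liner over token-embedding matrices. A NumPy sketch, with random unit vectors standing in for ColBERT's learned per-token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query token vector, take the max
    cosine similarity over all document token vectors, then sum these maxima.
    Both inputs are (num_tokens, dim) arrays of L2-normalised embeddings."""
    sim = query_tokens @ doc_tokens.T        # (Q, D) token-level similarities
    return sim.max(axis=1).sum()             # best doc-token match per query token

def normed(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = normed(rng.normal(size=(5, 16)))         # 5 query token vectors
relevant = normed(np.vstack([q + 0.05 * rng.normal(size=(5, 16)),   # query-like tokens
                             rng.normal(size=(20, 16))]))           # plus filler tokens
random_doc = normed(rng.normal(size=(25, 16)))
print(maxsim_score(q, relevant), maxsim_score(q, random_doc))
```

Because each query token is matched independently, a document only needs to contain good matches for the query's tokens somewhere — full cross-attention is never computed, which is why doc token vectors can be pre-indexed offline.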
| Architecture | Storage | Query latency | Quality | Scale |
|---|---|---|---|---|
| Bi-encoder (single vector) | 1 vector per doc | <5ms ANN | Good | Billions of docs |
| ColBERT (late interaction) | N token vectors per doc (~128 tokens) | 20–50ms | Near cross-encoder | 100M+ docs with PLAID |
| Cross-encoder (full attention) | None (compute at query time) | 100–500ms for top-100 | Best | Only reranking, not retrieval |
ColBERT v2 + PLAID
ColBERT v2 (2022) added residual compression — reducing storage by 6–10× with minimal quality loss. PLAID (2022) adds a fast candidate generation phase before full ColBERT scoring, enabling sub-100ms retrieval over 100M+ documents. RAGatouille (Python library) makes ColBERT v2 accessible without custom infrastructure — one-line indexing and search.
Practice questions
- What is the difference between semantic search and lexical search in terms of query handling? (Answer: Lexical search (BM25): tokenises query and document, computes term frequency and inverse document frequency scores. Handles: exact term matching, rare technical terms, product codes, proper nouns. Fails on: paraphrase, synonym, cross-language queries. Semantic search (bi-encoder): converts query and document to dense embeddings, retrieves by vector similarity. Handles: paraphrase ('cheap car' matches 'affordable vehicle'), intent ('how to fix' matches troubleshooting docs). Fails on: exact code/ID matching, very rare domain terms not in training data.)
- What is a bi-encoder vs cross-encoder for semantic search and when do you use each? (Answer: Bi-encoder: encode query and document INDEPENDENTLY → embeddings stored offline. Query at search time: encode query, ANN search. Very fast (O(1) per query). Good recall but not optimal precision. Cross-encoder: jointly encode (query, document) pair → single relevance score. Much more accurate (attends across both). Cannot pre-encode — must run for every (query, document) pair at query time. Too slow for first-stage retrieval. Architecture: bi-encoder for retrieval (recall), cross-encoder for re-ranking top-k results (precision).)
- What is BEIR (Benchmarking IR) and what has it revealed about semantic search generalisation? (Answer: BEIR (Thakur et al. 2021): 18 diverse information retrieval benchmarks (MSMARCO, TREC-COVID, NQ, ArguAna, etc.) covering different domains and query types. Key finding: models fine-tuned on one retrieval dataset (MSMARCO) significantly underperform on other domains — generalisation is poor. Dense retrieval often underperforms BM25 on out-of-domain data. Hybrid BM25 + dense retrieval consistently outperforms either alone. Conclusion: domain-specific fine-tuning or robust generalisation training is essential for production semantic search.)
- What is late interaction semantic search (ColBERT) and when is it preferred? (Answer: ColBERT: encode query and document separately into per-token embeddings (not pooled sentence vectors). Scoring: sum of max cosine similarities between each query token and its best-matching document token (MaxSim). More accurate than bi-encoder (richer interaction) while allowing offline document indexing (unlike cross-encoder). Storage cost: all document token embeddings stored (much larger than bi-encoder). Use when: bi-encoder recall is insufficient, cross-encoder is too slow, and storage cost is acceptable.)
- What is approximate nearest neighbour (ANN) indexing and which index types does FAISS offer? (Answer: ANN trades exact nearest-neighbour accuracy for massive speed gains. FAISS (Facebook AI Similarity Search) supports multiple index types: IndexFlatL2/IP (exact brute force — the baseline to benchmark against). IndexIVFFlat: inverted file index with cluster-based search — 10–100× faster, ~95% recall with nprobe=64. IndexHNSW: graph-based, excellent recall and speed balance, memory-intensive. IndexIVFPQ: inverted file + product quantisation for memory compression. Production recommendation: IVF_HNSW for high-recall/low-latency; IVFPQ for memory-constrained deployments.)
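The IVF idea behind inverted-file indexes can be sketched without FAISS: cluster the corpus, then search only the nprobe clusters nearest the query. A toy NumPy version — randomly sampled "centroids" stand in for trained k-means, and all function names are illustrative:

```python
import numpy as np

def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid -> inverted lists.
    Assumes L2-normalised vectors so dot product equals cosine similarity."""
    assign = np.argmax(vectors @ centroids.T, axis=1)
    return {c: np.where(assign == c)[0] for c in range(len(centroids))}

def ivf_search(query, vectors, centroids, lists, nprobe=2, k=3):
    """Score only the vectors in the nprobe clusters closest to the query,
    instead of scanning the whole corpus -- the ANN speed/recall tradeoff."""
    probe = np.argsort(query @ centroids.T)[::-1][:nprobe]
    cand = np.concatenate([lists[int(c)] for c in probe])
    scores = vectors[cand] @ query
    return cand[np.argsort(scores)[::-1][:k]]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 32))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
centroids = vecs[rng.choice(1000, 16, replace=False)]   # crude stand-in for k-means
lists = build_ivf(vecs, centroids)
q = vecs[0]                                             # query = an indexed vector
print(ivf_search(q, vecs, centroids, lists))
```

With nprobe=2 of 16 clusters, only ~1/8 of the corpus is scored per query; raising nprobe trades latency back for recall, which is exactly the knob FAISS exposes on IndexIVFFlat.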