
RAG (Retrieval-Augmented Generation)

How AI answers questions from your own documents.


Definition

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances a language model's responses by first retrieving relevant information from a knowledge base, then generating a response grounded in that retrieved content. RAG mitigates LLMs' knowledge-cutoff and hallucination problems by giving models access to specific, up-to-date, or proprietary information at inference time.

The problem RAG solves

Standard LLMs answer from their parametric memory (weights trained on a fixed corpus with a cutoff date). They can't know about events after training, have no access to your private documents, and may confuse or misremember specific details.

Analogy

RAG is like allowing a student to reference their textbook during an exam, rather than relying purely on memory. The model still does the reasoning — but it's grounded in retrieved evidence, not confabulation.

  • Knowledge cutoff: Base LLMs don't know events after their training cutoff.
  • Private data: Your company's internal documents, your personal PDFs — none of this is in the training data.
  • Hallucination: Without source grounding, models generate plausible-sounding but incorrect specific facts.
  • Attribution: RAG lets the model cite exact sources (page numbers, documents).

The full RAG pipeline

RAG has two phases: offline indexing (one-time) and online query (per request):

Complete RAG pipeline — indexing phase

from openai import OpenAI
import numpy as np
from typing import List, Dict

client = OpenAI()

# ══════════════════════════════════════════════════════════
#  PHASE 1: INDEXING (runs once when you upload a document)
# ══════════════════════════════════════════════════════════

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping chunks of roughly chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    step = chunk_size - overlap

    for i in range(0, len(words), step):
        chunk = " ".join(words[i : i + chunk_size])
        if chunk:
            chunks.append(chunk)

    return chunks


def embed_texts(texts: List[str]) -> np.ndarray:
    """Embed a list of texts using OpenAI's embedding model."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [r.embedding for r in response.data]
    return np.array(embeddings)


# Your document (in practice, extracted from PDF)
document = """
Photosynthesis is the process by which plants convert sunlight into glucose.
It occurs in two stages: the light-dependent reactions in the thylakoids,
and the Calvin cycle in the stroma of the chloroplast.
The overall equation is: 6CO2 + 6H2O + light → C6H12O6 + 6O2.
...
"""

# Split into chunks and embed
chunks = chunk_text(document)
chunk_embeddings = embed_texts(chunks)     # shape: (n_chunks, 1536)

# Normalize for fast cosine similarity
norms = np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
chunk_embeddings_norm = chunk_embeddings / norms

# Store in "vector database" (in-memory here; use pgvector/Pinecone in production)
index = {"chunks": chunks, "embeddings": chunk_embeddings_norm}

print(f"Indexed {len(chunks)} chunks")

Complete RAG pipeline — query phase

# ══════════════════════════════════════════════════════════
#  PHASE 2: QUERY (runs on every user question)
# ══════════════════════════════════════════════════════════

def retrieve(query: str, index: Dict, top_k: int = 3) -> List[str]:
    """Find the most relevant chunks for a query."""
    query_emb = embed_texts([query])[0]
    query_emb = query_emb / np.linalg.norm(query_emb)     # normalize

    # Cosine similarity = dot product with normalized vectors
    scores = index["embeddings"] @ query_emb               # (n_chunks,)
    top_indices = np.argsort(scores)[::-1][:top_k]

    return [index["chunks"][i] for i in top_indices]


def rag_answer(question: str, index: Dict) -> str:
    """Retrieve relevant context, then generate a grounded answer."""

    # Step 1: Retrieve
    context_chunks = retrieve(question, index, top_k=3)
    context = "\n\n".join(f"[Chunk {i+1}]: {c}" for i, c in enumerate(context_chunks))

    # Step 2: Generate (grounded answer)
    system_prompt = """You are a precise AI assistant.
Answer ONLY from the provided context below.
If the answer is not in the context, say "I don't have that information in the provided documents."
Always cite which chunk your answer comes from."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0    # deterministic for factual Q&A
    )

    return response.choices[0].message.content


answer = rag_answer("What is the equation for photosynthesis?", index)
print(answer)
# "According to Chunk 1, the equation for photosynthesis is:
#  6CO2 + 6H2O + light → C6H12O6 + 6O2"

Naive RAG vs Advanced RAG

Basic RAG (retrieve → generate) breaks down in real-world scenarios. Advanced RAG techniques address common failure modes:

  • Ambiguous query → Query rewriting / HyDE: rewrite the query, or generate a hypothetical answer and embed that instead, then retrieve.
  • Multi-hop questions → Iterative retrieval: retrieve, generate a sub-answer, then use the sub-answer to retrieve more context.
  • Irrelevant chunks retrieved → Re-ranking: use a cross-encoder to re-rank the top-k retrieved chunks by true relevance.
  • Missed keyword terms → Hybrid search: combine dense vector search with sparse BM25 keyword search.
  • Large documents → Hierarchical indexing: index summaries plus full chunks; search summaries first for efficiency.
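Hybrid search typically merges the sparse and dense result lists with Reciprocal Rank Fusion (RRF). A minimal sketch, assuming you already have two ranked lists of chunk IDs (the `bm25_ranking` and `dense_ranking` values below are made-up placeholders):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of chunk IDs into one ranking.

    Each chunk scores sum of 1 / (k + rank) across the lists it appears in
    (rank is 1-based). k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc3", "doc1", "doc7"]    # sparse keyword results
dense_ranking = ["doc1", "doc5", "doc3"]   # dense embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused[0])  # "doc1" — ranked highly by both retrievers
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.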

Why RAG dramatically reduces hallucination

The single most effective instruction for suppressing hallucination in RAG systems:

System prompt that grounds the model in retrieved evidence

SYSTEM: You are a precise study assistant. Answer questions based ONLY on 
the provided context passages below. 

Rules:
1. If the answer is directly stated in the context, cite it with [Chunk N].  
2. If the answer is implied but not directly stated, say "Based on the context..."
3. If the answer is NOT in the context, respond: "This isn't covered in the 
   provided material. Try asking a more specific question or uploading a 
   document that covers this topic."
4. NEVER use your general knowledge to fill gaps — only use the context.

This is critical for academic use: fabricating information could harm students.

RAG limitations

RAG is not foolproof. If the relevant chunk wasn't retrieved (retrieval failure), the model has no source to ground its answer and may hallucinate. This is why chunk size, overlap, and top-k tuning matter.
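One minimal way to catch retrieval failure during tuning is to measure recall@k against a small hand-labeled evaluation set. A sketch, where `retrieved` and `relevant` are hypothetical values for one test question:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the truly relevant chunks that appear in the top-k retrieved."""
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)


retrieved = [4, 9, 1, 7]   # retriever's ranked chunk IDs for one question
relevant = {1, 2}          # chunk IDs a human marked as relevant
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 — chunk 2 was never retrieved
```

Sweeping chunk size, overlap, and top-k while watching this number tells you whether answer-quality problems stem from retrieval or from generation.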

Practice questions

  1. What is the difference between naive RAG, advanced RAG, and modular RAG? (Answer: Naive RAG: query → embed query → retrieve top-k chunks → concatenate with prompt → generate. Simple but fragile. Advanced RAG: adds pre-retrieval (query rewriting, HyDE) and post-retrieval (re-ranking, compression) steps. Modular RAG: loosely coupled pipeline with interchangeable components — different retrievers, rankers, generators, memory modules can be swapped per use case. Modular RAG is the current production standard: enables A/B testing components, graceful degradation, and specialised modules for different query types.)
  2. What is HyDE (Hypothetical Document Embeddings) and when should you use it? (Answer: HyDE: instead of embedding the user query directly, prompt an LLM to generate a hypothetical answer to the query. Embed the hypothetical answer and use it as the search query. Rationale: the answer embedding is closer in embedding space to actual answer documents than the query embedding (questions and answers have different linguistic patterns). Use when: queries are short and ambiguous (the LLM expansion adds context), domain vocabulary differs between queries and documents, or retrieval quality is poor with direct query embedding.)
  3. What is the context window stuffing problem in RAG and how do re-ranking and compression address it? (Answer: Naive top-k retrieval may return: (1) Redundant chunks covering the same information. (2) Low-relevance chunks that merely contain query keywords. (3) More content than the context window can hold. Re-ranking (Cohere Rerank, BGE reranker): use a cross-encoder to score (query, chunk) relevance — more accurate than embedding similarity alone. Re-rank and select top-5 from top-50. Compression (LLMLingua, RECOMP): use an LLM to extract only the most relevant sentences from retrieved chunks — reducing token count by 2–5× before insertion.)
  4. What is the 'lost in the middle' problem for RAG and how do you mitigate it? (Answer: LLMs perform better when relevant context appears at the beginning or end of the prompt rather than the middle (Liu et al. 2023). For RAG with 10 retrieved chunks, the most relevant chunk should be first or last, not middle. Mitigation: (1) Re-rank by relevance and position most relevant chunk first. (2) Reverse order (most relevant last, just before the query). (3) Reduce number of retrieved chunks (fewer chunks = less middle). (4) Use models trained specifically for long-context RAG.)
  5. What is the difference between dense retrieval, sparse retrieval, and hybrid retrieval in RAG systems? (Answer: Sparse (BM25/TF-IDF): keyword matching, handles exact terms well, interpretable, fast. Fails on semantic synonyms. Dense (bi-encoder): embed query and documents, retrieve by cosine similarity. Handles semantic similarity but may miss exact matches. Hybrid (Reciprocal Rank Fusion): combine sparse and dense retrieval ranking lists. Example: BM25 rank + FAISS rank → RRF combined rank. Best of both: handles exact terms (BM25) AND semantic similarity (dense). Weaviate, Qdrant, OpenSearch all support hybrid search natively.)
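The 'lost in the middle' mitigation from question 4 can be sketched as a simple reordering, similar in spirit to LangChain's LongContextReorder. Assuming `chunks` is already sorted most-to-least relevant:

```python
def reorder_for_long_context(chunks):
    """Interleave chunks (sorted most-to-least relevant) so the strongest
    ones sit at the two ends of the prompt and the weakest land in the
    middle, where LLM attention is weakest."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["best", "second", "third", "fourth", "fifth"]
print(reorder_for_long_context(ranked))
# ['best', 'third', 'fifth', 'fourth', 'second']
```

The most relevant chunk ends up first and the second most relevant last, so both prime positions are used.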

On LumiChats

LumiChats Study Mode is built on a production RAG pipeline. Documents are chunked, embedded with text-embedding-3-large, and stored in pgvector. Every answer in Study Mode is retrieved from your specific document — cited by page number, never hallucinated from training data.

