
Conversational AI & Chatbots — Architecture and Design

Building AI systems that hold context across turns and respond naturally.


Definition

Conversational AI systems hold multi-turn dialogues with users, maintaining context across turns. Types: rule-based (pattern matching), retrieval-based (select best response from a database), generative (LLM generates novel responses). Modern production chatbots combine retrieval (RAG for factual grounding) with LLM generation. Key challenges: coherence across turns, grounding in facts, safety filtering, persona consistency, and latency. Building a conversational AI system requires NLU (understanding intent), dialogue management (tracking state), and NLG (generating responses).

Architecture types

Type | How it works | Pros | Cons | Examples
Rule-based | Pattern matching + decision trees | Predictable, auditable, fast | Cannot handle variations; brittle | Bank IVR, early Siri
Retrieval-based | Encode query, search response database | Factually safe, fast | Cannot generate novel responses | FAQ bots, customer support
Generative (LLM) | Transformer generates response from prompt | Flexible, natural, novel | Can hallucinate; expensive; slow | ChatGPT, Claude, Gemini
Hybrid (RAG) | Retrieve relevant docs + LLM generates with context | Factual + natural | Complex pipeline; latency | Enterprise chatbots, 2024 standard
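The retrieval-based row above can be sketched in a few lines: encode the user query, compare it against a database of canned question/answer pairs, and return the answer whose question matches best. This is a minimal sketch using bag-of-words cosine similarity; the FAQ entries are invented for illustration, and a real system would use learned sentence embeddings rather than word counts.

```python
# Minimal retrieval-based bot: pick the canned response whose stored
# question is most similar to the user query (bag-of-words cosine).
# The FAQ pairs below are invented examples, not from a real product.
import math
import re
from collections import Counter

FAQ = {
    "how do i reset my password": "Visit Settings > Security and click 'Reset password'.",
    "what are your opening hours": "We are open 9am-5pm, Monday to Friday.",
    "how can i contact support": "Email support@example.com or call the helpline.",
}

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_response(query: str) -> str:
    q_vec = tokenize(query)
    best = max(FAQ, key=lambda k: cosine(q_vec, tokenize(k)))
    return FAQ[best]

print(retrieve_response("I forgot my password, how do I reset it?"))
```

The trade-off from the table is visible here: the bot can never say anything outside its response database, which is exactly why it cannot hallucinate but also cannot handle novel questions.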

Building a simple conversational system

Multi-turn chatbot with conversation memory using Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from collections import deque

class ConversationalBot:
    """Simple multi-turn chatbot with sliding context window."""

    def __init__(self, model_name="microsoft/DialoGPT-medium",
                 max_history_turns=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model     = AutoModelForCausalLM.from_pretrained(model_name)
        self.history   = deque(maxlen=max_history_turns * 2)  # user+bot turns
        self.model.eval()

    def chat(self, user_input: str) -> str:
        # Encode user input
        user_ids = self.tokenizer.encode(
            user_input + self.tokenizer.eos_token, return_tensors='pt')

        # Build context: all history + current input
        if self.history:
            context_ids = torch.cat([
                torch.cat(list(self.history), dim=-1),
                user_ids
            ], dim=-1)
        else:
            context_ids = user_ids

        # Generate response
        with torch.no_grad():
            response_ids = self.model.generate(
                context_ids,
                max_new_tokens=200,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True, top_p=0.92, temperature=0.7,
                no_repeat_ngram_size=3,    # Prevent repetition
            )

        # Extract only the new tokens (not the input)
        new_tokens    = response_ids[:, context_ids.shape[-1]:]
        response_text = self.tokenizer.decode(new_tokens[0], skip_special_tokens=True)

        # Store turn in history
        self.history.append(user_ids)
        self.history.append(new_tokens)

        return response_text

    def reset(self):
        self.history.clear()
        print("Conversation history cleared.")

# ── Production pattern: OpenAI API with conversation history ──
from openai import OpenAI
client = OpenAI()   # Requires OPENAI_API_KEY

def chat_with_memory(user_message: str, conversation_history: list) -> str:
    """Production pattern: send full history to LLM API each turn."""
    conversation_history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful NLP tutor."},
            *conversation_history,
        ],
        max_tokens=500,
        temperature=0.7,
    )
    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply

# Each API call sends the FULL conversation history.
# Chat completion APIs are stateless — there is no server-side session.
# The "memory" lives in the client's message list, passed as context each turn.
history = []
for user_msg in ["What is BERT?", "How is it different from GPT?", "Which is better for chatbots?"]:
    reply = chat_with_memory(user_msg, history)
    print(f"User: {user_msg}")
    print(f"Bot:  {reply[:100]}")
    print()

Dialogue systems components

  • Natural Language Understanding (NLU): Detect user intent (book_flight, check_balance) and extract entities (destination=Paris, date=Friday). BERT-based classifiers are standard.
  • Dialogue State Tracking (DST): Maintain a structured representation of what has been discussed — slots filled, context, current goal. Critical for task-oriented bots.
  • Dialogue Policy: Decide what action to take given the current state — request more info, confirm, execute action, or apologise.
  • Natural Language Generation (NLG): Convert the action into a natural language response — either template-based ("Your flight to {destination} is booked") or LLM-generated.
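The four components above can be wired together in a toy task-oriented pipeline: keyword-based NLU, a dictionary for state tracking, an if/else policy, and template NLG. This is a hedged sketch with invented intent names, slot patterns, and templates; production systems would use a trained classifier for NLU and a learned or carefully engineered policy.

```python
# Toy task-oriented dialogue pipeline: NLU -> state tracking -> policy -> NLG.
# Intents, slot regexes, and templates are invented for illustration.
import re

def nlu(text: str):
    """NLU: detect intent (keyword match) and extract slots (regex)."""
    intent = "book_flight" if "flight" in text.lower() else "unknown"
    slots = {}
    if (m := re.search(r"\bto (\w+)", text)):
        slots["destination"] = m.group(1)
    if (m := re.search(r"\bon (\w+)", text)):
        slots["date"] = m.group(1)
    return intent, slots

def policy(state: dict) -> str:
    """Policy: pick the next action given the tracked dialogue state."""
    if state["intent"] != "book_flight":
        return "fallback"
    for slot in ("destination", "date"):
        if slot not in state["slots"]:
            return f"request_{slot}"
    return "confirm_booking"

TEMPLATES = {  # NLG: template-based, as described above
    "request_destination": "Where would you like to fly to?",
    "request_date": "What date would you like to travel?",
    "confirm_booking": "Booking your flight to {destination} on {date}.",
    "fallback": "Sorry, I can only help with flight bookings.",
}

def respond(text: str, state: dict) -> str:
    intent, slots = nlu(text)
    if intent != "unknown":
        state["intent"] = intent          # dialogue state tracking
    state["slots"].update(slots)          # accumulate filled slots
    return TEMPLATES[policy(state)].format(**state["slots"])

state = {"intent": "unknown", "slots": {}}
print(respond("Book me a flight to Paris", state))  # asks for the date
print(respond("on Friday", state))                  # confirms the booking
```

Note how the state dictionary carries the intent and the Paris slot across turns, so the second utterance ("on Friday") only needs to fill the missing date.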

How ChatGPT maintains context

The chat completions API underlying ChatGPT is stateless: the server does not persist conversation state between calls. Each API call sends the ENTIRE conversation history (all previous messages) as context tokens; the LLM reads every previous turn and generates the next response. This is why very long conversations are expensive (more input tokens per call) and eventually hit the context window limit. The "memory" lives in the message list the client passes on each turn.
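Because the client owns the history, it must also keep the history under the context window limit. One common pattern is to trim the oldest turns while always preserving the system prompt. The sketch below estimates tokens with a rough 4-characters-per-token heuristic; a real client would count tokens with the model's actual tokenizer (e.g. tiktoken) instead.

```python
# Sketch of client-side history trimming so the message list fits a
# token budget. The 4-chars-per-token estimate is a rough heuristic,
# not the model's real tokenization.
def estimate_tokens(message: dict) -> int:
    return max(1, len(message["content"]) // 4)

def trim_history(messages: list, budget: int) -> list:
    system, turns = messages[0], messages[1:]
    # Drop the oldest turns first; always keep the system prompt
    while turns and sum(map(estimate_tokens, [system] + turns)) > budget:
        turns.pop(0)
    return [system] + turns

history = [{"role": "system", "content": "You are a helpful NLP tutor."}]
history += [{"role": "user", "content": "question " * 100}] * 4
print(len(trim_history(history, budget=500)))  # → 3 (system + 2 newest turns)
```

Trimming loses information from the dropped turns; production systems often summarise old turns into a short context message instead of discarding them outright.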

Practice questions

  1. What is the difference between a retrieval-based and a generative chatbot? (Answer: Retrieval: selects the best pre-written response from a database — factually safe, fast, but cannot handle novel questions. Generative: LLM generates novel responses token by token — flexible and natural, but can hallucinate. Production systems increasingly use hybrid: retrieve relevant facts + LLM generates the response grounded in retrieved facts.)
  2. Why does a chatbot send the full conversation history on every turn? (Answer: Transformer LLMs are stateless — they have no persistent memory between calls. The only way to maintain conversational context is to include all previous turns in the input context window for each new call. This is why long conversations consume more tokens and eventually hit the context window limit.)
  3. What is intent classification in a task-oriented chatbot? (Answer: Given user input "Book me a flight to Paris on Friday", classify the intent as BOOK_FLIGHT. The system then extracts slot values: destination=Paris, date=Friday. Downstream, it fills these slots into a travel booking API call. BERT fine-tuned on a labelled intent dataset is the standard approach.)
  4. A chatbot gives confidently wrong answers about product prices. What architecture issue is this and how do you fix it? (Answer: Hallucination — the LLM generates plausible-sounding but incorrect facts. Fix with RAG: retrieve product price from a trusted database and inject it into the prompt context. The LLM generates the response grounded in the retrieved fact rather than relying on training data that may be outdated or wrong.)
  5. What is no_repeat_ngram_size=3 in text generation and why is it used? (Answer: It prevents the model from repeating any 3-gram (3 consecutive tokens) that already appeared in the sequence. Without it, LLMs can fall into repetitive loops ("I think that I think that I think that..."). Most generation configs set no_repeat_ngram_size to 3 or 4 to suppress repetitive responses.)
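The RAG fix from question 4 comes down to prompt construction: fetch the price from a trusted store and inject it into the prompt as a fact the model must use. A minimal sketch, with an invented PRICES store and prompt wording; in production the lookup would hit a product database and the prompt would go to an LLM API as in the code above.

```python
# Sketch of the RAG fix from question 4: ground the answer in a
# retrieved fact instead of the model's (possibly stale) training data.
# PRICES and the product ids are invented for illustration.
PRICES = {"widget-pro": "₹1,499", "widget-mini": "₹799"}

def grounded_prompt(question: str, product_id: str) -> str:
    fact = PRICES.get(product_id, "unknown")  # retrieval step
    return (
        "Answer using ONLY the facts below. If the answer is not in the "
        "facts, say you don't know.\n"
        f"Fact: the current price of {product_id} is {fact}.\n"
        f"Question: {question}"
    )

print(grounded_prompt("How much does the Widget Pro cost?", "widget-pro"))
```

The "say you don't know" instruction matters as much as the retrieval: it gives the model an explicit alternative to inventing a price when the fact is missing.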

On LumiChats

LumiChats is a production conversational AI system. The multi-turn context, intent understanding, RAG-based factual grounding, and safety filtering are all components described in this article. Every feature of LumiChats corresponds to a specific component of the conversational AI architecture.
