Conversational AI systems hold multi-turn dialogues with users, maintaining context across turns. They come in three broad types: rule-based (pattern matching), retrieval-based (selecting the best response from a database), and generative (an LLM generates novel responses). Modern production chatbots combine retrieval (RAG for factual grounding) with LLM generation. Key challenges: coherence across turns, grounding in facts, safety filtering, persona consistency, and latency. Building a conversational AI system requires NLU (understanding intent), dialogue management (tracking state), and NLG (generating responses).
Architecture types
| Type | How it works | Pros | Cons | Examples |
|---|---|---|---|---|
| Rule-based | Pattern matching + decision trees | Predictable, auditable, fast | Cannot handle variations, brittle | Bank IVR, early Siri |
| Retrieval-based | Encode query, search response database | Factually safe, fast | Cannot generate novel responses | FAQ bots, customer support |
| Generative (LLM) | Transformer generates response from prompt | Flexible, natural, novel | Can hallucinate, expensive, slow | ChatGPT, Claude, Gemini |
| Hybrid (RAG) | Retrieve relevant docs + LLM generates with context | Factual + natural | Complex pipeline, latency | Enterprise chatbots, 2024 standard |
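To make the retrieval-based row concrete, here is a minimal sketch of a retrieval-based responder. The FAQ data and the word-overlap scoring are illustrative assumptions; production systems score candidates with dense embeddings rather than bag-of-words overlap, but the core idea is the same: the bot never generates text, it only selects a pre-written response.

```python
import re

# Hypothetical FAQ database for illustration: stored question -> canned answer.
FAQ = {
    "how do i reset my password": "Visit Settings > Security and click 'Reset password'.",
    "what are your opening hours": "We are open 9am-5pm, Monday to Friday.",
    "how do i contact support": "Email support@example.com or use the in-app chat.",
}

def retrieve_response(query: str) -> str:
    """Return the canned answer whose stored question shares the most words
    with the user query (a crude stand-in for embedding similarity)."""
    q_words = set(re.findall(r"[a-z]+", query.lower()))
    best_question = max(FAQ, key=lambda q: len(q_words & set(q.split())))
    return FAQ[best_question]

print(retrieve_response("How can I reset my password?"))
# Visit Settings > Security and click 'Reset password'.
```

Because responses are pre-written, a retrieval bot is factually safe by construction, but it can only ever say what is already in its database.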
Building a simple conversational system
Multi-turn chatbot with conversation memory using Hugging Face
```python
from collections import deque

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ConversationalBot:
    """Simple multi-turn chatbot with sliding context window."""

    def __init__(self, model_name="microsoft/DialoGPT-medium",
                 max_history_turns=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = deque(maxlen=max_history_turns * 2)  # user + bot turns
        self.model.eval()

    def chat(self, user_input: str) -> str:
        # Encode user input (EOS marks the end of a turn for DialoGPT)
        user_ids = self.tokenizer.encode(
            user_input + self.tokenizer.eos_token, return_tensors='pt')

        # Build context: all history + current input
        if self.history:
            context_ids = torch.cat([
                torch.cat(list(self.history), dim=-1),
                user_ids
            ], dim=-1)
        else:
            context_ids = user_ids

        # Generate response
        with torch.no_grad():
            response_ids = self.model.generate(
                context_ids,
                max_new_tokens=200,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True, top_p=0.92, temperature=0.7,
                no_repeat_ngram_size=3,  # Prevent repetition
            )

        # Extract only the new tokens (not the input)
        new_tokens = response_ids[:, context_ids.shape[-1]:]
        response_text = self.tokenizer.decode(new_tokens[0], skip_special_tokens=True)

        # Store both sides of the turn in history
        self.history.append(user_ids)
        self.history.append(new_tokens)
        return response_text

    def reset(self):
        self.history.clear()
        print("Conversation history cleared.")
```
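The sliding window that caps the bot's memory is just `deque(maxlen=...)`: once the deque is full, appending silently evicts the oldest entry. A quick self-contained demonstration of that mechanism (using strings in place of token tensors):

```python
from collections import deque

# maxlen=4 keeps 2 turns * (user + bot), mirroring max_history_turns=2.
history = deque(maxlen=4)
for item in ["u1", "b1", "u2", "b2", "u3", "b3"]:
    history.append(item)  # once full, each append evicts the oldest entry

print(list(history))
# ['u2', 'b2', 'u3', 'b3']  -- the oldest pair (u1, b1) was evicted
```

This is why the bot gradually "forgets" early turns: they fall out of the window rather than being summarised.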
```python
# ── Production pattern: OpenAI API with conversation history ──
from openai import OpenAI

client = OpenAI()  # Requires OPENAI_API_KEY


def chat_with_memory(user_message: str, conversation_history: list) -> str:
    """Production pattern: send full history to LLM API each turn."""
    conversation_history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful NLP tutor."},
            *conversation_history,
        ],
        max_tokens=500,
        temperature=0.7,
    )
    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply


# Each API call sends the FULL conversation history.
# This is how ChatGPT/Claude maintain context — no server-side state:
# the "memory" is in the client, passed as context each turn.
history = []
for user_msg in ["What is BERT?", "How is it different from GPT?",
                 "Which is better for chatbots?"]:
    reply = chat_with_memory(user_msg, history)
    print(f"User: {user_msg}")
    print(f"Bot: {reply[:100]}")
    print()
```
Dialogue systems components
- Natural Language Understanding (NLU): Detect user intent (book_flight, check_balance) and extract entities (destination=Paris, date=Friday). BERT-based classifiers are standard.
- Dialogue State Tracking (DST): Maintain a structured representation of what has been discussed — slots filled, context, current goal. Critical for task-oriented bots.
- Dialogue Policy: Decide what action to take given the current state — request more info, confirm, execute action, or apologise.
- Natural Language Generation (NLG): Convert the action into a natural language response — either template-based ("Your flight to {destination} is booked") or LLM-generated.
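The four components above can be wired together in a toy task-oriented bot. In this sketch, regexes stand in for the BERT-based intent classifier and entity extractor, and a template stands in for NLG; the intent name, slot names, and patterns are all illustrative assumptions, not a standard schema:

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DialogueState:
    """DST: structured record of intent and filled slots across turns."""
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)

    def missing_slots(self, required=("destination", "date")):
        return [s for s in required if s not in self.slots]


def nlu(utterance: str, state: DialogueState) -> DialogueState:
    """NLU: regex stand-ins for intent classification and entity extraction."""
    if re.search(r"\bbook\b.*\bflight\b", utterance, re.I):
        state.intent = "BOOK_FLIGHT"
    if m := re.search(r"\bto ([A-Z][a-z]+)", utterance):
        state.slots["destination"] = m.group(1)
    if m := re.search(r"\bon (\w+day)\b", utterance, re.I):
        state.slots["date"] = m.group(1)
    return state


def policy(state: DialogueState) -> str:
    """Policy: request missing info or confirm; template-based NLG."""
    if state.intent != "BOOK_FLIGHT":
        return "How can I help you?"
    if missing := state.missing_slots():
        return f"What {missing[0]} would you like?"
    return f"Booking your flight to {state.slots['destination']} on {state.slots['date']}."


state = DialogueState()
nlu("Book me a flight to Paris", state)
print(policy(state))   # What date would you like?
nlu("on Friday please", state)
print(policy(state))   # Booking your flight to Paris on Friday.
```

The key point is that state persists across turns: the second utterance fills the missing `date` slot without restating the destination.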
How ChatGPT maintains context
ChatGPT does not store conversation state on the server. Each API call sends the ENTIRE conversation history (all previous messages) as context tokens. The LLM reads all previous turns and generates the next response. This is why very long conversations are expensive (more tokens) and eventually hit context window limits. The "memory" is in the message list passed by the client.
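Because the client resends everything, long conversations must eventually be trimmed to fit the context window. A minimal client-side trimming sketch, assuming a crude characters-per-token estimate (real clients would count tokens with a proper tokenizer such as tiktoken):

```python
def trim_history(messages, max_tokens=3000, chars_per_token=4):
    """Drop the oldest non-system messages until the estimated token count
    fits the budget. The len/4 token estimate is a rough heuristic."""
    def est_tokens(msgs):
        return sum(len(m["content"]) for m in msgs) // chars_per_token

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and est_tokens(system + rest) > max_tokens:
        rest.pop(0)  # evict the oldest user/assistant message first
    return system + rest
```

The system prompt is always kept; only the oldest turns are evicted, trading long-range memory for a bounded per-call cost. Alternatives include summarising evicted turns into a running summary message instead of dropping them.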
Practice questions
- What is the difference between a retrieval-based and a generative chatbot? (Answer: Retrieval: selects the best pre-written response from a database — factually safe, fast, but cannot handle novel questions. Generative: LLM generates novel responses token by token — flexible and natural, but can hallucinate. Production systems increasingly use hybrid: retrieve relevant facts + LLM generates the response grounded in retrieved facts.)
- Why does a chatbot send the full conversation history on every turn? (Answer: Transformer LLMs are stateless — they have no persistent memory between calls. The only way to maintain conversational context is to include all previous turns in the input context window for each new call. This is why long conversations consume more tokens and eventually hit the context window limit.)
- What is intent classification in a task-oriented chatbot? (Answer: Given user input "Book me a flight to Paris on Friday", classify the intent as BOOK_FLIGHT. The system then extracts slot values: destination=Paris, date=Friday. Downstream, it fills these slots into a travel booking API call. BERT fine-tuned on a labelled intent dataset is the standard approach.)
- A chatbot gives confidently wrong answers about product prices. What architecture issue is this and how do you fix it? (Answer: Hallucination — the LLM generates plausible-sounding but incorrect facts. Fix with RAG: retrieve product price from a trusted database and inject it into the prompt context. The LLM generates the response grounded in the retrieved fact rather than relying on training data that may be outdated or wrong.)
- What is "no_repeat_ngram_size=3" in text generation and why is it used? (Answer: It prevents the model from repeating any 3-gram (3 consecutive tokens) that already appeared in the output. Without it, LLMs can fall into repetitive loops ("I think that I think that I think that..."). Most generation configs set no_repeat_ngram_size to 3 or 4 to prevent repetitive responses.)
On LumiChats
LumiChats is a production conversational AI system. The multi-turn context, intent understanding, RAG-based factual grounding, and safety filtering are all components described in this article. Every feature of LumiChats corresponds to a specific component of the conversational AI architecture.