Conversational AI systems hold multi-turn dialogues with users, maintaining context across turns. They come in three broad types: rule-based (pattern matching), retrieval-based (selecting the best response from a database), and generative (an LLM generates novel responses). Modern production chatbots combine retrieval (RAG for factual grounding) with LLM generation. Key challenges: coherence across turns, grounding in facts, safety filtering, persona consistency, and latency. Building a conversational AI system requires NLU (understanding intent), dialogue management (tracking state), and NLG (generating responses).
Architecture types
| Type | How it works | Pros | Cons | Examples |
|---|---|---|---|---|
| Rule-based | Pattern matching + decision trees | Predictable, auditable, fast | Cannot handle variations, brittle | Bank IVR, early Siri |
| Retrieval-based | Encode query, search response database | Factually safe, fast | Cannot generate novel responses | FAQ bots, customer support |
| Generative (LLM) | Transformer generates response from prompt | Flexible, natural, novel | Can hallucinate, expensive, slow | ChatGPT, Claude, Gemini |
| Hybrid (RAG) | Retrieve relevant docs + LLM generates with context | Factual + natural | Complex pipeline, latency | Enterprise chatbots, 2024 standard |
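To make the retrieval-based row concrete, here is a minimal sketch of a retrieval-based responder. The FAQ data and the word-overlap scoring are illustrative assumptions; production systems score candidates with dense embeddings rather than bag-of-words overlap, but the core idea is the same: the bot never generates text, it only selects a pre-written response.

```python
import re

# Hypothetical FAQ database for illustration: stored question -> canned answer.
FAQ = {
    "how do i reset my password": "Visit Settings > Security and click 'Reset password'.",
    "what are your opening hours": "We are open 9am-5pm, Monday to Friday.",
    "how do i contact support": "Email support@example.com or use the in-app chat.",
}

def retrieve_response(query: str) -> str:
    """Return the canned answer whose stored question shares the most words
    with the user query (a crude stand-in for embedding similarity)."""
    q_words = set(re.findall(r"[a-z]+", query.lower()))
    best_question = max(FAQ, key=lambda q: len(q_words & set(q.split())))
    return FAQ[best_question]

print(retrieve_response("How can I reset my password?"))
# Visit Settings > Security and click 'Reset password'.
```

Because responses are pre-written, a retrieval bot is factually safe by construction, but it can only ever say what is already in its database.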
Building a simple conversational system
Multi-turn chatbot with conversation memory using Hugging Face
```python
from collections import deque

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ConversationalBot:
    """Simple multi-turn chatbot with sliding context window."""

    def __init__(self, model_name="microsoft/DialoGPT-medium",
                 max_history_turns=5):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = deque(maxlen=max_history_turns * 2)  # user + bot turns
        self.model.eval()

    def chat(self, user_input: str) -> str:
        # Encode user input (EOS marks the end of a turn for DialoGPT)
        user_ids = self.tokenizer.encode(
            user_input + self.tokenizer.eos_token, return_tensors='pt')

        # Build context: all history + current input
        if self.history:
            context_ids = torch.cat([
                torch.cat(list(self.history), dim=-1),
                user_ids
            ], dim=-1)
        else:
            context_ids = user_ids

        # Generate response
        with torch.no_grad():
            response_ids = self.model.generate(
                context_ids,
                max_new_tokens=200,
                pad_token_id=self.tokenizer.eos_token_id,
                do_sample=True, top_p=0.92, temperature=0.7,
                no_repeat_ngram_size=3,  # Prevent repetition
            )

        # Extract only the new tokens (not the input)
        new_tokens = response_ids[:, context_ids.shape[-1]:]
        response_text = self.tokenizer.decode(new_tokens[0], skip_special_tokens=True)

        # Store both sides of the turn in history
        self.history.append(user_ids)
        self.history.append(new_tokens)
        return response_text

    def reset(self):
        self.history.clear()
        print("Conversation history cleared.")
```
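The sliding window that caps the bot's memory is just `deque(maxlen=...)`: once the deque is full, appending silently evicts the oldest entry. A quick self-contained demonstration of that mechanism (using strings in place of token tensors):

```python
from collections import deque

# maxlen=4 keeps 2 turns * (user + bot), mirroring max_history_turns=2.
history = deque(maxlen=4)
for item in ["u1", "b1", "u2", "b2", "u3", "b3"]:
    history.append(item)  # once full, each append evicts the oldest entry

print(list(history))
# ['u2', 'b2', 'u3', 'b3']  -- the oldest pair (u1, b1) was evicted
```

This is why the bot gradually "forgets" early turns: they fall out of the window rather than being summarised.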
```python
# ── Production pattern: OpenAI API with conversation history ──
from openai import OpenAI

client = OpenAI()  # Requires OPENAI_API_KEY


def chat_with_memory(user_message: str, conversation_history: list) -> str:
    """Production pattern: send full history to LLM API each turn."""
    conversation_history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful NLP tutor."},
            *conversation_history,
        ],
        max_tokens=500,
        temperature=0.7,
    )
    assistant_reply = response.choices[0].message.content
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply


# Each API call sends the FULL conversation history.
# This is how ChatGPT/Claude maintain context — no server-side state:
# the "memory" is in the client, passed as context each turn.
history = []
for user_msg in ["What is BERT?", "How is it different from GPT?",
                 "Which is better for chatbots?"]:
    reply = chat_with_memory(user_msg, history)
    print(f"User: {user_msg}")
    print(f"Bot: {reply[:100]}")
    print()
```
Dialogue systems components
- Natural Language Understanding (NLU): Detect user intent (book_flight, check_balance) and extract entities (destination=Paris, date=Friday). BERT-based classifiers are standard.
- Dialogue State Tracking (DST): Maintain a structured representation of what has been discussed — slots filled, context, current goal. Critical for task-oriented bots.
- Dialogue Policy: Decide what action to take given the current state — request more info, confirm, execute action, or apologise.
- Natural Language Generation (NLG): Convert the action into a natural language response — either template-based ("Your flight to {destination} is booked") or LLM-generated.
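The four components above can be wired together in a toy task-oriented bot. In this sketch, regexes stand in for the BERT-based intent classifier and entity extractor, and a template stands in for NLG; the intent name, slot names, and patterns are all illustrative assumptions, not a standard schema:

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DialogueState:
    """DST: structured record of intent and filled slots across turns."""
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)

    def missing_slots(self, required=("destination", "date")):
        return [s for s in required if s not in self.slots]


def nlu(utterance: str, state: DialogueState) -> DialogueState:
    """NLU: regex stand-ins for intent classification and entity extraction."""
    if re.search(r"\bbook\b.*\bflight\b", utterance, re.I):
        state.intent = "BOOK_FLIGHT"
    if m := re.search(r"\bto ([A-Z][a-z]+)", utterance):
        state.slots["destination"] = m.group(1)
    if m := re.search(r"\bon (\w+day)\b", utterance, re.I):
        state.slots["date"] = m.group(1)
    return state


def policy(state: DialogueState) -> str:
    """Policy: request missing info or confirm; template-based NLG."""
    if state.intent != "BOOK_FLIGHT":
        return "How can I help you?"
    if missing := state.missing_slots():
        return f"What {missing[0]} would you like?"
    return f"Booking your flight to {state.slots['destination']} on {state.slots['date']}."


state = DialogueState()
nlu("Book me a flight to Paris", state)
print(policy(state))   # What date would you like?
nlu("on Friday please", state)
print(policy(state))   # Booking your flight to Paris on Friday.
```

The key point is that state persists across turns: the second utterance fills the missing `date` slot without restating the destination.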
How ChatGPT maintains context
ChatGPT does not store conversation state on the server. Each API call sends the ENTIRE conversation history (all previous messages) as context tokens. The LLM reads all previous turns and generates the next response. This is why very long conversations are expensive (more tokens) and eventually hit context window limits. The "memory" is in the message list passed by the client.
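Because the client resends everything, long conversations must eventually be trimmed to fit the context window. A minimal client-side trimming sketch, assuming a crude characters-per-token estimate (real clients would count tokens with a proper tokenizer such as tiktoken):

```python
def trim_history(messages, max_tokens=3000, chars_per_token=4):
    """Drop the oldest non-system messages until the estimated token count
    fits the budget. The len/4 token estimate is a rough heuristic."""
    def est_tokens(msgs):
        return sum(len(m["content"]) for m in msgs) // chars_per_token

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and est_tokens(system + rest) > max_tokens:
        rest.pop(0)  # evict the oldest user/assistant message first
    return system + rest
```

The system prompt is always kept; only the oldest turns are evicted, trading long-range memory for a bounded per-call cost. Alternatives include summarising evicted turns into a running summary message instead of dropping them.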
Practice questions
- What is the difference between a retrieval-based and a generative chatbot? (Answer: Retrieval: selects the best pre-written response from a database — factually safe, fast, but cannot handle novel questions. Generative: LLM generates novel responses token by token — flexible and natural, but can hallucinate. Production systems increasingly use hybrid: retrieve relevant facts + LLM generates the response grounded in retrieved facts.)
- Why does a chatbot send the full conversation history on every turn? (Answer: Transformer LLMs are stateless — they have no persistent memory between calls. The only way to maintain conversational context is to include all previous turns in the input context window for each new call. This is why long conversations consume more tokens and eventually hit the context window limit.)
- What is intent classification in a task-oriented chatbot? (Answer: Given user input "Book me a flight to Paris on Friday", classify the intent as BOOK_FLIGHT. The system then extracts slot values: destination=Paris, date=Friday. Downstream, it fills these slots into a travel booking API call. BERT fine-tuned on a labelled intent dataset is the standard approach.)
- A chatbot gives confidently wrong answers about product prices. What architecture issue is this and how do you fix it? (Answer: Hallucination — the LLM generates plausible-sounding but incorrect facts. Fix with RAG: retrieve product price from a trusted database and inject it into the prompt context. The LLM generates the response grounded in the retrieved fact rather than relying on training data that may be outdated or wrong.)
- What is "no_repeat_ngram_size=3" in text generation and why is it used? (Answer: It prevents the model from repeating any 3-gram (3 consecutive tokens) that already appeared in the output. Without it, LLMs can fall into repetitive loops ("I think that I think that I think that..."). Most generation configs set no_repeat_ngram_size to 3 or 4 to prevent repetitive responses.)
On LumiChats
LumiChats is a production conversational AI system. The multi-turn context, intent understanding, RAG-based factual grounding, and safety filtering are all components described in this article. Every feature of LumiChats corresponds to a specific component of the conversational AI architecture.