
Reinforcement Learning — Policy, Value, Q-Learning & Exploration

Learning by trial and error — agents, rewards, and the path to optimal decisions.


Definition

Reinforcement Learning (RL) trains an agent to make sequential decisions in an environment to maximise cumulative reward. Unlike supervised learning (labelled examples) or unsupervised learning (patterns), RL learns from the consequences of its own actions. Core components: Agent (learner/decision maker), Environment (world the agent interacts with), State (current situation), Action (what the agent does), Reward (feedback signal), Policy (strategy for choosing actions), and Value Function (expected future rewards). Q-Learning is the foundational model-free RL algorithm. Deep RL (DQN, PPO, A3C) powers AlphaGo, ChatGPT RLHF, and game-playing AI.

Real-life analogy: Training a dog

Training a dog to sit: the dog (agent) tries different behaviours (actions) in the room (environment). When it sits, you give a treat (positive reward). When it jumps, you say 'no' (negative reward). The dog learns to sit to maximise treats. It does not need labelled examples — it discovers the optimal policy through trial, error, and reward signals. RL is generalised dog training for any sequential decision-making problem.

Core RL components and Markov Decision Process

| Component | Symbol | Definition | Example (chess) |
|---|---|---|---|
| Agent | - | The learner/decision maker | Chess program |
| Environment | - | Everything outside the agent | Chess board + opponent |
| State | s ∈ S | Current situation of the environment | Current board position (all piece locations) |
| Action | a ∈ A(s) | What the agent does in state s | A legal chess move |
| Reward | r | Scalar feedback after each action | +1 win, -1 lose, 0 draw |
| Policy | π(a\|s) | Strategy: probability of action a in state s | Which move to play in each position |
| Value function | V(s) | Expected future reward from state s | How good is this position? |
| Q-function | Q(s,a) | Expected reward for taking action a in state s | How good is this specific move? |
| Discount factor | γ ∈ [0,1] | Weight of future vs immediate rewards | 0.99: future rewards almost as valuable |

Return G_t: discounted sum of future rewards from time t. γ (gamma) controls time preference: γ=0 → myopic (only immediate reward). γ=1 → far-sighted (future rewards equally valuable). Bellman equation: V(s) = E[r + γV(s')] — value of a state = immediate reward + discounted value of next state.
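The effect of γ on the return can be seen directly by computing G_t for a short reward sequence; a minimal sketch (the reward values are illustrative, not from the source):

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed back-to-front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 1]  # reward only at the end, e.g. reaching a goal after 3 steps
print(discounted_return(rewards, gamma=0.0))   # 0.0    — myopic: future reward invisible
print(discounted_return(rewards, gamma=1.0))   # 1.0    — far-sighted: no discounting
print(discounted_return(rewards, gamma=0.95))  # 0.9025 — 0.95^2, delayed reward shrinks
```

The backward loop is exactly the Bellman recursion G_t = r_t + γG_{t+1} applied from the last step to the first.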

Q-Learning — the foundational RL algorithm

Q-Learning from scratch on FrozenLake environment

import numpy as np
import gymnasium as gym

# FrozenLake: 4x4 grid, agent must reach goal (G) without falling in holes (H)
# States: 0-15 (16 grid positions), Actions: 0=Left, 1=Down, 2=Right, 3=Up
env = gym.make('FrozenLake-v1', is_slippery=False)

# Q-TABLE: Q[state, action] = expected future reward
n_states  = env.observation_space.n   # 16
n_actions = env.action_space.n        # 4
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha       = 0.8     # Learning rate: how quickly update Q values
gamma       = 0.95    # Discount factor: weight of future rewards
epsilon     = 1.0     # Exploration rate: probability of random action
eps_decay   = 0.995   # Epsilon decay per episode
eps_min     = 0.01    # Minimum exploration rate
n_episodes  = 2000

rewards_history = []

for episode in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0

    for step in range(100):  # Max 100 steps per episode
        # EXPLORATION vs EXPLOITATION (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()   # Explore: random action
        else:
            action = np.argmax(Q[state])          # Exploit: best known action

        # Take action, observe next state and reward
        next_state, reward, done, truncated, _ = env.step(action)

        # Q-LEARNING UPDATE (Bellman equation)
        # Q(s,a) ← Q(s,a) + α[r + γ max_a'Q(s',a') - Q(s,a)]
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )
        state = next_state
        total_reward += reward

        if done or truncated:
            break

    # Decay epsilon (less exploration over time as Q converges)
    epsilon = max(eps_min, epsilon * eps_decay)
    rewards_history.append(total_reward)

    if (episode + 1) % 500 == 0:
        avg_reward = np.mean(rewards_history[-100:])
        print(f"Episode {episode+1}: Avg reward (last 100) = {avg_reward:.3f}, ε = {epsilon:.3f}")

print("Learned Q-table (first 5 states):")
print(Q[:5].round(3))
print("Learned policy:", [['Left','Down','Right','Up'][np.argmax(Q[i])] for i in range(16)])

Exploration vs Exploitation and modern deep RL

Exploration-Exploitation Dilemma: the agent must balance exploiting the best action it currently knows against exploring new actions that might turn out to be better. Pure exploitation gets stuck in local optima; pure exploration never settles on a good strategy. Common solutions: ε-greedy (act randomly with probability ε), Upper Confidence Bound (UCB), and Thompson Sampling.
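One of these strategies, UCB1, can be sketched on a toy three-armed bandit (the arm win probabilities below are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.8])   # hidden win probability of each arm (illustrative)
counts = np.zeros(3)                  # how often each arm was pulled
values = np.zeros(3)                  # running mean reward per arm

for t in range(1, 5001):
    # UCB1 score: current mean + exploration bonus that shrinks as an arm is tried more
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1e-9))
    a = int(np.argmax(values + bonus))
    r = float(rng.random() < true_p[a])
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update

print("Pulls per arm:", counts)  # the best arm should receive most pulls
```

Unlike ε-greedy, UCB explores deterministically: arms with few pulls get a large bonus, so every arm is tried, but attention concentrates on the best arm as uncertainty shrinks.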

| Algorithm | Key idea | Use case |
|---|---|---|
| Q-Learning | Tabular Q(s,a) — works for small discrete spaces | Grid worlds, simple games |
| DQN (Deep Q-Network) | Neural network approximates Q(s,a) | Atari games (DeepMind 2013) |
| Policy Gradient (REINFORCE) | Directly optimise policy π(a\|s) via gradient ascent | Continuous action spaces |
| Actor-Critic (A2C/A3C) | Separate policy (actor) and value (critic) networks | Robotics, continuous control |
| PPO (Proximal Policy Optimisation) | Policy gradient with clipped objective — stable training | ChatGPT RLHF, games, robotics |
| AlphaGo / AlphaZero | Monte Carlo Tree Search + deep RL self-play | Chess, Go, board games |
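The policy-gradient idea behind REINFORCE (and, by extension, PPO) can be sketched on a two-action bandit with a softmax policy; the reward probabilities and learning rate below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)                    # logits of a softmax policy over 2 actions
true_reward = np.array([0.2, 0.9])     # P(reward = 1 | action), illustrative
lr = 0.1

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = int(rng.choice(2, p=probs))
    r = float(rng.random() < true_reward[a])
    # REINFORCE: ascend r * grad log pi(a|theta).
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi

probs = np.exp(theta) / np.exp(theta).sum()
print("Final policy:", probs.round(3))  # should favour the higher-reward action
```

Note the contrast with Q-Learning: nothing here estimates values — the policy parameters are nudged directly toward actions that were followed by reward.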

RLHF — how ChatGPT/Claude are trained

Reinforcement Learning from Human Feedback (RLHF): (1) Supervised fine-tuning on high-quality demonstrations. (2) Train a reward model — humans rank multiple AI responses, reward model learns to predict human preference. (3) PPO optimises the LLM policy to maximise the reward model's score, subject to a KL constraint that prevents the policy from drifting too far from the SFT model. This is why LLMs follow instructions and are "helpful, harmless, and honest."
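The KL-constrained objective in step (3) is often described as reward shaping: the reward model's score minus a penalty on how far the policy's token probabilities drift from the SFT model's. A minimal sketch (β and all log-probabilities below are illustrative assumptions):

```python
import numpy as np

def rlhf_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a KL-style penalty that keeps the
    policy close to the supervised fine-tuned (SFT) model."""
    log_ratio = logp_policy - logp_sft   # per-token log-probability ratio
    return rm_score - beta * np.sum(log_ratio)

logp_policy = np.array([-1.0, -0.5, -2.0])   # policy log-probs per token (illustrative)
logp_sft    = np.array([-1.2, -0.6, -1.5])   # SFT log-probs per token (illustrative)
print(rlhf_reward(rm_score=2.0, logp_policy=logp_policy, logp_sft=logp_sft))
```

A larger β keeps the model closer to its SFT behaviour; a smaller β lets it chase the reward model harder, at the risk of reward hacking.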

Practice questions

  1. Discount factor γ=0 vs γ=1 — what is the agent optimising in each case? (Answer: γ=0: myopic — only cares about immediate reward, ignores all future rewards. γ=1: far-sighted — treats all future rewards equally to immediate reward (no discounting). In practice: γ=0.95-0.99 for most tasks. γ<1 ensures the sum of infinite rewards is finite.)
  2. What is the Exploration-Exploitation dilemma? Give an example. (Answer: The agent must choose between exploiting the best known strategy (eating at your favourite restaurant) vs exploring new options that might be better (trying a new restaurant). Too much exploitation: never discovers better options. Too much exploration: wastes time on poor options. ε-greedy: explore randomly with probability ε.)
  3. Q-Learning update equation: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]. What is the term in brackets? (Answer: The TD error (Temporal Difference error) — the difference between the current Q estimate and the Bellman target (r + γ max Q(s',a')). If TD error > 0: current Q is too low, increase it. TD error < 0: current Q is too high, decrease it.)
  4. Why does DQN use a separate target network for the Bellman target? (Answer: Without a separate target network, both the Q network and the target (max Q(s',a')) change at every update step — like chasing a moving target. This causes instability and divergence. The target network is frozen for C steps then updated, making the target more stable during training.)
  5. What is the difference between a model-free and model-based RL algorithm? (Answer: Model-free (Q-Learning, PPO): learns directly from experience without modelling the environment dynamics P(s'|s,a). Model-based (Dyna, AlphaZero): learns a model of environment dynamics, then uses it to plan or generate simulated experience. Model-based is more sample-efficient but requires accurate models.)

On LumiChats

Claude and ChatGPT use PPO-based RLHF as the final training stage. Understanding Q-Learning and policy gradients directly explains why LLMs follow instructions — they are policy networks trained to maximise a human preference reward signal.
