
Reinforcement Learning (RL)

Learning through rewards and consequences.


Definition

Reinforcement learning is an ML paradigm where an agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones. The agent's goal is to learn a policy — a mapping from states to actions — that maximizes cumulative long-term reward. RL powers game-playing AI, robotics, and the RLHF technique used to align LLMs.

Core concepts: agent, environment, reward

Reinforcement learning is formalized as a Markov Decision Process (MDP). The agent's goal is to learn a policy π(a|s) that maximizes the expected sum of discounted future rewards:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^∞ γ^k r_{t+k+1}

Return G_t: total discounted future reward from timestep t. γ ∈ [0,1] is the discount factor — γ close to 1 values future rewards highly, γ close to 0 makes the agent focus on immediate rewards.

V^π(s) = E_π[G_t | s_t = s]

State value function V^π(s): expected return when starting in state s and following policy π. The optimal policy maximizes V^π(s) for every state.

| Component | Description | Example (chess AI) |
| --- | --- | --- |
| State s | Current observable situation | Board position (8×8 grid, piece locations) |
| Action a | Choice made by the agent | Move a piece from e2 to e4 |
| Reward r | Scalar feedback signal | +1 win, −1 loss, 0 draw |
| Policy π(a\|s) | Maps states to action probabilities | Probability of each legal move given board |
| Episode | Complete sequence: start → terminal | One chess game from opening to checkmate |
| Discount γ | How much future rewards are valued | 0.99 (care about winning, not just next move) |
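As a quick sanity check, the return G_t can be computed by folding rewards backwards from the end of an episode. This small sketch (the helper name `discounted_return` is illustrative, not from any library) shows how γ shapes the total:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

# A reward of +1 arriving two steps in the future:
print(discounted_return([0, 0, 1], gamma=0.99))  # ≈ 0.9801 — barely discounted
print(discounted_return([0, 0, 1], gamma=0.5))   # 0.25 — heavily discounted
```

The backward fold is the standard way to compute returns in a single pass; it mirrors the recursion G_t = r_{t+1} + γ G_{t+1}.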

Value functions and Q-learning

Q-learning learns the action-value function Q(s,a) — the expected return from taking action a in state s. The Bellman equation provides a recursive definition that enables iterative learning:

Q*(s,a) = E[ r + γ max_{a'} Q*(s',a') ]

Bellman optimality equation for Q*. The optimal Q-value for (s,a) equals the immediate reward r plus the discounted value of the best action in the next state s'.

Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]

Q-learning update rule. α is the learning rate. The term in brackets is the TD error (temporal difference error) — how wrong our current Q estimate is.

Q-learning from scratch on a simple grid world

import numpy as np

# Grid world: 4×4 states, 4 actions (up/down/left/right), goal at (3,3)
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # Initialize Q table

def step(s, a):
    """Simplified grid world step function."""
    row, col = s // 4, s % 4
    moves = [(-1,0),(1,0),(0,-1),(0,1)]   # up, down, left, right
    dr, dc = moves[a]
    new_row = max(0, min(3, row + dr))
    new_col = max(0, min(3, col + dc))
    s_next = new_row * 4 + new_col
    reward = 1.0 if s_next == 15 else -0.01  # +1 at goal, -0.01 otherwise
    done = (s_next == 15)
    return s_next, reward, done

# Training loop
lr, gamma, epsilon = 0.1, 0.99, 0.3

for episode in range(5000):
    s = 0   # start at (0,0)
    for _ in range(100):
        # ε-greedy action selection
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)    # explore
        else:
            a = np.argmax(Q[s])                 # exploit

        s_next, r, done = step(s, a)

        # Q-learning update (Bellman equation)
        td_error = r + gamma * np.max(Q[s_next]) * (1 - done) - Q[s, a]
        Q[s, a] += lr * td_error

        s = s_next
        if done: break

print("Learned Q-values at start state (0):", Q[0].round(3))
print("Optimal action:", ['up','down','left','right'][np.argmax(Q[0])])

Policy gradient methods

Policy gradient methods directly optimize the policy parameters θ by gradient ascent on the expected return J(θ). The REINFORCE policy gradient theorem gives the gradient of J(θ):

∇_θ J(θ) = E_π[ Σ_t G_t ∇_θ log π_θ(a_t|s_t) ]

Policy gradient theorem (Williams, 1992): increase the log-probability of actions that led to high returns, decrease it for low-return actions. G_t is the return from timestep t.

L^CLIP(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ]

PPO (Proximal Policy Optimization) clipped objective. r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio and Â_t is the advantage estimate. Clipping the ratio to [1−ε, 1+ε] prevents large destabilizing updates.
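The clipped objective fits in a few lines of PyTorch. This is a minimal sketch assuming per-action log-probabilities and precomputed advantages (the helper name `ppo_clipped_loss` and the default ε = 0.2 are illustrative choices, not a library API):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (negated, so it can be minimized)."""
    ratio = torch.exp(logp_new - logp_old)                    # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the min makes the objective pessimistic: large ratio
    # changes can never increase the surrogate beyond the clip range.
    return -torch.min(unclipped, clipped).mean()
```

With positive advantages, the loss stops rewarding ratio increases beyond 1 + ε; with negative advantages, it stops rewarding ratio decreases below 1 − ε.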

| Algorithm | Key idea | Used in |
| --- | --- | --- |
| REINFORCE | Monte Carlo policy gradient — update after full episodes | Simple environments, theoretical baseline |
| Actor-Critic (A2C/A3C) | Policy net (actor) + value net (critic) reduces variance | Atari, continuous control |
| PPO | Clip probability ratio to prevent large updates | ChatGPT RLHF, robotics, game AI |
| SAC | Maximize reward + entropy (explore more) | Continuous control, robotics |
| GRPO | Group relative policy optimization — no critic needed | DeepSeek-R1 reasoning |
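The REINFORCE row above can be sketched as a single update from one completed episode. This is a toy illustration, not a library API: the helper `reinforce_update` and the return-normalization trick (a common variance-reduction heuristic) are assumptions.

```python
import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a completed episode.

    log_probs: list of log π_θ(a_t|s_t) tensors collected while acting.
    rewards:   list of scalar rewards r_t, same length.
    """
    # Compute discounted returns G_t backwards through the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns reduces gradient variance (common heuristic)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Gradient ascent on J(θ) = descent on -Σ G_t log π_θ(a_t|s_t)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the update only happens after a full episode, REINFORCE is simple but high-variance — exactly the weakness actor-critic methods address.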

RL landmarks and applications

Key milestones that show what RL can achieve at scale:

| Milestone | Year | Key contribution |
| --- | --- | --- |
| DQN (DeepMind) | 2015 | Deep Q-network achieves human-level Atari performance on 49 games from raw pixels |
| AlphaGo | 2016 | Defeats world Go champion Lee Sedol; Go has ~10^170 board states |
| AlphaZero | 2017 | Surpasses all prior chess/Go/shogi engines using only self-play, zero human knowledge |
| OpenAI Five | 2019 | Defeats Dota 2 world champions with 45,000 years of self-play experience |
| InstructGPT / ChatGPT | 2022 | RLHF + PPO aligns GPT-3 with human preferences — transforms text completion into assistants |
| DeepSeek-R1 | 2025 | Pure RL (GRPO) teaches chain-of-thought reasoning without supervised data |

Compute required for AlphaZero

AlphaZero trained for 9 hours on 5,000 TPUs for chess, playing 44 million games against itself. Each game generates training data: (board state, move probabilities, game outcome). The final policy is the result of 700,000 gradient steps.

RL for LLM alignment (RLHF)

RLHF applies RL to steer LLM behavior toward human preferences. The policy is the LLM, actions are tokens, and the reward comes from a human-trained reward model. The combined training objective penalizes drift from the SFT model via a KL divergence term:

R_total(x, y) = r_φ(x, y) − β · KL( π_θ(y|x) ‖ π_SFT(y|x) )

Total RLHF reward: reward model score minus β × KL divergence from the SFT model. β controls how much the model is allowed to deviate from safe SFT behavior. Without the KL term, the model reward-hacks by generating nonsense text that tricks the reward model.
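A minimal sketch of that combined reward, assuming per-position logits from the policy and the frozen SFT model. The function name and the β default are illustrative; note that this computes the full distributional KL at each position, whereas production implementations often use a per-sampled-token approximation.

```python
import torch
import torch.nn.functional as F

def rlhf_reward(rm_score, policy_logits, sft_logits, beta=0.1):
    """Per-sequence RLHF reward: reward-model score minus β·KL(policy ‖ SFT).

    rm_score:      scalar score from the reward model.
    policy_logits: (seq_len, vocab) logits from the policy being trained.
    sft_logits:    (seq_len, vocab) logits from the frozen SFT model.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(sft_logits, dim=-1)
    # KL(policy ‖ SFT) at each position, summed over the sequence
    kl = (logp.exp() * (logp - logq)).sum(dim=-1).sum()
    return rm_score - beta * kl
```

If the policy matches the SFT model exactly, the KL term vanishes and the reward is just the reward-model score; the further the policy drifts, the larger the penalty.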

RLHF reward model training (simplified Bradley-Terry preference model)

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Reward model: LLM backbone + linear head outputting scalar reward."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                          # e.g. GPT-2
        self.reward_head = nn.Linear(backbone.config.n_embd, 1)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids).last_hidden_state[:, -1, :]
        return self.reward_head(hidden).squeeze(-1)   # scalar reward


def preference_loss(reward_model, chosen_ids, rejected_ids):
    """
    Bradley-Terry loss: maximize gap between chosen and rejected rewards.
    Equivalent to binary cross-entropy on (chosen, rejected) pairs.
    """
    r_chosen   = reward_model(chosen_ids)    # should be higher
    r_rejected = reward_model(rejected_ids)  # should be lower

    # Minimize NLL of the preference: log σ(r_chosen - r_rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    return loss

# Training: for each (prompt, chosen_response, rejected_response) triplet,
# compute preference_loss and backpropagate.

RLHF → DPO evolution

RLHF requires training three separate models (SFT → reward model → policy with PPO). DPO (Rafailov et al., 2023) showed you can skip the reward model entirely by directly optimizing the policy from preferences — reducing RLHF's 3-stage pipeline to a single fine-tuning step. Most modern open-source model training now uses DPO or its variants (ORPO, SimPO).
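DPO's loss only needs summed log-probabilities of each response under the policy and the frozen reference (SFT) model — a minimal sketch, with the helper name and β default as illustrative choices:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on summed sequence log-probs (Rafailov et al., 2023).

    Each argument is a tensor of per-example log-probabilities, summed
    over response tokens. The implicit reward of a response is β times
    the log-ratio between the policy and the frozen reference model.
    """
    chosen_rewards   = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Same Bradley-Terry form as reward-model training, but computed
    # directly from the policy — no separate reward model, no PPO loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

When the policy equals the reference model, both implicit rewards are zero and the loss sits at log 2 ≈ 0.693; training pushes the chosen response's log-ratio above the rejected one's.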

Practice questions

  1. What is the Bellman equation and why is it fundamental to RL? (Answer: Bellman equation: V(s) = max_a [R(s,a) + γ Σ P(s'|s,a) V(s')]. The value of a state = maximum expected immediate reward + discounted future value, averaged over transition probabilities. Key insight: the optimal value function satisfies this recursive equation. Value iteration: repeatedly apply the Bellman update until convergence — finds optimal policy without learning a model. Q-learning: learns Q(s,a) values using a sample-based Bellman update. The Bellman equation is the bridge between MDP theory and practical RL algorithms.)
  2. What is the exploration-exploitation trade-off and how does ε-greedy address it? (Answer: Exploitation: choose the action with highest estimated reward (greedy). May get stuck in local optima. Exploration: try other actions to discover better strategies. Wastes short-term reward. ε-greedy: with probability ε explore (random action), with probability 1-ε exploit (best known action). Common schedule: ε=1.0 at start (pure exploration), decay to ε=0.01 over training (mostly exploit). Trade-off: too much exploration → slow learning; too little → local optima. UCB (Upper Confidence Bound) exploration is theoretically optimal for bandits.)
  3. What is the difference between on-policy and off-policy learning? (Answer: On-policy: learns the value of the policy currently being used for exploration. SARSA: updates Q(s,a) using the action actually taken by the current policy. The behaviour policy = target policy. Off-policy: learns the optimal policy while following a different (exploration) policy. Q-learning: updates Q(s,a) using max_a'Q(s',a') — the optimal action, even if a different action was taken. Off-policy enables: learning from historical data (logged actions), using a safer exploration policy while learning an optimal one.)
  4. Why is training a Deep Q-Network (DQN) unstable without experience replay and target networks? (Answer: Instability sources: (1) Correlated samples: consecutive transitions (s_t, a_t, r_t, s_{t+1}) are highly correlated — violating the IID assumption of gradient descent. Experience replay: store transitions in a buffer and sample random mini-batches — this breaks the correlations. (2) Moving-target problem: the target r + γ max_a' Q(s',a') is computed with the same network being updated, so the regression chases a moving target. Separate target network: a frozen copy updated every N steps. These two stabilisations (Mnih et al. 2015) made DQN training on Atari games feasible.)
  5. What is policy gradient and how does REINFORCE work? (Answer: Policy gradient: directly optimise the policy π_θ (a neural network) by gradient ascent on expected return J(θ) = E[Σ r_t]. REINFORCE: ∇J(θ) = E[G_t ∇log π_θ(a_t|s_t)] where G_t = Σγ^k r_{t+k} (discounted return). In practice: sample a full episode, compute G_t at each step, update θ ← θ + α G_t ∇log π_θ(a_t|s_t). Intuition: increase log probability of actions that led to high returns, decrease for low returns. High variance (needs many episodes). Actor-Critic methods (PPO, A3C) reduce variance.)
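The decay schedule mentioned in question 2 can be sketched as a linear interpolation (the helper name and the endpoint values are illustrative; exponential decay is an equally common choice):

```python
def epsilon_schedule(step, total_steps, eps_start=1.0, eps_end=0.01):
    """Linear ε decay: pure exploration at the start, mostly exploitation later."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

print(epsilon_schedule(0, 10_000))       # 1.0 — explore everything early
print(epsilon_schedule(5_000, 10_000))   # ≈ 0.505 — halfway through decay
print(epsilon_schedule(10_000, 10_000))  # ≈ 0.01 — mostly exploit
```

Plugging a schedule like this into the grid-world loop above (in place of the fixed ε = 0.3) typically speeds convergence: early random exploration finds the goal, late exploitation refines the Q-values along the good path.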

