Backpropagation (backprop) is the algorithm used to compute gradients in neural networks — determining how much each parameter should change to reduce the loss. It applies the chain rule of calculus to efficiently propagate error signals from the output layer backward through all layers, enabling gradient-based optimization of networks with millions or billions of parameters.
Forward pass vs backward pass
Training a neural network has two phases:
- Forward pass: Input flows through the network layer by layer. Each layer applies a linear transformation (Wx + b) followed by a nonlinear activation (ReLU, GELU). The final output is compared to the ground truth using a loss function, producing a scalar loss value.
- Backward pass: The loss signal flows backward using the chain rule. Gradients are computed for every parameter with respect to the loss — these tell us: "if I increase this weight slightly, does loss go up or down, and by how much?"
Simple 2-layer neural network: forward and backward pass from scratch
import numpy as np
def relu(x): return np.maximum(0, x)
def relu_grad(x): return (x > 0).astype(float)
# Network: Input(3) → Hidden(4) → Output(1)
np.random.seed(42)
W1 = np.random.randn(4, 3) * 0.1 # (hidden, input)
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.1 # (output, hidden)
b2 = np.zeros((1, 1))
X = np.array([[1.0], [2.0], [3.0]]) # input
y = np.array([[1.0]]) # true label
lr = 0.01
for step in range(5):
    # ─── FORWARD PASS ───────────────────────────────────────
    z1 = W1 @ X + b1                  # pre-activation: (4,1)
    a1 = relu(z1)                     # post-activation: (4,1)
    z2 = W2 @ a1 + b2                 # output pre-activation: (1,1)
    loss = 0.5 * (z2 - y)**2          # MSE loss

    # ─── BACKWARD PASS (chain rule) ────────────────────────
    dL_dz2 = z2 - y                   # ∂L/∂z2
    dL_dW2 = dL_dz2 @ a1.T            # ∂L/∂W2
    dL_db2 = dL_dz2                   # ∂L/∂b2
    dL_da1 = W2.T @ dL_dz2            # ∂L/∂a1
    dL_dz1 = dL_da1 * relu_grad(z1)   # ∂L/∂z1 (chain rule through ReLU)
    dL_dW1 = dL_dz1 @ X.T             # ∂L/∂W1
    dL_db1 = dL_dz1                   # ∂L/∂b1

    # ─── GRADIENT DESCENT UPDATE ───────────────────────────
    W2 -= lr * dL_dW2; b2 -= lr * dL_db2
    W1 -= lr * dL_dW1; b1 -= lr * dL_db1
    print(f"Step {step+1}: loss = {float(loss):.6f}")
# Step 1: loss = 0.480275
# Step 2: loss = 0.453089
# ...
# Step 5: loss = 0.387844 ← decreasing ✓

The chain rule: the math behind backprop
Backprop is just the chain rule of calculus applied repeatedly. For a composed function y = f(g(x)):

dy/dx = (dy/dg) · (dg/dx) = f′(g(x)) · g′(x)

This is the fundamental theorem that makes backpropagation work. For a network with layers L₁, L₂, ..., Lₙ, the gradient of the loss w.r.t. a weight Wᵢ in layer Lᵢ is a product of local gradients from the output layer back to layer i:

∂L/∂Wᵢ = (∂L/∂aₙ) · (∂aₙ/∂aₙ₋₁) · ... · (∂aᵢ₊₁/∂aᵢ) · (∂aᵢ/∂Wᵢ)

This product is computed efficiently using dynamic programming: each layer reuses the intermediate gradients already computed for the layers above it, instead of recomputing the whole product from scratch.
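A quick numerical sanity check of the chain rule (a standalone sketch, separate from the network above): for h(x) = sin(x²), the analytic derivative from the chain rule should match a finite-difference estimate.

```python
import numpy as np

# h(x) = sin(x**2): outer function f = sin, inner function g = x**2
x = 1.5
analytic = np.cos(x**2) * 2 * x   # chain rule: f'(g(x)) * g'(x)

# Central-difference estimate of the same derivative
eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)

print(analytic, numeric)  # the two values agree to ~6 decimal places
```

The same agreement check (a "gradient check") is the standard way to debug a hand-written backward pass.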
Why backprop is efficient
Without backprop, estimating gradients by finite differences would require one or two extra forward passes per parameter — roughly 2B passes for a 1B-parameter model. Backprop computes all gradients in 1 forward + 1 backward pass (together about the cost of three forward passes) — a speedup of hundreds of millions of times. This is why deep learning became practical.
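The cost gap is easy to see on a tiny model (a minimal sketch on a one-parameter-vector linear model, not the network above): the analytic gradient falls out of a single expression, while finite differences must loop over parameters, paying two forward passes each.

```python
import numpy as np

# Tiny linear model: loss(w) = 0.5 * (w·x - y)^2
np.random.seed(0)
w = np.random.randn(3)
x = np.array([1.0, 2.0, 3.0])
y = 1.0

def loss(w):
    return 0.5 * (w @ x - y) ** 2

# Analytic gradient (what backprop computes): dL/dw = (w·x - y) * x
analytic = (w @ x - y) * x

# Finite-difference gradient: 2 forward passes PER parameter
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    numeric[i] = (loss(wp) - loss(wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny discrepancy: they agree
```

With 3 parameters the loop costs 6 forward passes; with 10⁹ parameters it would cost 2×10⁹, while the analytic route stays at one backward pass.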
Automatic differentiation with PyTorch
In practice, you never implement backprop manually. Frameworks like PyTorch use automatic differentiation (autograd) — they track all operations in a computational graph and compute gradients automatically:
PyTorch autograd — backprop in 3 lines
import torch
import torch.nn as nn
# Same 2-layer network, PyTorch style
model = nn.Sequential(
nn.Linear(3, 4), # W1, b1
nn.ReLU(),
nn.Linear(4, 1), # W2, b2
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.tensor([[1.0, 2.0, 3.0]]) # (1, 3)
y = torch.tensor([[1.0]]) # (1, 1)
for step in range(5):
    optimizer.zero_grad()     # clear previous gradients
    pred = model(X)           # forward pass (autograd builds graph)
    loss = loss_fn(pred, y)   # compute loss
    loss.backward()           # ← backprop: computes all gradients automatically
    optimizer.step()          # ← gradient descent: updates all parameters
    print(f"Step {step+1}: loss = {loss.item():.6f}")
# Inspect a gradient after backward:
for name, param in model.named_parameters():
    print(f"{name}: grad shape = {param.grad.shape}")

Vanishing and exploding gradients
In deep networks, gradients are multiplied through many layers. The gradient reaching the first layer is a product of per-layer factors:

∂L/∂a₁ = (∂L/∂aₙ) · (∂aₙ/∂aₙ₋₁) · ... · (∂a₂/∂a₁)

If each factor is < 1, the product → 0 (vanishing). If each is > 1, the product → ∞ (exploding).
- Vanishing gradients: Earlier layers learn extremely slowly or not at all. This was the main obstacle to training deep networks before roughly 2015. Solutions: ReLU activations (gradient = 1 when active), residual connections (skip connections create gradient highways), and proper weight initialization (Xavier, He).
- Exploding gradients: Loss diverges, NaN parameters. Solution: gradient clipping — cap the gradient norm before updating. Standard in all LLM training.
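A back-of-the-envelope illustration of why the product matters, assuming for simplicity that every one of 50 layers contributes the same constant local-gradient factor:

```python
# Fifty layers, each multiplying the gradient by a constant factor:
shrink = 0.25 ** 50   # 0.25 is e.g. sigmoid's maximum derivative
grow = 1.5 ** 50

print(f"{shrink:.3e}")  # ~7.9e-31 — the gradient has effectively vanished
print(f"{grow:.3e}")    # ~6.4e+08 — the gradient has exploded
```

Real per-layer factors vary, but the exponential trend is the same: depth amplifies any systematic bias in the factors.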
Gradient clipping — essential for stable LLM training
import torch
# After loss.backward(), before optimizer.step():
max_norm = 1.0 # typical value used in GPT, LLaMA training
# Clip gradients of all parameters
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
# This caps the total gradient norm at 1.0
# Individual parameters' gradients are rescaled proportionally
# so no single update step is too large

Practice questions
- What is the chain rule and how does backpropagation apply it? (Answer: Chain rule: dL/dx = dL/dy × dy/dx for nested functions y=f(x), L=g(y). Backpropagation systematically applies chain rule through the computational graph: compute loss L, then propagate dL/dW backward through each layer. For a 3-layer network: dL/dW₁ = dL/dA₃ × dA₃/dZ₃ × dZ₃/dA₂ × dA₂/dZ₂ × dZ₂/dA₁ × dA₁/dZ₁ × dZ₁/dW₁. Each multiplication is an application of chain rule through one layer.)
- What is the difference between forward mode and reverse mode automatic differentiation? (Answer: Forward mode (forward-pass AD): compute derivatives w.r.t. one input simultaneously with the forward pass — efficient when # outputs >> # inputs. Reverse mode (backpropagation): compute derivatives of one output w.r.t. all inputs — efficient when # inputs >> # outputs. Neural networks: millions of parameters (inputs), one loss value (output). Reverse mode is O(forward_pass) regardless of parameter count — makes training tractable. Forward mode would require one pass per parameter — O(n_params × forward_pass).)
- What is the vanishing gradient problem in deep networks and how do modern architectures address it? (Answer: Vanishing gradient: gradients diminish exponentially as they backpropagate through many layers (each sigmoid derivative ≤ 0.25). Deep networks (50+ layers) with sigmoid/tanh: gradient at layer 1 ≈ 0.25^50 ≈ 10^-30 — no learning occurs. Solutions: (1) ReLU activations: gradient 1 for positive, 0 for negative — no shrinkage for active neurons. (2) Residual connections (He et al. 2016): direct gradient highways through skip connections. (3) Layer normalisation: prevents extreme pre-activation magnitudes. (4) Batch normalisation.)
- What is gradient explosion and how is gradient clipping different from weight decay in addressing it? (Answer: Gradient explosion: gradients grow exponentially in deep networks — loss becomes NaN. Most common in RNNs. Gradient clipping: scale gradient vector down if its norm exceeds max_norm before each optimizer step. Directly constrains gradient magnitude. Weight decay (L2): adds λ||W||² to loss — pulls weights toward zero during training. Prevents weights from growing large (which can cause large activations and gradients). Different mechanisms: clipping is reactive (after computing gradient); weight decay is preventive (regularises weights throughout training).)
- What is the computational complexity of one backpropagation pass relative to one forward pass? (Answer: Backpropagation is approximately 2× the cost of a forward pass (sometimes 3×). Forward pass: compute layer activations. Backpropagation: (1) Compute output gradients: similar to forward. (2) Compute weight gradients: requires outer products of activation vectors. The 2-3× factor is why training is roughly 3× more expensive than inference per batch. For large models: activations must be stored during the forward pass for use in backpropagation — this is why training requires much more GPU memory than inference.)
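The sigmoid-derivative bound used in the vanishing-gradient answer (σ′(x) ≤ 0.25) can be verified directly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# σ'(x) = σ(x) * (1 - σ(x)), maximized at x = 0 where σ(0) = 0.5
xs = np.linspace(-10, 10, 10001)
derivs = sigmoid(xs) * (1 - sigmoid(xs))

print(derivs.max())  # 0.25, attained at x = 0
print(0.25 ** 50)    # ≈ 7.9e-31: the best case for fifty sigmoid layers
```

Since 0.25 is the *maximum* of the derivative, 0.25⁵⁰ is an upper bound on the gradient product through 50 sigmoid layers — in practice it is even smaller, which is why ReLU and residual connections displaced deep sigmoid stacks.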