The learning rate controls how large a step gradient descent takes at each iteration. Too large: oscillates and diverges. Too small: converges painfully slowly. SGD (Stochastic Gradient Descent) updates parameters using one example at a time. Mini-batch SGD uses small batches. Momentum accumulates past gradients for smoother updates. Adam combines adaptive learning rates with momentum and is the default optimiser for most deep learning. Choosing and tuning the optimiser is among the most impactful decisions in model training.
Real-life analogy: Walking down a mountain blindfolded
Imagine descending a mountain blindfolded, feeling only the slope under your feet. Gradient descent: always step in the steepest downhill direction. Learning rate: how large each step is. Too large = you might step over the valley into another hill. Too small = you take forever. Momentum: you build speed in consistent downhill directions, avoiding zig-zagging in narrow valleys. Adam: you automatically adjust your step size per direction — tiny steps in steep areas, larger steps in flat areas.
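The momentum idea in the analogy can be sketched in a few lines. This is a minimal NumPy illustration (the function name `sgd_momentum_step` is made up here), using the same "heavy ball" form that PyTorch's `optim.SGD(momentum=0.9)` implements:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: velocity is a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Two identical gradients: the second step is larger because velocity has built up
w, v = np.array([0.0]), np.array([0.0])
w, v = sgd_momentum_step(w, np.array([1.0]), v)   # velocity = 1.0, step = 0.01
w, v = sgd_momentum_step(w, np.array([1.0]), v)   # velocity = 1.9, step = 0.019
```

In a consistent downhill direction the velocity keeps growing (up to a limit of 1/(1−β) times the gradient), which is exactly the "building speed" in the analogy; gradients that flip sign partially cancel, damping the zig-zag.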
SGD and mini-batch gradient descent
| Variant | Batch size | Updates per epoch | Noise | Memory | Best for |
|---|---|---|---|---|---|
| Batch GD | All n examples | 1 | None (exact gradient) | High — needs full dataset | Small datasets, convex problems |
| Stochastic GD (SGD) | 1 example | n | Very high | O(1) | Online learning, huge datasets |
| Mini-batch GD | 32–512 examples | n/batch_size | Moderate (beneficial) | Low | Standard deep learning — best balance |
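The "updates per epoch" column can be checked with a small NumPy sketch (illustrative only; `gd_epoch` is a made-up helper, and the learning rate is tuned for this toy problem):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w  # noise-free linear regression target

def gd_epoch(w, batch_size, lr=0.1):
    """One epoch of (mini-batch) gradient descent on MSE; returns w and update count."""
    n = len(X)
    idx = rng.permutation(n)  # shuffle each epoch
    updates = 0
    for start in range(0, n, batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient on the batch
        w = w - lr * grad
        updates += 1
    return w, updates

w = np.zeros(3)
w, u_batch = gd_epoch(w, batch_size=1000)  # batch GD: 1 update per epoch
w, u_mini = gd_epoch(w, batch_size=32)     # mini-batch: ceil(1000/32) = 32 updates
w, u_sgd = gd_epoch(w, batch_size=1)       # SGD: 1000 updates per epoch
```

Note how the memory column follows directly: the gradient only needs the current batch `X[b]` in memory, so SGD is O(1) per update while batch GD must touch all n examples for a single step.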
SGD variants comparison with PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# Simple linear regression problem
torch.manual_seed(42)
X = torch.randn(1000, 5)
y = X @ torch.tensor([2., -1., 0.5, 3., -2.]) + 0.1 * torch.randn(1000)

model_sgd = nn.Linear(5, 1)
model_adam = nn.Linear(5, 1)
model_mom = nn.Linear(5, 1)
loss_fn = nn.MSELoss()

# Different optimisers
opt_sgd = optim.SGD(model_sgd.parameters(), lr=0.01)
opt_adam = optim.Adam(model_adam.parameters(), lr=0.001)  # Default β1=0.9, β2=0.999
opt_mom = optim.SGD(model_mom.parameters(), lr=0.01, momentum=0.9)

dataset = torch.utils.data.TensorDataset(X, y.unsqueeze(1))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(10):
    losses = {'sgd': [], 'adam': [], 'momentum': []}
    for X_batch, y_batch in loader:
        for model, opt, name in [(model_sgd, opt_sgd, 'sgd'),
                                 (model_adam, opt_adam, 'adam'),
                                 (model_mom, opt_mom, 'momentum')]:
            opt.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()
            opt.step()
            losses[name].append(loss.item())
    if epoch % 2 == 0:
        for name in losses:
            print(f"Epoch {epoch} {name}: {np.mean(losses[name]):.4f}")
```

Adam optimiser — the modern default
Adam maintains two running statistics per parameter: m_t, the first moment (momentum, default β₁ = 0.9), and v_t, the second moment (a moving average of squared gradients used for the adaptive learning rate, default β₂ = 0.999). Both are bias-corrected to offset their zero initialisation: m̂_t = m_t/(1−β₁ᵗ) and v̂_t = v_t/(1−β₂ᵗ). The update is θ_t = θ_{t−1} − α·m̂_t/(√v̂_t + ε): the effective step is large for parameters with a small gradient history (e.g. rarely updated ones) and small for parameters with consistently large gradients.
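These moment updates can be written out directly. Below is a from-scratch NumPy sketch of a single Adam step (`adam_step` is an illustrative name, not a PyTorch API); on the very first step the bias correction makes m̂/√v̂ ≈ sign(grad), so each parameter moves by roughly the learning rate α:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count (needed for bias correction)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (adaptive scale)
    m_hat = m / (1 - beta1 ** t)              # bias correction: undo zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
grad = 2 * theta  # gradient of f(theta) = sum(theta**2)
theta, m, v = adam_step(theta, grad, m, v, t=1)
# First step moves each parameter by ~lr opposite its gradient sign
```

Without the bias correction, m and v would start near zero and the first steps would be far too small; dividing by (1−βᵗ) exactly compensates for that warm-up bias.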
Learning rate schedulers and warmup
```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import (StepLR, CosineAnnealingLR,
                                      ReduceLROnPlateau, OneCycleLR)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# In practice, attach ONE scheduler per optimizer — stepping several at once
# compounds their decays. All four are shown here for reference.

# 1. Step decay: reduce LR by gamma every step_size epochs
scheduler_step = StepLR(optimizer, step_size=10, gamma=0.5)
# LR: epochs 0-9: 1e-3, epochs 10-19: 5e-4, epochs 20-29: 2.5e-4, ...

# 2. Cosine annealing: smoothly reduce LR to eta_min over T_max epochs
scheduler_cos = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# 3. Reduce on plateau: reduce when a monitored metric stops improving
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min',
                                      factor=0.5, patience=5, min_lr=1e-7)

# 4. OneCycle: warm up then anneal (good for fast training)
scheduler_one = OneCycleLR(optimizer, max_lr=1e-2,
                           steps_per_epoch=100, epochs=10)

# Training loop with a scheduler (here: reduce-on-plateau)
for epoch in range(100):
    val_loss = train_one_epoch()  # Your training function; should return validation loss
    scheduler_plateau.step(val_loss)  # Plateau scheduler needs the monitored metric
    # For step/cosine schedulers, call scheduler.step() with no argument each epoch instead
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: LR = {current_lr:.2e}, Loss = {val_loss:.4f}")
```
Popular optimisers and when to use them:

- SGD + Momentum: theoretical generalisation advantages; common for computer-vision models
- Adam: fastest convergence; the default for NLP and transformers; mixed results on CV
- AdamW: Adam with decoupled weight decay (better generalisation); standard for modern large language models
- RMSprop: good for RNNs; similar to Adam without the first moment (momentum)
- Adagrad: sparse data, NLP; adapts per-parameter, but the LR shrinks monotonically towards zero

Practice questions
- Learning rate of 10.0 vs 0.000001 — what happens with each? (Answer: LR=10: huge steps overshoot the minimum, oscillate wildly, may diverge (loss increases). LR=0.000001: infinitesimally small steps, learning is correct but takes millions of iterations to converge — impractically slow.)
- Why does SGD noise (from using one example at a time) sometimes help? (Answer: Noise helps escape local minima and saddle points — random perturbations can kick the optimiser out of flat regions. SGD noise also acts as implicit regularisation, often finding flatter minima that generalise better than the sharp minima that batch GD tends to find.)
- Adam uses β₁=0.9 and β₂=0.999. What do these hyperparameters control? (Answer: β₁=0.9: exponential decay rate for first moment (gradient momentum) — 90% of past gradients kept, 10% of current gradient. β₂=0.999: decay rate for second moment (gradient variance) — slow-moving estimate of per-parameter gradient squared. Higher = smoother, more history retained.)
- What is the difference between AdaGrad and Adam regarding learning rate decay? (Answer: AdaGrad accumulates all past squared gradients — learning rate shrinks monotonically and eventually reaches near-zero (learning stops). Adam uses exponential moving average of squared gradients — old information decays away, preventing the learning rate from shrinking to zero.)
- Why is learning rate warmup used in transformer training? (Answer: At the start of training, the model is randomly initialised — gradients are noisy and large. A large learning rate immediately would cause destructive updates. Warmup linearly increases LR from 0 to target over the first 1000-10000 steps, letting the model stabilise before taking large steps.)
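The warmup described in the last answer can be sketched with PyTorch's `LambdaLR`, which multiplies the base LR by a user-supplied factor each step. The 100-step warmup length here is illustrative; real transformer runs use thousands of steps:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 100  # illustrative; transformer runs typically use 1000-10000

def warmup_factor(step):
    """Multiplier on the base LR: linear ramp from ~0 up to 1, then constant."""
    return min(1.0, (step + 1) / WARMUP_STEPS)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)

lrs = []
for step in range(200):
    optimizer.step()   # gradients omitted: this loop only traces the LR curve
    scheduler.step()   # called per step (not per epoch) for warmup
    lrs.append(optimizer.param_groups[0]['lr'])
# lrs ramps linearly to 1e-3 over the first ~100 steps, then stays flat
```

In practice the warmup ramp is usually composed with a decay schedule (e.g. cosine or inverse square root) inside the same `lr_lambda` function, so the LR rises, peaks, and then anneals.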
On LumiChats
LumiChats can help you choose the right optimiser and learning rate for your specific model and dataset, debug slow convergence or loss spikes, and implement learning rate scheduling strategies in PyTorch or TensorFlow.
Try it free