Overfitting occurs when a model learns the training data so well — including noise and random variation — that it fails to generalize to new data. Underfitting occurs when a model is too simple to capture the underlying patterns, performing poorly even on training data. Managing this tradeoff is central to building ML systems that work in the real world.
## The bias-variance tradeoff
Any model's error on unseen data decomposes into three terms:
Bias-variance decomposition: Expected_Error = Bias² + Variance + σ². You cannot reduce the irreducible noise term σ² — it is inherent to the data-generating process.
| Term | Source | Symptom | Fix |
|---|---|---|---|
| Bias² | Model too simple (linear fit on nonlinear data) | High train error + high test error | More complex model, more features, nonlinear architectures |
| Variance | Model too complex (memorizes noise) | Low train error, high test error | More data, regularization, dropout, ensembling |
| Irreducible noise | Inherent noise in the data | Irreducible floor on error | Better data collection, cleaner labels |
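The decomposition can be checked numerically: fit many models on independently drawn training sets and measure, at a fixed test point, how far the average prediction sits from the truth (bias) and how much predictions scatter around their own average (variance). A minimal sketch with NumPy polynomial fits — the target function, noise level, and degrees are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # true function (illustrative)
x_test, sigma = 0.3, 0.2                     # fixed test point, noise std

def predictions(degree, n_trials=500, n_train=30):
    """Fit a polynomial on many fresh training sets; predict at x_test."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    return preds

for degree in (1, 4, 10):
    p = predictions(degree)
    bias2 = (p.mean() - f(x_test)) ** 2      # (avg prediction - truth)²
    var = p.var()                            # scatter across training sets
    print(f"degree={degree:2d}  bias²={bias2:.4f}  variance={var:.4f}")
```

Degree 1 shows high bias and low variance; degree 10 the reverse — the table's two failure modes in miniature.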
More data reduces variance, not bias
Collecting more training data reduces variance without changing bias, which makes it the safest way to improve generalization. It will not fix underfitting, though: a high-bias model stays wrong no matter how much data you add. Increasing model complexity should come last.
## Diagnosing overfitting with learning curves
The clearest signal of overfitting is a growing gap between training and validation loss. Learning curves plot loss vs training examples (or epochs) and reveal both overfitting and underfitting:
Plotting learning curves to diagnose overfitting vs underfitting
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
model = GradientBoostingClassifier(n_estimators=200, max_depth=5)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy'
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

# Diagnosis:
#   If gap is large and growing: overfitting → regularize or add data
#   If both are low and converging: underfitting → more complex model
#   If gap is small and both are high: ideal
for size, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={size:5.0f}  train={tr:.3f}  val={va:.3f}  gap={g:.3f}")
```

### Epoch-based learning curves (deep learning)
In neural network training, plot train loss and val loss per epoch. If val loss starts rising while train loss keeps falling — that's the overfitting point. Implement early stopping: save the checkpoint at the lowest val loss, not the last epoch.
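The checkpoint-at-best-val-loss logic needs only a patience counter and a best-so-far record. A framework-agnostic sketch — the `patience` value and the fake loss sequence below are illustrative, not from the text:

```python
class EarlyStopping:
    """Stop when val loss hasn't improved for `patience` consecutive epochs."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record this epoch's val loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss      # new best: save the checkpoint here
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # starts overfitting
for epoch, vl in enumerate(val_losses):
    if stopper.step(epoch, vl):
        break
print(f"stopped at epoch {epoch}, restore checkpoint from epoch {stopper.best_epoch}")
# → stopped at epoch 6, restore checkpoint from epoch 3
```

Note that the checkpoint restored is the one from the lowest val loss (epoch 3 here), not the epoch where training halted.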
## Regularization techniques
Regularization adds a penalty term to the loss function that discourages overly complex models. The regularized objective becomes:
Regularized loss: L_reg(w) = L(w) + λ·Ω(w), where Ω(w) = Σᵢ|wᵢ| for L1 or Σᵢwᵢ² for L2. λ (lambda) controls the strength of regularization: higher λ = more penalty on complexity.
L1 (Lasso): drives many weights to exactly zero — automatic feature selection, sparse models. L2 (Ridge): shrinks all weights toward zero, especially effective against multicollinearity.
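The sparsity difference is easy to see on synthetic data where only a few features matter. A quick sketch with scikit-learn — the feature counts and alpha values are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
print(f"Lasso zeroed {lasso_zeros}/20 coefficients")   # many exact zeros
print(f"Ridge zeroed {ridge_zeros}/20 coefficients")   # typically none
```

Lasso recovers a sparse model by zeroing most of the uninformative coefficients; Ridge shrinks them but leaves them nonzero.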
L1 and L2 regularization in sklearn and PyTorch
```python
# ── sklearn: L1 and L2 in logistic regression ──────────────
from sklearn.linear_model import LogisticRegression

ridge_clf = LogisticRegression(penalty='l2', C=1.0)               # C = 1/lambda
lasso_clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
enet_clf = LogisticRegression(penalty='elasticnet', C=0.5,
                              l1_ratio=0.5, solver='saga')

# ── PyTorch: weight decay = L2 (implemented in the optimizer) ──
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4   # AdamW decouples weight decay from the gradient update
)

# For L1 regularization in PyTorch (add the penalty manually):
def l1_penalty(model, lambda_l1=1e-4):
    return lambda_l1 * sum(p.abs().sum() for p in model.parameters())

# loss = criterion(output, target) + l1_penalty(model)
```

| Technique | What it does | Best for |
|---|---|---|
| L2 (Ridge / weight decay) | Shrinks all weights toward zero | Default — neural networks, logistic regression |
| L1 (Lasso) | Zeros out many weights exactly | Feature selection, sparse linear models |
| Elastic Net | L1 + L2 combined | When both sparsity and stability are needed |
| Dropout | Randomly zeroes neurons per step | Neural networks — equivalent to implicit ensemble |
| Early stopping | Halts training at best val loss | Neural networks — most practical regularizer |
| Data augmentation | Adds synthetic training examples | Images, audio, text — very effective |
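Dropout, as listed in the table, is only a few lines: during training, zero each activation with probability p and scale the survivors by 1/(1-p) so the expected activation is unchanged (the "inverted dropout" formulation); at test time it is the identity. A NumPy sketch — p and the array shape are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    if not training:
        return x                       # test time: identity
    mask = rng.random(x.shape) >= p    # keep each unit with prob 1-p
    return x * mask / (1.0 - p)        # rescale so E[output] == x

activations = np.ones((4, 8))
out = dropout(activations, p=0.5)
print(out)  # roughly half the entries are 0.0, the rest are 2.0
```

Because each step samples a fresh mask, training effectively averages over an exponential number of thinned sub-networks — the "implicit ensemble" in the table.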
## Data augmentation to fight overfitting
Data augmentation artificially expands the training set by applying label-preserving transformations — teaching the model that certain variations don't change the underlying category:
Image augmentation with torchvision transforms
```python
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Standard augmentation pipeline for image classification
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # random spatial crop
    transforms.RandomHorizontalFlip(p=0.5),      # 50% chance of flip
    transforms.ColorJitter(brightness=0.2,
                           contrast=0.2,
                           saturation=0.2),      # color perturbation
    transforms.RandomRotation(15),               # rotate ±15°
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # ImageNet stats
])

# At test time — no augmentation, only normalize
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = CIFAR10(root='./data', train=True, transform=train_transform)
test_dataset = CIFAR10(root='./data', train=False, transform=test_transform)
```

### Mixup and CutMix
Advanced augmentation: Mixup creates training examples by linearly interpolating two images and their labels. CutMix pastes a patch from one image onto another. Both consistently reduce overfitting and improve calibration on ImageNet benchmarks.
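Mixup itself is two lines of arithmetic: draw λ ~ Beta(α, α) and form convex combinations of both the inputs and the one-hot labels. A NumPy sketch — α and the toy shapes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2     # interpolated input
    y = lam * y1 + (1 - lam) * y2     # soft label: [lam, 1 - lam] here
    return x, y, lam

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot: cat, dog

x, y, lam = mixup(img_a, lab_a, img_b, lab_b)
print(f"lam={lam:.3f}, mixed label={y}")
```

Training on these soft targets penalizes overconfident predictions between classes, which is where the calibration improvement comes from.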
## Overfitting in LLMs: memorization
LLMs face a unique form of overfitting: verbatim memorization of training examples. Carlini et al. (2021) showed that GPT-2 could reproduce exact passages from its training set when given a prefix:
Detecting memorization: membership inference test (simplified)
```python
# Researchers test memorization by measuring the log-probability the model
# assigns to known training examples vs. held-out examples.
# (Assumes `model` and `tokenizer` are a loaded causal LM, e.g. from Hugging Face.)
import torch

def compute_log_prob(model, tokenizer, text: str) -> float:
    """Higher log-prob → model has likely seen this text during training."""
    tokens = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        output = model(tokens, labels=tokens)
    # output.loss is mean cross-entropy per token, so this is the
    # average log-probability per token
    return -output.loss.item()

training_example = "This specific sentence appeared in the training corpus verbatim..."
new_example = "This sentence was never seen during training..."

lp_train = compute_log_prob(model, tokenizer, training_example)
lp_new = compute_log_prob(model, tokenizer, new_example)
print(f"Training example log-prob: {lp_train:.3f}")   # expected: much higher
print(f"New example log-prob:      {lp_new:.3f}")     # expected: lower
```

| Memorization risk factor | Mitigation |
|---|---|
| Exact duplicates in training data | Deduplicate training corpus before training |
| Small dataset with many epochs | Limit epochs, use early stopping |
| Sensitive data in pretraining | Differential privacy (DP-SGD), data audits |
| Prompts that trigger memorized sequences | Post-training safety filtering, RLHF |
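The first mitigation in the table, deduplication, can be approximated at small scale by hashing normalized text. A sketch — the normalization rule and sample corpus are illustrative; production pipelines typically use fuzzy matching such as MinHash over n-grams to also catch near-duplicates:

```python
import hashlib

def dedup(corpus):
    """Drop exact duplicates, keyed on a hash of whitespace/case-normalized text."""
    seen, kept = set(), []
    for doc in corpus:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick   brown fox.",       # duplicate after normalization
    "An entirely different document.",
]
print(dedup(corpus))  # keeps the first fox sentence and the different document
```

Removing duplicates before training directly attacks the strongest known driver of verbatim memorization: sequences the model saw many times.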
## Practice questions
- What is the mathematical relationship between model complexity and generalisation error? (Answer: Bias-variance decomposition: Expected_Error = Bias² + Variance + σ². Bias decreases with complexity (model fits training data better). Variance increases with complexity (model is sensitive to training data specifics). Optimal complexity minimises total expected error at the bias-variance trade-off point. Regularisation (L1, L2, dropout) constrains complexity, reducing variance at the cost of slightly higher bias — usually improving generalisation.)
- What is the difference between L1 and L2 regularisation in terms of the solutions they produce? (Answer: L2 (Ridge): adds λΣwᵢ² to loss — penalises large weights, keeps all features but shrinks coefficients toward zero. Closed-form solution. Encourages smooth, distributed representations. L1 (Lasso): adds λΣ|wᵢ| — produces sparse solutions (many exactly-zero weights). Acts as feature selection. No closed-form solution (non-differentiable at 0). Use L1 when: few features are expected to be relevant. Use L2 when: all features may contribute. Elastic Net combines both.)
- What is data augmentation and why is it especially effective for vision models? (Answer: Data augmentation: generate new training examples by applying label-preserving transformations to existing data — crops, flips, rotations, colour jitter, cutout, mixup, CutMix for images; back-translation, synonym substitution for text. Effective for vision because: (1) Image invariances (a cat rotated 30° is still a cat) provide free regularisation. (2) Dramatically increases effective dataset size. (3) Teaches invariances that the model should exhibit. ImageNet models trained with aggressive augmentation (RandAugment) achieve significantly better generalisation.)
- What is the difference between overfitting and underfitting in terms of the training curve? (Answer: Underfitting: high training loss AND high validation loss — model hasn't learned the data patterns. Training curve flat and above optimal performance. Overfitting: low training loss but high validation loss — model memorised training data. Training curve improves while validation curve plateaus then worsens. Correct fitting: both curves improve together and converge to similar values. Early stopping uses the validation curve to catch the moment validation loss starts increasing — stopping training at the optimal generalisation point.)
- What is double descent in modern deep learning and why does it challenge the traditional bias-variance trade-off? (Answer: Classical view: model complexity → U-shaped test error (underfitting → optimal → overfitting). Double descent (Belkin et al. 2019): test error decreases, then increases (classical overfitting), then decreases AGAIN as model becomes massively overparameterised. The second descent occurs when the model has enough capacity to interpolate the training data AND still generalise. Neural networks and transformers with billions of parameters are in this 'benign overfitting' regime — they memorise training data yet generalise well, contradicting classical intuition.)