Overfitting occurs when a model learns the training data so well — including noise and random variation — that it fails to generalize to new data. Underfitting occurs when a model is too simple to capture the underlying patterns, performing poorly even on training data. Managing this tradeoff is central to building ML systems that work in the real world.
## The bias-variance tradeoff
Any model's error on unseen data decomposes into three terms:
Bias-variance decomposition: Expected_Error = Bias² + Variance + σ². You cannot reduce the irreducible noise term σ² — it is inherent to the data-generating process.
| Term | Source | Symptom | Fix |
|---|---|---|---|
| Bias² | Model too simple (linear fit on nonlinear data) | High train error + high test error | More complex model, more features, nonlinear architectures |
| Variance | Model too complex (memorizes noise) | Low train error, high test error | More data, regularization, dropout, ensembling |
| Irreducible noise | Inherent noise in the data | Irreducible floor on error | Better data collection, cleaner labels |
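The decomposition can be checked numerically: fit many models on independently drawn training sets and measure, at a fixed test point, how far the average prediction sits from the truth (bias) and how much predictions scatter around their own average (variance). A minimal sketch with NumPy polynomial fits — the target function, noise level, and degrees are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # true function (illustrative)
x_test, sigma = 0.3, 0.2                     # fixed test point, noise std

def predictions(degree, n_trials=500, n_train=30):
    """Fit a polynomial on many fresh training sets; predict at x_test."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    return preds

for degree in (1, 4, 10):
    p = predictions(degree)
    bias2 = (p.mean() - f(x_test)) ** 2      # (avg prediction - truth)²
    var = p.var()                            # scatter across training sets
    print(f"degree={degree:2d}  bias²={bias2:.4f}  variance={var:.4f}")
```

Degree 1 shows high bias and low variance; degree 10 the reverse — the table's two failure modes in miniature.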
More data reduces variance, not bias
Collecting more training data reduces variance without changing bias, which makes it the safest way to improve generalization. It will not fix underfitting, though: a high-bias model stays wrong no matter how much data you add. Increasing model complexity should come last.
## Diagnosing overfitting with learning curves
The clearest signal of overfitting is a growing gap between training and validation loss. Learning curves plot loss vs training examples (or epochs) and reveal both overfitting and underfitting:
Plotting learning curves to diagnose overfitting vs underfitting
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
model = GradientBoostingClassifier(n_estimators=200, max_depth=5)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='accuracy'
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean - val_mean

# Diagnosis:
#   If gap is large and growing: overfitting → regularize or add data
#   If both are low and converging: underfitting → more complex model
#   If gap is small and both are high: ideal
for size, tr, va, g in zip(train_sizes, train_mean, val_mean, gap):
    print(f"n={size:5.0f}  train={tr:.3f}  val={va:.3f}  gap={g:.3f}")
```

### Epoch-based learning curves (deep learning)
In neural network training, plot train loss and val loss per epoch. If val loss starts rising while train loss keeps falling — that's the overfitting point. Implement early stopping: save the checkpoint at the lowest val loss, not the last epoch.
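The checkpoint-at-best-val-loss logic needs only a patience counter and a best-so-far record. A framework-agnostic sketch — the `patience` value and the fake loss sequence below are illustrative, not from the text:

```python
class EarlyStopping:
    """Stop when val loss hasn't improved for `patience` consecutive epochs."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record this epoch's val loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss      # new best: save the checkpoint here
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]  # starts overfitting
for epoch, vl in enumerate(val_losses):
    if stopper.step(epoch, vl):
        break
print(f"stopped at epoch {epoch}, restore checkpoint from epoch {stopper.best_epoch}")
# → stopped at epoch 6, restore checkpoint from epoch 3
```

Note that the checkpoint restored is the one from the lowest val loss (epoch 3 here), not the epoch where training halted.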
## Regularization techniques
Regularization adds a penalty term to the loss function that discourages overly complex models. The regularized objective becomes:
Regularized loss: L_reg(w) = L(w) + λ·Ω(w), where Ω(w) = Σᵢ|wᵢ| for L1 or Σᵢwᵢ² for L2. λ (lambda) controls the strength of regularization: higher λ = more penalty on complexity.
L1 (Lasso): drives many weights to exactly zero — automatic feature selection, sparse models. L2 (Ridge): shrinks all weights toward zero, especially effective against multicollinearity.
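The sparsity difference is easy to see on synthetic data where only a few features matter. A quick sketch with scikit-learn — the feature counts and alpha values are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, but only 5 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
print(f"Lasso zeroed {lasso_zeros}/20 coefficients")   # many exact zeros
print(f"Ridge zeroed {ridge_zeros}/20 coefficients")   # typically none
```

Lasso recovers a sparse model by zeroing most of the uninformative coefficients; Ridge shrinks them but leaves them nonzero.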
L1 and L2 regularization in sklearn and PyTorch
```python
# ── sklearn: L1 and L2 in logistic regression ──────────────
from sklearn.linear_model import LogisticRegression

ridge_clf = LogisticRegression(penalty='l2', C=1.0)               # C = 1/lambda
lasso_clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
enet_clf = LogisticRegression(penalty='elasticnet', C=0.5,
                              l1_ratio=0.5, solver='saga')

# ── PyTorch: weight decay = L2 (implemented in the optimizer) ──
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-4   # AdamW decouples weight decay from the gradient update
)

# For L1 regularization in PyTorch (add the penalty manually):
def l1_penalty(model, lambda_l1=1e-4):
    return lambda_l1 * sum(p.abs().sum() for p in model.parameters())

# loss = criterion(output, target) + l1_penalty(model)
```

| Technique | What it does | Best for |
|---|---|---|
| L2 (Ridge / weight decay) | Shrinks all weights toward zero | Default — neural networks, logistic regression |
| L1 (Lasso) | Zeros out many weights exactly | Feature selection, sparse linear models |
| Elastic Net | L1 + L2 combined | When both sparsity and stability are needed |
| Dropout | Randomly zeroes neurons per step | Neural networks — equivalent to implicit ensemble |
| Early stopping | Halts training at best val loss | Neural networks — most practical regularizer |
| Data augmentation | Adds synthetic training examples | Images, audio, text — very effective |
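Dropout, as listed in the table, is only a few lines: during training, zero each activation with probability p and scale the survivors by 1/(1-p) so the expected activation is unchanged (the "inverted dropout" formulation); at test time it is the identity. A NumPy sketch — p and the array shape are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    if not training:
        return x                       # test time: identity
    mask = rng.random(x.shape) >= p    # keep each unit with prob 1-p
    return x * mask / (1.0 - p)        # rescale so E[output] == x

activations = np.ones((4, 8))
out = dropout(activations, p=0.5)
print(out)  # roughly half the entries are 0.0, the rest are 2.0
```

Because each step samples a fresh mask, training effectively averages over an exponential number of thinned sub-networks — the "implicit ensemble" in the table.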
## Data augmentation to fight overfitting
Data augmentation artificially expands the training set by applying label-preserving transformations — teaching the model that certain variations don't change the underlying category:
Image augmentation with torchvision transforms
```python
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Standard augmentation pipeline for image classification
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # random spatial crop
    transforms.RandomHorizontalFlip(p=0.5),      # 50% chance of flip
    transforms.ColorJitter(brightness=0.2,
                           contrast=0.2,
                           saturation=0.2),      # color perturbation
    transforms.RandomRotation(15),               # rotate ±15°
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),   # ImageNet stats
])

# At test time — no augmentation, only normalize
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = CIFAR10(root='./data', train=True, transform=train_transform)
test_dataset = CIFAR10(root='./data', train=False, transform=test_transform)
```

### Mixup and CutMix
Advanced augmentation: Mixup creates training examples by linearly interpolating two images and their labels. CutMix pastes a patch from one image onto another. Both consistently reduce overfitting and improve calibration on ImageNet benchmarks.
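Mixup itself is two lines of arithmetic: draw λ ~ Beta(α, α) and form convex combinations of both the inputs and the one-hot labels. A NumPy sketch — α and the toy shapes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2     # interpolated input
    y = lam * y1 + (1 - lam) * y2     # soft label: [lam, 1 - lam] here
    return x, y, lam

img_a, img_b = rng.random((32, 32, 3)), rng.random((32, 32, 3))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot: cat, dog

x, y, lam = mixup(img_a, lab_a, img_b, lab_b)
print(f"lam={lam:.3f}, mixed label={y}")
```

Training on these soft targets penalizes overconfident predictions between classes, which is where the calibration improvement comes from.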
## Overfitting in LLMs: memorization
LLMs face a unique form of overfitting: verbatim memorization of training examples. Carlini et al. (2021) showed that GPT-2 could reproduce exact passages from its training set when given a prefix:
Detecting memorization: membership inference test (simplified)
```python
# Researchers test memorization by measuring the log-probability the model
# assigns to known training examples vs. held-out examples.
# (Assumes `model` and `tokenizer` are a loaded causal LM, e.g. from Hugging Face.)
import torch

def compute_log_prob(model, tokenizer, text: str) -> float:
    """Higher log-prob → model has likely seen this text during training."""
    tokens = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        output = model(tokens, labels=tokens)
    # output.loss is mean cross-entropy per token, so this is the
    # average log-probability per token
    return -output.loss.item()

training_example = "This specific sentence appeared in the training corpus verbatim..."
new_example = "This sentence was never seen during training..."

lp_train = compute_log_prob(model, tokenizer, training_example)
lp_new = compute_log_prob(model, tokenizer, new_example)
print(f"Training example log-prob: {lp_train:.3f}")   # expected: much higher
print(f"New example log-prob:      {lp_new:.3f}")     # expected: lower
```

| Memorization risk factor | Mitigation |
|---|---|
| Exact duplicates in training data | Deduplicate training corpus before training |
| Small dataset with many epochs | Limit epochs, use early stopping |
| Sensitive data in pretraining | Differential privacy (DP-SGD), data audits |
| Prompts that trigger memorized sequences | Post-training safety filtering, RLHF |
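The first mitigation in the table, deduplication, can be approximated at small scale by hashing normalized text. A sketch — the normalization rule and sample corpus are illustrative; production pipelines typically use fuzzy matching such as MinHash over n-grams to also catch near-duplicates:

```python
import hashlib

def dedup(corpus):
    """Drop exact duplicates, keyed on a hash of whitespace/case-normalized text."""
    seen, kept = set(), []
    for doc in corpus:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick   brown fox.",       # duplicate after normalization
    "An entirely different document.",
]
print(dedup(corpus))  # keeps the first fox sentence and the different document
```

Removing duplicates before training directly attacks the strongest known driver of verbatim memorization: sequences the model saw many times.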
## Practice questions
- What is the mathematical relationship between model complexity and generalisation error? (Answer: Bias-variance decomposition: Expected_Error = Bias² + Variance + σ². Bias decreases with complexity (model fits training data better). Variance increases with complexity (model is sensitive to training data specifics). Optimal complexity minimises total expected error at the bias-variance trade-off point. Regularisation (L1, L2, dropout) constrains complexity, reducing variance at the cost of slightly higher bias — usually improving generalisation.)
- What is the difference between L1 and L2 regularisation in terms of the solutions they produce? (Answer: L2 (Ridge): adds λΣwᵢ² to loss — penalises large weights, keeps all features but shrinks coefficients toward zero. Closed-form solution. Encourages smooth, distributed representations. L1 (Lasso): adds λΣ|wᵢ| — produces sparse solutions (many exactly-zero weights). Acts as feature selection. No closed-form solution (non-differentiable at 0). Use L1 when: few features are expected to be relevant. Use L2 when: all features may contribute. Elastic Net combines both.)
- What is data augmentation and why is it especially effective for vision models? (Answer: Data augmentation: generate new training examples by applying label-preserving transformations to existing data — crops, flips, rotations, colour jitter, cutout, mixup, CutMix for images; back-translation, synonym substitution for text. Effective for vision because: (1) Image invariances (a cat rotated 30° is still a cat) provide free regularisation. (2) Dramatically increases effective dataset size. (3) Teaches invariances that the model should exhibit. ImageNet models trained with aggressive augmentation (RandAugment) achieve significantly better generalisation.)
- What is the difference between overfitting and underfitting in terms of the training curve? (Answer: Underfitting: high training loss AND high validation loss — model hasn't learned the data patterns. Training curve flat and above optimal performance. Overfitting: low training loss but high validation loss — model memorised training data. Training curve improves while validation curve plateaus then worsens. Correct fitting: both curves improve together and converge to similar values. Early stopping uses the validation curve to catch the moment validation loss starts increasing — stopping training at the optimal generalisation point.)
- What is double descent in modern deep learning and why does it challenge the traditional bias-variance trade-off? (Answer: Classical view: model complexity → U-shaped test error (underfitting → optimal → overfitting). Double descent (Belkin et al. 2019): test error decreases, then increases (classical overfitting), then decreases AGAIN as model becomes massively overparameterised. The second descent occurs when the model has enough capacity to interpolate the training data AND still generalise. Neural networks and transformers with billions of parameters are in this 'benign overfitting' regime — they memorise training data yet generalise well, contradicting classical intuition.)