Regularisation is any technique that prevents overfitting by adding constraints or penalties to the learning process. L1 (Lasso) and L2 (Ridge) add parameter penalties to the loss function. Early stopping halts training when validation performance starts degrading — preventing the model from memorising training noise. Dropout randomly deactivates neurons during training, preventing co-adaptation and acting as an ensemble of thousands of sparse networks. Elastic Net combines L1 and L2. These techniques are critical for practical ML — a model that memorises training data is useless in production.
Early stopping — the simplest regulariser
During training, plot both training loss and validation loss per epoch. Training loss typically keeps decreasing. Validation loss initially decreases (the model learns useful patterns), then starts increasing (the model memorises training-specific noise). The point of minimum validation loss is the optimal stopping point — early stopping saves a model checkpoint there and halts training.
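The patience mechanic behind this rule can be sketched on a synthetic loss curve (illustrative numbers only — a minimum at epoch 50 and a patience of 20):

```python
# Synthetic validation-loss curve: falls until epoch 50, then rises
val_losses = [1.0 - 0.01 * e for e in range(51)] \
           + [0.5 + 0.005 * (e - 50) for e in range(51, 100)]

best_loss, best_epoch, counter, patience = float("inf"), 0, 0, 20
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:                    # new minimum: reset patience
        best_loss, best_epoch, counter = loss, epoch, 0
    else:                                   # no improvement: burn patience
        counter += 1
        if counter >= patience:
            break

print(best_epoch, epoch)  # → 50 70: minimum at 50, stop triggered 20 epochs later
```

Because the checkpoint from `best_epoch` is restored, the 20 extra epochs cost only compute, not model quality.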
Early stopping in PyTorch and sklearn
import torch
import torch.nn as nn
import numpy as np

# PyTorch manual early stopping
class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1
            print(f"  No improvement {self.counter}/{self.patience}")
            if self.counter >= self.patience:
                if self.restore_best and self.best_weights:
                    model.load_state_dict(self.best_weights)
                    print(f"Restored best weights (val_loss={self.best_loss:.4f})")
                return True  # Stop training
        return False  # Continue

# Training loop with early stopping
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
es = EarlyStopping(patience=15, restore_best=True)
for epoch in range(500):
    # train_loss = ... (your training step)
    # val_loss = ... (validation step)
    if es.step(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
# sklearn: early_stopping parameter (Neural Network, XGBoost)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    max_iter=500,
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # 10% of training data for validation
    n_iter_no_change=10,       # Stop after 10 iterations without improvement
    tol=1e-4,
    random_state=42
)
# XGBoost early stopping
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

# In XGBoost >= 2.0, early_stopping_rounds is a constructor parameter (it was
# removed as a fit() argument): stop if no improvement for 50 rounds
xgb_model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=50,
                              random_state=42, verbosity=0)
xgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)
print(f"Best iteration: {xgb_model.best_iteration}")
Dropout — training multiple sparse networks
Dropout randomly sets a fraction p of neuron outputs to zero during each forward pass in training. Each mini-batch trains a different sparse sub-network. At inference all neurons are active; to keep expected activations consistent, classic dropout scales outputs by (1-p) at inference, while modern inverted dropout (PyTorch's default) instead scales the surviving activations by 1/(1-p) during training. This prevents neurons from co-adapting (relying on specific other neurons) and forces each neuron to learn robust features independently. Effectively it trains an ensemble of up to 2^n sub-networks that share weights.
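A minimal sketch of this train/eval asymmetry using PyTorch's nn.Dropout (inverted dropout: survivors are scaled by 1/(1-p) during training, so eval needs no rescaling):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10000)

drop.train()               # training mode: ~50% of units zeroed,
out_train = drop(x)        # survivors scaled by 1/(1-p) = 2
print(out_train.unique())  # values are exactly 0.0 and 2.0
print(out_train.mean())    # ≈ 1.0 — the expectation is preserved

drop.eval()                # eval mode: identity, no scaling applied
out_eval = drop(x)
print(torch.equal(out_eval, x))  # → True
```

The 1/(1-p) training-time rescaling is what keeps the expected activation the same in both modes, so downstream layers see consistent magnitudes.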
Dropout in neural networks with PyTorch
import torch
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self, input_size, hidden_sizes, n_classes, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_size = input_size
        for h_size in hidden_sizes:
            layers += [
                nn.Linear(prev_size, h_size),
                nn.BatchNorm1d(h_size),      # Batch normalisation (also regularises)
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)   # Dropout after activation
            ]
            prev_size = h_size
        layers.append(nn.Linear(prev_size, n_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# CRITICAL: Dropout behaves differently during training vs inference
model = RegularisedNet(20, [256, 128, 64], 2, dropout_rate=0.3)
model.train()   # Training: dropout ACTIVE — randomly zeroes neurons, survivors scaled by 1/(1-p)
output_train = model(torch.randn(32, 20))
model.eval()    # Inference: dropout INACTIVE — all neurons active, no scaling (inverted dropout)
with torch.no_grad():
    output_eval = model(torch.randn(32, 20))

# Common dropout rates:
# 0.1-0.2: light regularisation (input layers, small models)
# 0.3-0.5: standard (hidden layers in larger models)
# 0.5: Hinton's original recommendation for hidden layers
# 0.2 on inputs (NLP tasks — 80% of words kept per step)

# Monte Carlo Dropout (MC Dropout): uncertainty estimation
# Keep dropout active at inference — run N forward passes, compute variance
X_test = torch.randn(16, 20)
model.train()   # Keep dropout active for MC Dropout
with torch.no_grad():
    mc_predictions = torch.stack([model(X_test) for _ in range(100)])
mean_pred = mc_predictions.mean(0)
uncertainty = mc_predictions.std(0)   # High std = uncertain prediction
| Regularisation | Mechanism | Effect | Best for |
|---|---|---|---|
| L1 (Lasso) | Adds λΣ\|βⱼ\| to loss | Sparse weights — zero out irrelevant features | Feature selection, sparse models |
| L2 (Ridge) | Adds λΣβⱼ² to loss | Shrinks weights toward zero | Correlated features, general regression |
| Elastic Net | Adds both L1 + L2 | Sparse + shrinkage | Many features, some correlated |
| Dropout | Randomly zero neurons during training | Prevents co-adaptation, implicit ensemble | Deep neural networks |
| Early Stopping | Stop at min validation loss | Prevents memorising training noise | Any iterative training |
| Batch Norm | Normalise activations per batch | Reduces internal covariate shift | Deep networks (faster training) |
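The L1-vs-L2 sparsity contrast in the table is easy to verify empirically. A sketch using sklearn's Lasso and Ridge on synthetic data (alpha=1.0 is an arbitrary choice):

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# 50 features, only 10 informative — the rest are pure noise
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: λΣβ²
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: λΣ|β|

# L1 zeroes out most of the noise features; L2 only shrinks them towards zero
print("Ridge zero coefficients:", (ridge.coef_ == 0).sum())
print("Lasso zero coefficients:", (lasso.coef_ == 0).sum())
```

Ridge leaves every coefficient non-zero (just small), while Lasso's soft-thresholding sets most noise coefficients to exactly zero — the geometric "diamond corners" argument made concrete.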
Elastic Net — combining L1 and L2
Elastic Net combines L1 sparsity with L2 stability. The mixing parameter l1_ratio = λ₁/(λ₁+λ₂): l1_ratio=1 gives pure Lasso, l1_ratio=0 gives pure Ridge. Use Elastic Net when you have many features, some of them correlated, and want both sparsity and stability; l1_ratio=0.5 is a sensible starting point.
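The grouping effect with correlated features can be seen in a tiny sketch (synthetic data with one feature duplicated; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X = np.column_stack([x1, x1])              # two perfectly correlated features
y = 3 * x1 + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Lasso arbitrarily keeps one of the pair and zeroes the other;
# Elastic Net's L2 component spreads the weight across both
print("Lasso:      ", lasso.coef_)
print("Elastic Net:", enet.coef_)
```

This is why Elastic Net is preferred over pure Lasso when correlated feature groups should be kept or dropped together.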
Elastic Net with cross-validated hyperparameter selection
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
import numpy as np

X, y, coef = make_regression(n_samples=200, n_features=100, n_informative=20,
                             noise=10, coef=True, random_state=42)
# 100 features, only 20 truly relevant

# ElasticNetCV: auto-select alpha and l1_ratio via cross-validation
enc = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],  # Grid of l1_ratio values
    alphas=np.logspace(-3, 1, 30),  # Grid of alpha (regularisation strength)
    cv=5,
    max_iter=10000
)
enc.fit(X, y)
print(f"Best alpha: {enc.alpha_:.4f}")
print(f"Best l1_ratio: {enc.l1_ratio_:.2f}")
print(f"Non-zero coefs: {(enc.coef_ != 0).sum()} / 100")  # Should be ~20
print(f"R² on training: {r2_score(y, enc.predict(X)):.3f}")
Practice questions
- At what epoch should you stop training, given that train_loss decreases every epoch while val_loss stops decreasing at epoch 50 and starts increasing at epoch 70? (Answer: Stop at epoch 50 — the point of minimum validation loss. From epoch 50 onwards the model overfits: val_loss no longer improves and then rises. Early stopping with patience=20 would trigger at epoch 70 and restore the weights from epoch 50.)
- Dropout rate p=0.5 during training. At inference time, what happens to the neuron outputs? (Answer: At inference, dropout is disabled (all neurons active). Outputs are NOT halved — either (1) inverted dropout: multiply by 1/(1-p)=2 during training (PyTorch default), or (2) scale by (1-p)=0.5 at inference. Both are equivalent; PyTorch uses inverted dropout.)
- Why does L1 regularisation create sparse models while L2 does not? (Answer: Geometric: L1 constraint region is a diamond (corners on axes) — loss contours touch corners where some β=0. L2 constraint is a sphere (smooth, no corners) — loss contours touch it at a non-zero point. Analytically: L1 gradient is ±λ (constant), which can zero out small coefficients. L2 gradient is 2λβ (shrinks toward zero but never reaches it for continuous optimisation).)
- When should you use Elastic Net instead of Lasso or Ridge? (Answer: Elastic Net when: (1) You have highly correlated features (Lasso arbitrarily picks one; Elastic Net keeps groups). (2) p >> n (more features than samples) — Lasso can only select min(n,p) features; Elastic Net can select more. (3) You want sparsity (some β=0) but also stability of Ridge.)
- Batch Normalisation also acts as regularisation. How? (Answer: BatchNorm normalises activations using mini-batch statistics, introducing noise because the mean and variance are computed on a small batch, not the full dataset. This noise acts as regularisation — similar to dropout. Also stabilises training by reducing internal covariate shift, allowing higher learning rates.)
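The batch-statistics noise in the BatchNorm answer can be observed directly: in train mode the same sample gets different outputs depending on its batch-mates, while eval mode (running statistics) is deterministic. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(1, 4)                        # one fixed sample
batch_a = torch.cat([x, torch.randn(7, 4)])  # same sample, different batch-mates
batch_b = torch.cat([x, torch.randn(7, 4)])

bn.train()                           # train mode: normalise with batch statistics
out_a = bn(batch_a)[0]
out_b = bn(batch_b)[0]
print(torch.allclose(out_a, out_b))  # → False: batch-dependent noise

bn.eval()                            # eval mode: fixed running statistics
out_1, out_2 = bn(x), bn(x)
print(torch.equal(out_1, out_2))     # → True: deterministic
```

That batch-to-batch jitter is exactly the dropout-like regularisation the answer describes.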
On LumiChats
LumiChats can identify which regularisation technique is most appropriate for your model architecture and dataset size, implement early stopping callbacks, tune dropout rates, and explain why your model is overfitting or underfitting based on training/validation curves.