Regularisation is any technique that prevents overfitting by adding constraints or penalties to the learning process. L1 (Lasso) and L2 (Ridge) add parameter penalties to the loss function. Early stopping halts training when validation performance starts degrading — preventing the model from memorising training noise. Dropout randomly deactivates neurons during training, preventing co-adaptation and acting as an ensemble of thousands of sparse networks. Elastic Net combines L1 and L2. These techniques are critical for practical ML — a model that memorises training data is useless in production.
Early stopping — the simplest regulariser
During training, plot both training loss and validation loss per epoch. Training loss typically keeps decreasing. Validation loss initially decreases (the model learns useful patterns), then starts increasing (the model memorises training-specific noise). The point of minimum validation loss is the optimal stopping point — early stopping saves a model checkpoint there and halts training.
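The patience mechanic behind this rule can be sketched on a synthetic loss curve (illustrative numbers only — a minimum at epoch 50 and a patience of 20):

```python
# Synthetic validation-loss curve: falls until epoch 50, then rises
val_losses = [1.0 - 0.01 * e for e in range(51)] \
           + [0.5 + 0.005 * (e - 50) for e in range(51, 100)]

best_loss, best_epoch, counter, patience = float("inf"), 0, 0, 20
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:                    # new minimum: reset patience
        best_loss, best_epoch, counter = loss, epoch, 0
    else:                                   # no improvement: burn patience
        counter += 1
        if counter >= patience:
            break

print(best_epoch, epoch)  # → 50 70: minimum at 50, stop triggered 20 epochs later
```

Because the checkpoint from `best_epoch` is restored, the 20 extra epochs cost only compute, not model quality.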
Early stopping in PyTorch and sklearn
import torch
import torch.nn as nn
import numpy as np

# PyTorch manual early stopping
class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.best_loss = float('inf')
        self.counter = 0
        self.best_weights = None

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1
            print(f"  No improvement {self.counter}/{self.patience}")
            if self.counter >= self.patience:
                if self.restore_best and self.best_weights:
                    model.load_state_dict(self.best_weights)
                    print(f"Restored best weights (val_loss={self.best_loss:.4f})")
                return True  # Stop training
        return False  # Continue

# Training loop with early stopping
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
es = EarlyStopping(patience=15, restore_best=True)
for epoch in range(500):
    # train_loss = ... (your training step)
    # val_loss = ... (validation step)
    if es.step(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
# sklearn: early_stopping parameter (Neural Network, XGBoost)
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    max_iter=500,
    early_stopping=True,       # Enable early stopping
    validation_fraction=0.1,   # 10% of training data for validation
    n_iter_no_change=10,       # Stop after 10 iterations without improvement
    tol=1e-4,
    random_state=42
)
# XGBoost early stopping
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

# In XGBoost >= 2.0, early_stopping_rounds is a constructor parameter (it was
# removed as a fit() argument): stop if no improvement for 50 rounds
xgb_model = xgb.XGBClassifier(n_estimators=1000, early_stopping_rounds=50,
                              random_state=42, verbosity=0)
xgb_model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              verbose=False)
print(f"Best iteration: {xgb_model.best_iteration}")
Dropout — training multiple sparse networks
Dropout randomly sets a fraction p of neuron outputs to zero during each forward pass in training. Each mini-batch trains a different sparse sub-network. At inference all neurons are active; to keep expected activations consistent, classic dropout scales outputs by (1-p) at inference, while modern inverted dropout (PyTorch's default) instead scales the surviving activations by 1/(1-p) during training. This prevents neurons from co-adapting (relying on specific other neurons) and forces each neuron to learn robust features independently. Effectively it trains an ensemble of up to 2^n sub-networks that share weights.
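A minimal sketch of this train/eval asymmetry using PyTorch's nn.Dropout (inverted dropout: survivors are scaled by 1/(1-p) during training, so eval needs no rescaling):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10000)

drop.train()               # training mode: ~50% of units zeroed,
out_train = drop(x)        # survivors scaled by 1/(1-p) = 2
print(out_train.unique())  # values are exactly 0.0 and 2.0
print(out_train.mean())    # ≈ 1.0 — the expectation is preserved

drop.eval()                # eval mode: identity, no scaling applied
out_eval = drop(x)
print(torch.equal(out_eval, x))  # → True
```

The 1/(1-p) training-time rescaling is what keeps the expected activation the same in both modes, so downstream layers see consistent magnitudes.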
Dropout in neural networks with PyTorch
import torch
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self, input_size, hidden_sizes, n_classes, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_size = input_size
        for h_size in hidden_sizes:
            layers += [
                nn.Linear(prev_size, h_size),
                nn.BatchNorm1d(h_size),      # Batch normalisation (also regularises)
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)   # Dropout after activation
            ]
            prev_size = h_size
        layers.append(nn.Linear(prev_size, n_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# CRITICAL: Dropout behaves differently during training vs inference
model = RegularisedNet(20, [256, 128, 64], 2, dropout_rate=0.3)
model.train()   # Training: dropout ACTIVE — randomly zeroes neurons, survivors scaled by 1/(1-p)
output_train = model(torch.randn(32, 20))
model.eval()    # Inference: dropout INACTIVE — all neurons active, no scaling (inverted dropout)
with torch.no_grad():
    output_eval = model(torch.randn(32, 20))

# Common dropout rates:
# 0.1-0.2: light regularisation (input layers, small models)
# 0.3-0.5: standard (hidden layers in larger models)
# 0.5: Hinton's original recommendation for hidden layers
# 0.2 on inputs (NLP tasks — 80% of words kept per step)

# Monte Carlo Dropout (MC Dropout): uncertainty estimation
# Keep dropout active at inference — run N forward passes, compute variance
X_test = torch.randn(16, 20)
model.train()   # Keep dropout active for MC Dropout
with torch.no_grad():
    mc_predictions = torch.stack([model(X_test) for _ in range(100)])
mean_pred = mc_predictions.mean(0)
uncertainty = mc_predictions.std(0)   # High std = uncertain prediction
| Regularisation | Mechanism | Effect | Best for |
|---|---|---|---|
| L1 (Lasso) | Adds λΣ\|βⱼ\| to loss | Sparse weights — zero out irrelevant features | Feature selection, sparse models |
| L2 (Ridge) | Adds λΣβⱼ² to loss | Shrinks weights toward zero | Correlated features, general regression |
| Elastic Net | Adds both L1 + L2 | Sparse + shrinkage | Many features, some correlated |
| Dropout | Randomly zero neurons during training | Prevents co-adaptation, implicit ensemble | Deep neural networks |
| Early Stopping | Stop at min validation loss | Prevents memorising training noise | Any iterative training |
| Batch Norm | Normalise activations per batch | Reduces internal covariate shift | Deep networks (faster training) |
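The L1-vs-L2 sparsity contrast in the table is easy to verify empirically. A sketch using sklearn's Lasso and Ridge on synthetic data (alpha=1.0 is an arbitrary choice):

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# 50 features, only 10 informative — the rest are pure noise
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: λΣβ²
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: λΣ|β|

# L1 zeroes out most of the noise features; L2 only shrinks them towards zero
print("Ridge zero coefficients:", (ridge.coef_ == 0).sum())
print("Lasso zero coefficients:", (lasso.coef_ == 0).sum())
```

Ridge leaves every coefficient non-zero (just small), while Lasso's soft-thresholding sets most noise coefficients to exactly zero — the geometric "diamond corners" argument made concrete.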
Elastic Net — combining L1 and L2
Elastic Net combines L1 sparsity with L2 stability. The mixing parameter l1_ratio = λ₁/(λ₁+λ₂): l1_ratio=1 gives pure Lasso, l1_ratio=0 gives pure Ridge. Use Elastic Net when you have many features, some of them correlated, and want both sparsity and stability; l1_ratio=0.5 is a sensible starting point.
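The grouping effect with correlated features can be seen in a tiny sketch (synthetic data with one feature duplicated; the alpha values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
X = np.column_stack([x1, x1])              # two perfectly correlated features
y = 3 * x1 + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Lasso arbitrarily keeps one of the pair and zeroes the other;
# Elastic Net's L2 component spreads the weight across both
print("Lasso:      ", lasso.coef_)
print("Elastic Net:", enet.coef_)
```

This is why Elastic Net is preferred over pure Lasso when correlated feature groups should be kept or dropped together.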
Elastic Net with cross-validated hyperparameter selection
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
import numpy as np

X, y, coef = make_regression(n_samples=200, n_features=100, n_informative=20,
                             noise=10, coef=True, random_state=42)
# 100 features, only 20 truly relevant

# ElasticNetCV: auto-select alpha and l1_ratio via cross-validation
enc = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],  # Grid of l1_ratio values
    alphas=np.logspace(-3, 1, 30),  # Grid of alpha (regularisation strength)
    cv=5,
    max_iter=10000
)
enc.fit(X, y)
print(f"Best alpha: {enc.alpha_:.4f}")
print(f"Best l1_ratio: {enc.l1_ratio_:.2f}")
print(f"Non-zero coefs: {(enc.coef_ != 0).sum()} / 100")  # Should be ~20
print(f"R² on training: {r2_score(y, enc.predict(X)):.3f}")
Practice questions
- At what epoch should you stop training, given that train_loss decreases every epoch while val_loss stops decreasing at epoch 50 and starts increasing at epoch 70? (Answer: Stop at epoch 50 — the point of minimum validation loss. From epoch 50 onwards the model overfits: val_loss no longer improves and then rises. Early stopping with patience=20 would trigger at epoch 70 and restore the weights from epoch 50.)
- Dropout rate p=0.5 during training. At inference time, what happens to the neuron outputs? (Answer: At inference, dropout is disabled (all neurons active). Outputs are NOT halved — either (1) inverted dropout: multiply by 1/(1-p)=2 during training (PyTorch default), or (2) scale by (1-p)=0.5 at inference. Both are equivalent; PyTorch uses inverted dropout.)
- Why does L1 regularisation create sparse models while L2 does not? (Answer: Geometric: L1 constraint region is a diamond (corners on axes) — loss contours touch corners where some β=0. L2 constraint is a sphere (smooth, no corners) — loss contours touch it at a non-zero point. Analytically: L1 gradient is ±λ (constant), which can zero out small coefficients. L2 gradient is 2λβ (shrinks toward zero but never reaches it for continuous optimisation).)
- When should you use Elastic Net instead of Lasso or Ridge? (Answer: Elastic Net when: (1) You have highly correlated features (Lasso arbitrarily picks one; Elastic Net keeps groups). (2) p >> n (more features than samples) — Lasso can only select min(n,p) features; Elastic Net can select more. (3) You want sparsity (some β=0) but also stability of Ridge.)
- Batch Normalisation also acts as regularisation. How? (Answer: BatchNorm normalises activations using mini-batch statistics, introducing noise because the mean and variance are computed on a small batch, not the full dataset. This noise acts as regularisation — similar to dropout. Also stabilises training by reducing internal covariate shift, allowing higher learning rates.)
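The batch-statistics noise in the BatchNorm answer can be observed directly: in train mode the same sample gets different outputs depending on its batch-mates, while eval mode (running statistics) is deterministic. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)

x = torch.randn(1, 4)                        # one fixed sample
batch_a = torch.cat([x, torch.randn(7, 4)])  # same sample, different batch-mates
batch_b = torch.cat([x, torch.randn(7, 4)])

bn.train()                           # train mode: normalise with batch statistics
out_a = bn(batch_a)[0]
out_b = bn(batch_b)[0]
print(torch.allclose(out_a, out_b))  # → False: batch-dependent noise

bn.eval()                            # eval mode: fixed running statistics
out_1, out_2 = bn(x), bn(x)
print(torch.equal(out_1, out_2))     # → True: deterministic
```

That batch-to-batch jitter is exactly the dropout-like regularisation the answer describes.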
On LumiChats
LumiChats can identify which regularisation technique is most appropriate for your model architecture and dataset size, implement early stopping callbacks, tune dropout rates, and explain why your model is overfitting or underfitting based on training/validation curves.