
Early Stopping, Dropout & Regularisation Concepts

Preventing models from memorising training data — the key to generalisation.


Definition

Regularisation is any technique that prevents overfitting by adding constraints or penalties to the learning process. L1 (Lasso) and L2 (Ridge) add parameter penalties to the loss function. Early stopping halts training when validation performance starts degrading — preventing the model from memorising training noise. Dropout randomly deactivates neurons during training, preventing co-adaptation and acting as an ensemble of thousands of sparse networks. Elastic Net combines L1 and L2. These techniques are critical for practical ML — a model that memorises training data is useless in production.
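As a minimal sketch (the model, data, and λ value here are illustrative), the L1 and L2 penalties can be written directly against a PyTorch loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
mse = nn.MSELoss()
lam = 1e-2                      # Regularisation strength λ (illustrative value)

X, y = torch.randn(32, 10), torch.randn(32, 1)
base = mse(model(X), y)
l1 = sum(p.abs().sum() for p in model.parameters())    # Σ|βⱼ|
l2 = sum(p.pow(2).sum() for p in model.parameters())   # Σβⱼ²
loss_lasso = base + lam * l1                           # L1-penalised loss
loss_ridge = base + lam * l2                           # L2-penalised loss
loss_enet  = base + lam * (0.5 * l1 + 0.5 * l2)        # Elastic Net mixes both
```

In practice L2 is usually applied via the optimiser's weight_decay argument rather than by hand; writing it out makes the penalty term explicit.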

Early stopping — the simplest regulariser

During training, plot both training loss and validation loss per epoch. Training loss typically keeps decreasing. Validation loss initially decreases (the model learns useful patterns) and then starts increasing (the model memorises training-specific noise). The point of minimum validation loss is the optimal stopping point — early stopping saves a model checkpoint there and halts training.
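With the loss histories recorded, the optimal stopping point is simply the argmin of the validation curve. A minimal sketch with made-up loss curves:

```python
import numpy as np

# Illustrative curves: training loss keeps falling, validation loss
# bottoms out and then rises as the model starts overfitting.
epochs     = np.arange(100)
train_loss = np.exp(-epochs / 30)
val_loss   = np.exp(-epochs / 30) + 0.0005 * np.maximum(epochs - 50, 0) ** 1.5

best_epoch = int(np.argmin(val_loss))   # Optimal stopping point
print(best_epoch)
```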

Early stopping in PyTorch and sklearn

import torch
import torch.nn as nn
import numpy as np

# PyTorch manual early stopping
class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
        self.patience      = patience
        self.min_delta     = min_delta
        self.restore_best  = restore_best
        self.best_loss     = float('inf')
        self.counter       = 0
        self.best_weights  = None

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss    = val_loss
            self.counter      = 0
            if self.restore_best:
                self.best_weights = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1
            print(f"  No improvement {self.counter}/{self.patience}")

        if self.counter >= self.patience:
        if self.restore_best and self.best_weights is not None:
                model.load_state_dict(self.best_weights)
                print(f"Restored best weights (val_loss={self.best_loss:.4f})")
            return True   # Stop training
        return False       # Continue

# Training loop with early stopping (synthetic regression data for illustration)
torch.manual_seed(0)
X = torch.randn(256, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(256, 1)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
es = EarlyStopping(patience=15, restore_best=True)

for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_tr), y_tr).backward()          # Training step
    optimizer.step()
    model.eval()
    with torch.no_grad():                          # Validation step
        val_loss = loss_fn(model(X_va), y_va).item()
    if es.step(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break

# sklearn: built-in early stopping via the early_stopping parameter (e.g. MLPClassifier)
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(
    hidden_layer_sizes=(100,),
    max_iter=500,
    early_stopping=True,        # Enable early stopping
    validation_fraction=0.1,    # 10% of training data for validation
    n_iter_no_change=10,        # Stop after 10 iterations without improvement
    tol=1e-4,
    random_state=42
)

# XGBoost early stopping (in XGBoost >= 2.0, early_stopping_rounds is a constructor argument)
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
xgb_model = xgb.XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50,    # Stop if no improvement for 50 rounds
    random_state=42,
    verbosity=0
)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best iteration: {xgb_model.best_iteration}")

Dropout — training multiple sparse networks

Dropout randomly sets a fraction p of neuron outputs to zero during each forward pass in training, so each mini-batch trains a different sparse sub-network. At inference all neurons are active; in the original formulation their outputs are scaled by (1-p), while modern frameworks (including PyTorch) use inverted dropout, scaling by 1/(1-p) during training so that inference needs no adjustment. This prevents neurons from co-adapting (relying on specific other neurons) and forces each neuron to learn robust features independently — effectively training an ensemble of up to 2^n sparse sub-networks.
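The mechanics can be sketched by hand (a minimal illustration of inverted dropout, not PyTorch's internal implementation; the function name is made up):

```python
import torch

def inverted_dropout(x, p=0.5, training=True):
    # Training: zero each element with probability p, then scale the
    # survivors by 1/(1-p) so the expected activation is unchanged.
    if not training or p == 0.0:
        return x                  # Inference: identity — no scaling needed
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)

x = torch.ones(1000)
out = inverted_dropout(x, p=0.5)
print(out.mean())   # ≈ 1.0 in expectation: roughly half the units are 0, half are 2
```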

Dropout in neural networks with PyTorch

import torch
import torch.nn as nn

class RegularisedNet(nn.Module):
    def __init__(self, input_size, hidden_sizes, n_classes, dropout_rate=0.5):
        super().__init__()
        layers = []
        prev_size = input_size
        for h_size in hidden_sizes:
            layers += [
                nn.Linear(prev_size, h_size),
                nn.BatchNorm1d(h_size),          # Batch normalisation (also regularises)
                nn.ReLU(),
                nn.Dropout(p=dropout_rate)        # Dropout after activation
            ]
            prev_size = h_size
        layers.append(nn.Linear(prev_size, n_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# CRITICAL: Dropout behaves differently during training vs inference
model = RegularisedNet(20, [256, 128, 64], 2, dropout_rate=0.3)

model.train()    # Training: dropout ACTIVE — randomly zeroes neurons
output_train = model(torch.randn(32, 20))

model.eval()     # Inference: dropout INACTIVE — all neurons used, no scaling (PyTorch rescales by 1/(1-p) during training instead)
with torch.no_grad():
    output_eval = model(torch.randn(32, 20))

# Common dropout rates (p = probability of DROPPING a unit):
# 0.1-0.2: light regularisation (small models, convolutional layers)
# 0.2:     input layers (the original paper kept ~80% of input units)
# 0.3-0.5: standard for hidden layers in larger models
# 0.5:     Hinton's original recommendation for hidden layers

# Monte Carlo Dropout (MC Dropout): uncertainty estimation
# Keep dropout active at inference — run N forward passes, compute variance
model.eval()
for m in model.modules():            # Re-enable ONLY the dropout layers
    if isinstance(m, nn.Dropout):    # (BatchNorm must stay in eval mode)
        m.train()
X_test = torch.randn(32, 20)
with torch.no_grad():
    mc_predictions = torch.stack([model(X_test) for _ in range(100)])
mean_pred   = mc_predictions.mean(0)
uncertainty = mc_predictions.std(0)   # High std = uncertain prediction

Comparing regularisation techniques:

- L1 (Lasso): adds λΣ|βⱼ| to the loss. Effect: sparse weights — zeroes out irrelevant features. Best for feature selection and sparse models.
- L2 (Ridge): adds λΣβⱼ² to the loss. Effect: shrinks weights toward zero. Best for correlated features and general regression.
- Elastic Net: adds both the L1 and L2 penalties. Effect: sparsity plus shrinkage. Best when there are many features, some correlated.
- Dropout: randomly zeroes neurons during training. Effect: prevents co-adaptation, implicit ensemble. Best for deep neural networks.
- Early stopping: stops at the minimum validation loss. Effect: prevents memorising training noise. Works with any iterative training.
- Batch Norm: normalises activations per mini-batch. Effect: reduces internal covariate shift. Best for deep networks (faster training).
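The sparsity contrast between L1 and L2 can be checked empirically. A minimal sketch (the dataset and alpha value are illustrative): fit Lasso and Ridge on the same data and count exactly-zero coefficients.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print((lasso.coef_ == 0).sum())   # Many coefficients driven exactly to zero
print((ridge.coef_ == 0).sum())   # Typically none exactly zero — only shrunk
```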

Elastic Net — combining L1 and L2

Elastic Net combines L1 sparsity with L2 stability. The mixing weight is the L1 ratio = λ₁/(λ₁+λ₂): l1_ratio=1 gives pure Lasso, l1_ratio=0 pure Ridge. Use Elastic Net when you have many features, some of them correlated, and want both sparsity and stability; l1_ratio=0.5 is a reasonable starting point.

Elastic Net with cross-validated hyperparameter selection

from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.datasets import make_regression
import numpy as np

X, y, coef = make_regression(n_samples=200, n_features=100, n_informative=20,
                              noise=10, coef=True, random_state=42)
# 100 features, only 20 truly relevant

# ElasticNetCV: auto-select alpha and l1_ratio via cross-validation
enc = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],  # Grid of l1_ratio values
    alphas=np.logspace(-3, 1, 30),  # Grid of alpha (regularisation strength)
    cv=5,
    max_iter=10000
)
enc.fit(X, y)

print(f"Best alpha:     {enc.alpha_:.4f}")
print(f"Best l1_ratio:  {enc.l1_ratio_:.2f}")
print(f"Non-zero coefs: {(enc.coef_ != 0).sum()} / 100")   # Should be ~20

from sklearn.metrics import r2_score
print(f"R² on training: {r2_score(y, enc.predict(X)):.3f}")   # Optimistic — report held-out R² in practice

Practice questions

  1. At what epoch should you stop training if train_loss decreases every epoch, val_loss stops decreasing at epoch 50, and starts increasing at epoch 70? (Answer: Stop at epoch 50 — the point of minimum validation loss. From epoch 50 to 70 the model is overfitting (val_loss increasing). Early stopping with patience=20 would trigger at epoch 70 and restore the weights from epoch 50.)
  2. Dropout rate p=0.5 during training. At inference time, what happens to the neuron outputs? (Answer: At inference, dropout is disabled (all neurons active). Outputs are NOT halved — either (1) inverted dropout: multiply by 1/(1-p)=2 during training (PyTorch default), or (2) scale by (1-p)=0.5 at inference. Both are equivalent; PyTorch uses inverted dropout.)
  3. Why does L1 regularisation create sparse models while L2 does not? (Answer: Geometric: L1 constraint region is a diamond (corners on axes) — loss contours touch corners where some β=0. L2 constraint is a sphere (smooth, no corners) — loss contours touch it at a non-zero point. Analytically: L1 gradient is ±λ (constant), which can zero out small coefficients. L2 gradient is 2λβ (shrinks toward zero but never reaches it for continuous optimisation).)
  4. When should you use Elastic Net instead of Lasso or Ridge? (Answer: Elastic Net when: (1) You have highly correlated features (Lasso arbitrarily picks one; Elastic Net keeps groups). (2) p >> n (more features than samples) — Lasso can only select min(n,p) features; Elastic Net can select more. (3) You want sparsity (some β=0) but also stability of Ridge.)
  5. Batch Normalisation also acts as regularisation. How? (Answer: BatchNorm normalises activations using mini-batch statistics, introducing noise because the mean and variance are computed on a small batch, not the full dataset. This noise acts as regularisation — similar to dropout. Also stabilises training by reducing internal covariate shift, allowing higher learning rates.)
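
The batch-statistics noise described in the last answer can be observed directly (a minimal sketch): in train mode BatchNorm normalises with the current batch's mean and variance, in eval mode with its running estimates, so the same input produces different outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5   # Batch with non-zero mean and non-unit variance

bn.train()
out_train = bn(x)   # Normalised with THIS batch's mean/var (noisy, batch-dependent)

bn.eval()
out_eval = bn(x)    # Normalised with running estimates (deterministic)

print(torch.allclose(out_train, out_eval))   # False — the two modes differ
```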

On LumiChats

LumiChats can identify which regularisation technique is most appropriate for your model architecture and dataset size, implement early stopping callbacks, tune dropout rates, and explain why your model is overfitting or underfitting based on training/validation curves.
