Training dynamics describes how a model's performance evolves during training. Key factors: batch size (how many examples per gradient update), epochs (how many times to pass over the full dataset), and learning rate scheduling (how to adjust step size over time). The loss landscape is the high-dimensional surface the optimizer navigates — understanding its geometry (flat minima, sharp minima, saddle points) explains why some training configs generalise better than others. Monitoring loss curves and gradient norms is essential for diagnosing training problems early.
Batch size, epochs, and effective training time
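These three quantities interact arithmetically: batch size fixes the steps per epoch, epochs multiply that into total updates, and gradient accumulation multiplies the batch size into an effective batch. A quick sketch of the bookkeeping (the dataset size and hyperparameter values here are illustrative, not from a specific run):

```python
dataset_size = 10_000
batch_size = 32
accum_steps = 4
epochs = 100

# Full batches per epoch (the final partial batch is dropped here)
steps_per_epoch = dataset_size // batch_size                # 312
# One optimizer update per accum_steps micro-batches
optimizer_updates = (steps_per_epoch // accum_steps) * epochs
# Each update averages gradients over this many examples
effective_batch = batch_size * accum_steps                  # 128

print(steps_per_epoch, optimizer_updates, effective_batch)  # → 312 7800 128
```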
Training loop with monitoring, gradient accumulation, and scheduling
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter

# ── Key training hyperparameters ──
BATCH_SIZE = 32    # gradients computed from 32 examples per step
ACCUM_STEPS = 4    # accumulate for 4 steps → effective batch = 128
MAX_EPOCHS = 100
LR = 1e-3

# Synthetic regression data so the example is self-contained
X_all = torch.randn(1_000, 10)
y_all = X_all.sum(dim=1, keepdim=True) + 0.1 * torch.randn(1_000, 1)
train_loader = DataLoader(TensorDataset(X_all[:800], y_all[:800]),
                          batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
val_loader = DataLoader(TensorDataset(X_all[800:], y_all[800:]),
                        batch_size=BATCH_SIZE)

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.01)
loss_fn = nn.MSELoss()
writer = SummaryWriter("runs/training_dynamics")

# Learning rate schedule: linear warmup + cosine decay.
# The scheduler steps once per *optimizer update*, so count updates,
# not micro-batches.
updates_per_epoch = len(train_loader) // ACCUM_STEPS
total_updates = MAX_EPOCHS * updates_per_epoch
warmup_updates = total_updates // 10  # warm up over first 10% of updates

# Warmup: LR increases linearly from 1% of target to the target LR
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_updates)
# Decay: LR cosine-anneals from the target LR down to eta_min
cosine = CosineAnnealingLR(optimizer, T_max=total_updates - warmup_updates,
                           eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_updates])

train_losses, val_losses, grad_norms = [], [], []
for epoch in range(MAX_EPOCHS):
    model.train()
    epoch_loss = 0.0
    optimizer.zero_grad()  # also discards any leftover micro-batch gradients
    for step, (X, y) in enumerate(train_loader):
        # Forward pass
        pred = model(X)
        loss = loss_fn(pred, y) / ACCUM_STEPS  # scale loss for accumulation
        # Backward pass: gradients accumulate in .grad across micro-batches
        loss.backward()
        # Only update every ACCUM_STEPS micro-batches (gradient accumulation)
        if (step + 1) % ACCUM_STEPS == 0:
            # clip_grad_norm_ returns the total norm measured BEFORE clipping
            total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            grad_norms.append(total_norm.item())
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
        epoch_loss += loss.item() * ACCUM_STEPS  # undo the scaling for logging

    # Validation
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(X_v), y_v).item() for X_v, y_v in val_loader)
        val_loss /= len(val_loader)

    avg_train_loss = epoch_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    val_losses.append(val_loss)

    # Log to TensorBoard
    writer.add_scalars("Loss", {"train": avg_train_loss, "val": val_loss}, epoch)
    writer.add_scalar("LR", optimizer.param_groups[0]["lr"], epoch)
    writer.add_scalar("GradNorm", np.mean(grad_norms[-updates_per_epoch:]), epoch)

    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d}: train={avg_train_loss:.4f} val={val_loss:.4f} "
              f"lr={optimizer.param_groups[0]['lr']:.2e}")
```
```python
# ── Diagnosing from loss curves ──
gap = val_losses[-1] - train_losses[-1]
if train_losses[-1] > 0.3:
    print("Underfitting: train loss still high. Train longer or use a larger model.")
elif gap > 0.2:
    print("Overfitting: large train-val gap. Add regularisation or more data.")
else:
    print("Good fit: small train-val gap with low loss.")
```
Loss landscapes and generalisation
Flat vs sharp minima: the loss landscape contains valleys (minima) of different shapes. A sharp minimum is a narrow valley: tiny perturbations to the weights cause a large increase in loss, and models that land in sharp minima often overfit. A flat minimum is a wide valley: small weight changes barely change the loss, so the solution is robust and tends to generalise better. The gradient noise introduced by small-batch SGD naturally pushes the optimizer toward flatter minima.
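One way to make "sharpness" concrete is to probe it empirically: perturb the weights with small Gaussian noise and measure the average change in loss. A minimal sketch; the `sharpness_probe` helper, the noise scale, and the toy model/data are all illustrative assumptions, not a standard API:

```python
import copy
import torch
import torch.nn as nn

def sharpness_probe(model, loss_fn, X, y, sigma=0.01, n_trials=10):
    """Average loss change when every weight is perturbed by N(0, sigma^2)
    noise. Larger values suggest a sharper region of the loss landscape."""
    base_loss = loss_fn(model(X), y).item()
    changes = []
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)  # perturb a copy, keep the original intact
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(torch.randn_like(p) * sigma)
        changes.append(loss_fn(noisy(X), y).item() - base_loss)
    return sum(changes) / n_trials

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
print(sharpness_probe(model, nn.MSELoss(), X, y, sigma=0.01))
```

Comparing this number across two trained checkpoints (same data, same `sigma`) gives a rough, cheap proxy for which one sits in the flatter minimum.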
| Scenario | Diagnosis | Fix |
|---|---|---|
| Train↓ Val↓ (gap closing) | Both improving — normal training | Continue training, monitor for overfitting |
| Train↓ Val→ plateau (gap) | Overfitting begins | Regularise (dropout, weight decay), more data |
| Train↓ Val↑ (crossing) | Overfitting — past optimal | Early stopping. Save model from before crossing. |
| Train→ Val→ (both stuck) | Underfitting or too low LR | Increase LR, train longer, larger model |
| Train/Val oscillating wildly | LR too high | Reduce learning rate, add warmup |
| Loss → NaN | Exploding gradients or LR too high | Gradient clipping, reduce LR, check for inf in data |
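The "Train↓ Val↑" row is usually handled with early stopping. A minimal patience-based sketch (the `EarlyStopper` class, its parameter values, and the sample loss sequence are illustrative):

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss    # improvement: reset counter (checkpoint here)
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=3)
for epoch, v in enumerate([1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.3]):
    if stopper.step(v):
        print(f"Stopping at epoch {epoch}, best val loss {stopper.best:.2f}")
        break
# → Stopping at epoch 5, best val loss 0.70
```

Saving a checkpoint whenever `best` improves implements the table's advice to "save the model from before crossing".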
Practice questions
- You train for 100 epochs on a dataset of 10,000 examples with batch size 32. How many gradient updates total? (Answer: Steps per epoch = 10,000/32 = 312.5, i.e. 312 full batches per epoch (313 if the final partial batch is kept). Total updates = 312 × 100 = 31,200. Each update uses the average gradient over 32 examples.)
- Why do models in flat minima generalise better than models in sharp minima? (Answer: Flat minima correspond to parameter regions where the loss is insensitive to small perturbations. When the model is deployed on slightly different data (test set), the parameter perturbations it encounters do not cause large loss increases. Sharp minima are sensitive — small distribution shifts cause catastrophic performance drops.)
- gradient_accumulation_steps=4 with batch_size=8. What is the effective batch size? (Answer: Effective batch = 8 × 4 = 32. Gradients are computed for 8 examples per step, accumulated (summed) over 4 steps, then parameters are updated with the accumulated gradient — mathematically equivalent to computing the gradient over 32 examples at once.)
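The equivalence claimed in that answer can be checked numerically: with a mean-reduced loss scaled by `1/accum_steps`, summing gradients over four micro-batches of 8 reproduces the gradient of the loss over all 32 examples at once. A sketch with a toy linear model (names are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(32, 10)
y = torch.randn(32, 1)
loss_fn = nn.MSELoss()
model = nn.Linear(10, 1)

# Full-batch gradient over all 32 examples at once
model.zero_grad()
loss_fn(model(X), y).backward()
full = [p.grad.clone() for p in model.parameters()]

# Accumulated gradient: 4 micro-batches of 8, each loss scaled by 1/4
model.zero_grad()
for i in range(4):
    Xi, yi = X[i*8:(i+1)*8], y[i*8:(i+1)*8]
    (loss_fn(model(Xi), yi) / 4).backward()  # grads sum into .grad
accum = [p.grad.clone() for p in model.parameters()]

for g1, g2 in zip(full, accum):
    assert torch.allclose(g1, g2, atol=1e-6)
print("accumulated gradient matches full-batch gradient")
```

Note the scaling matters: `MSELoss` averages over the micro-batch of 8, so dividing by 4 turns each term into a sum over 1/32 of the data, and the four accumulated terms add up to the full-batch mean.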
- Loss is 0.001 on training set but 2.5 on validation set. What is the issue and what are three fixes? (Answer: Severe overfitting. Fixes: (1) Reduce model complexity (fewer layers/neurons). (2) Add regularisation (dropout, L2 weight decay). (3) Get more training data or use augmentation. (4) Early stopping — save model from earlier epoch when val loss was lower. (5) Use LoRA for LLMs — fewer trainable params = less overfitting.)
- What does a warmup schedule do and why is it used for LLM fine-tuning? (Answer: Warmup linearly increases LR from near-0 to target LR over the first 5-10% of training steps. At the start of fine-tuning, weights are not calibrated for the new task — large LR immediately would cause destructive updates. Warmup lets the model stabilise before taking large steps. Especially important for large LLMs where the pretrained weights contain valuable knowledge that must be preserved.)
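The warmup-then-decay shape described in that answer can be traced directly with PyTorch's scheduler classes. A minimal sketch; the step counts and learning-rate values are illustrative:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

total, warmup_steps = 1000, 100  # warm up over the first 10% of steps
sched = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=0.01, total_iters=warmup_steps),
        CosineAnnealingLR(opt, T_max=total - warmup_steps, eta_min=1e-6),
    ],
    milestones=[warmup_steps],
)

lrs = []
for _ in range(total):
    lrs.append(opt.param_groups[0]["lr"])
    opt.step()
    sched.step()

# LR rises during warmup, peaks near the target, then decays toward eta_min
assert lrs[0] < lrs[50] < lrs[99]
assert lrs[-1] < lrs[warmup_steps]
print(f"start={lrs[0]:.2e} peak={max(lrs):.2e} end={lrs[-1]:.2e}")
```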
On LumiChats
Every LLM fine-tuning run (including the Unsloth notebooks) needs careful monitoring of loss curves. LumiChats can help you diagnose training problems: paste your training logs and ask 'Why is my validation loss increasing?' or 'Is my model overfitting?' to get specific, actionable fixes.
Try it free