
Supervised Learning

Learning from labeled examples.


Definition

Supervised learning is the most common ML paradigm, where a model learns from a dataset of input-output pairs (labeled examples). The model learns to map inputs to outputs by minimizing the difference between its predictions and the correct labels. Examples: classifying emails as spam/not-spam, predicting house prices from features, recognizing handwritten digits.

Classification vs regression

Supervised learning covers two fundamental task types, distinguished by the output type:

| Task | Output type | Loss function | Examples |
|---|---|---|---|
| Binary classification | One of two classes (0 or 1) | Binary cross-entropy | Spam detection, fraud detection |
| Multi-class classification | One of K classes | Categorical cross-entropy | Image recognition (1000 classes), digit recognition |
| Multi-label classification | Any subset of K classes | Binary cross-entropy per label | Emotion detection, topic tagging |
| Regression | Continuous numerical value | MSE or MAE | House price, stock prediction, age estimation |

Categorical cross-entropy loss for multi-class classification: L = -Σ_c y_c log(p_c), where y_c = 1 for the correct class and 0 otherwise, and p_c is the predicted probability for class c. Minimizing this maximizes the predicted probability of the correct class.

Mean Squared Error for regression: MSE = (1/n) Σ_i (ŷ_i - y_i)². It penalizes large errors quadratically: a 2× larger error gives 4× the penalty. Use MAE = (1/n) Σ_i |ŷ_i - y_i| instead when outliers should have less influence.
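As an illustrative sketch (the numbers below are made up for the example), both losses can be computed directly with NumPy:

```python
import numpy as np

# Categorical cross-entropy for one example:
# true class is index 2, model predicts probabilities over 4 classes
y_true = np.array([0, 0, 1, 0])          # one-hot label, y_c = 1 for correct class
p_pred = np.array([0.1, 0.2, 0.6, 0.1])  # predicted probabilities (sum to 1)
ce = -np.sum(y_true * np.log(p_pred))    # only the correct-class term survives
print(f"cross-entropy: {ce:.4f}")        # -log(0.6) ≈ 0.5108

# MSE vs MAE for a small regression batch
y     = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 340.0])  # errors: 10, -10, 40
mse = np.mean((y_hat - y) ** 2)          # (100 + 100 + 1600) / 3 = 600.0
mae = np.mean(np.abs(y_hat - y))         # (10 + 10 + 40) / 3 = 20.0
print(f"MSE: {mse:.1f}, MAE: {mae:.1f}") # the 40-unit outlier dominates MSE
```

Note how the single 40-unit error contributes 1600 of MSE's 1800 total but only 40 of MAE's 60, which is exactly the outlier sensitivity described above.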

Core algorithms

The workhorse supervised learning algorithms and their key math:

Logistic Regression: linear model + sigmoid output, p(y=1|x) = σ(w·x + b) with σ(z) = 1/(1 + e^(-z)). Despite the name, it is a binary classifier. The weights w and bias b are learned by maximizing the log-likelihood (equivalently, minimizing binary cross-entropy).

Ensemble methods (e.g. Random Forest, Gradient Boosting): the prediction is a weighted sum of K base learners, F(x) = Σ_k α_k h_k(x) (for Random Forest, a simple average). Each base learner h_k is typically a shallow decision tree.

Comparing key supervised learning algorithms on a classification task

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import numpy as np

X, y = make_classification(n_samples=2000, n_features=20,
                            n_informative=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest":       RandomForestClassifier(n_estimators=100),
    "Gradient Boosting":   GradientBoostingClassifier(n_estimators=100),
    "SVM (RBF kernel)":    SVC(kernel='rbf', C=1.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:25s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Logistic Regression    : 0.854 ± 0.018   ← fast, interpretable, linear
# Random Forest          : 0.907 ± 0.013   ← handles nonlinearity, robust
# Gradient Boosting      : 0.921 ± 0.009   ← usually best on tabular data
# SVM (RBF kernel)       : 0.893 ± 0.014   ← strong with proper tuning

Practical rule

For tabular data, always try XGBoost or LightGBM first — they're state-of-the-art and require minimal preprocessing. For text, images, or audio, use pretrained neural networks. For very small datasets (<500 samples), logistic regression or SVM often beats complex models.

The training / validation / test split

Properly evaluating models requires strict data separation. The golden rule: the test set must never influence any training or model selection decision.

Proper train/val/test split with sklearn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# ── Step 1: split FIRST, preprocess second ──────────────
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15 of total → final split: 70% train, 15% val, 15% test

# ── Step 2: fit scaler on TRAINING data only ─────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform
X_val_scaled   = scaler.transform(X_val)          # transform only (no fit!)
X_test_scaled  = scaler.transform(X_test)         # transform only (no fit!)

# WRONG (data leakage): scaler.fit_transform(X) before splitting
# This leaks test set statistics into training — optimistic estimates

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

| Split | Typical size | Purpose |
|---|---|---|
| Training | 70–80% | The model fits its parameters on this data. |
| Validation | 10–15% | Tune hyperparameters, select models, decide early stopping. |
| Test | 10–15% | Final unbiased evaluation. Touch only once, at the very end. |

Key metrics for evaluation

Accuracy alone is misleading for imbalanced datasets. On fraud data where 99.9% of transactions are legitimate, a model that always predicts 'no fraud' gets 99.9% accuracy yet has zero utility. Use task-appropriate metrics:

Precision = TP / (TP + FP): of all predicted positives, how many are correct? Recall = TP / (TP + FN): of all actual positives, how many did we find? Improving one typically degrades the other: the precision-recall tradeoff.

F1 score: harmonic mean of precision and recall, F1 = 2PR / (P + R). The F_β score generalizes this: F_β = (1 + β²)PR / (β²P + R), where higher β weights recall more (useful when false negatives are more costly).
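A minimal sketch (with made-up predictions) showing these definitions agree with sklearn's implementations:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels: 4 true positives, 1 false positive,
# 2 false negatives, 3 true negatives
y_true = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # 4
fp = np.sum((y_true == 0) & (y_pred == 1))   # 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # 2

precision = tp / (tp + fp)                   # 4/5 = 0.80
recall    = tp / (tp + fn)                   # 4/6 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)

# Cross-check the hand computation against sklearn
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall,    recall_score(y_true, y_pred))
assert np.isclose(f1,        f1_score(y_true, y_pred))
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```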

Comprehensive classification report with all key metrics

from sklearn.metrics import (classification_report, roc_auc_score,
                              confusion_matrix, average_precision_score)
import numpy as np

# Assume y_test (true labels) and y_pred, y_prob (predicted labels + probabilities)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]   # probability of positive class

# Full classification report
print(classification_report(y_test, y_pred))
# Output:
#              precision  recall  f1-score  support
# class 0       0.93      0.95     0.94      150
# class 1       0.91      0.87     0.89      100
# accuracy                         0.92      250

# Additional metrics for imbalanced classes
print(f"AUC-ROC:              {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision:    {average_precision_score(y_test, y_prob):.4f}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")

| Metric | Best for | Interpretation |
|---|---|---|
| Accuracy | Balanced datasets | % of all predictions that are correct |
| Precision | When false positives are costly (spam filter) | % of predicted positives that are truly positive |
| Recall | When false negatives are costly (cancer screening) | % of actual positives that are detected |
| F1 | Imbalanced classes, roughly equal FP/FN cost | Harmonic mean of precision and recall |
| AUC-ROC | Ranking quality, any imbalance | Probability that the model ranks a random positive above a random negative (1.0 = perfect) |
| RMSE | Regression | Error in original units; penalizes outliers heavily |

Data quality is more important than algorithm choice

A common misconception: the algorithm drives model quality. In practice, data quality dominates by a wide margin. Clean, representative, correctly-labeled data with a 'good' algorithm consistently beats complex algorithms trained on poor data.

| Data quality issue | Effect | Detection |
|---|---|---|
| Label noise (5% mislabeled) | Significant performance drop, high variance | Cleanlab, manual review of confidently wrong predictions |
| Distribution shift (train ≠ test) | Model fails silently in production | Compare feature distributions with KS test or MMD |
| Class imbalance (99:1) | Model ignores minority class | Check per-class metrics, not just accuracy |
| Selection bias | Model learns spurious correlations | Audit the data collection process; hold out data from a different source |
| Target leakage | Artificially inflated test scores | Feature-importance audit; temporal split for time data |
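As an illustrative sketch of the distribution-shift check in the table, a two-sample Kolmogorov-Smirnov test can be run per feature with scipy.stats.ks_2samp (the "training" and "production" samples below are synthetic):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic samples of one feature: production data has drifted
# to a higher mean than the training data
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
prod_feature  = rng.normal(loc=0.5, scale=1.0, size=1000)

stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")

# A tiny p-value means the two samples are unlikely to come from
# the same distribution, i.e. this feature has shifted
if p_value < 0.01:
    print("Distribution shift detected for this feature")
```

In practice you would run this per feature and correct for multiple comparisons; a single significant feature is a signal to investigate, not proof the model is broken.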

The 80/20 rule of ML

In production ML projects, ~80% of time is spent on data — collection, cleaning, labeling, feature engineering. The model itself is 20%. Invest heavily in data quality before experimenting with model complexity.

Practice questions

  1. What is the difference between classification and regression as supervised learning tasks? (Answer: Classification: target Y is categorical — predict which class an example belongs to. Binary (spam/not spam) or multi-class (10 digits). Loss: cross-entropy. Output layer: softmax (multi-class) or sigmoid (binary). Regression: target Y is continuous — predict a real-valued quantity (house price, temperature). Loss: MSE or MAE. Output layer: linear (no activation). Some tasks are borderline: predicting a score (1–5 stars) can be classification or ordinal regression depending on the modelling assumption.)
  2. What is empirical risk minimisation (ERM) and what are its limitations? (Answer: ERM: minimise the average loss on the training set: θ* = argmin_θ (1/n)Σℓ(f_θ(xᵢ), yᵢ). Simple and computationally tractable. Limitations: (1) Overfitting: minimising training loss ≠ minimising test loss. (2) Distribution shift: training and test distributions may differ. (3) Label noise: ERM directly fits noisy labels. (4) Memorisation: on small datasets, ERM may memorise rather than generalise. Regularisation (add ||θ||² to loss) extends ERM to penalise complexity.)
  3. What is the train/validation/test split and why is it critical not to use the test set for model selection? (Answer: Train (60–70%): fit model parameters. Validation (15–20%): select hyperparameters, compare models, choose architecture. Test (15–20%): final unbiased estimate of generalisation. If you use the test set to select models (compare multiple models, choose the best), you are leaking test information into model selection — the test set is no longer held-out. The selected model will appear better than it truly is. The test set should be used EXACTLY ONCE — after all development decisions are made.)
  4. What is k-fold cross-validation and when should you use it instead of a fixed train/test split? (Answer: K-fold CV: split data into k equal parts; train on k-1 parts, test on the remaining part; rotate k times; average k test scores. Use when: dataset is small (<5000 examples) and a fixed test set would have high variance estimates. Provides: lower-variance performance estimate, uses all data for both training and evaluation. Computationally expensive (k× training runs). For large datasets (>100k examples): fixed split is sufficient and cross-validation adds unnecessary compute cost.)
  5. What is the difference between online learning and batch learning in supervised settings? (Answer: Batch learning: train on the entire training set simultaneously. Model is static after deployment. Common for most ML. Online learning: update the model one example (or mini-batch) at a time as data arrives. Handles non-stationary distributions (model adapts as the world changes). Examples: online gradient descent, Vowpal Wabbit, streaming perceptron. Required for: real-time systems (fraud detection adapting to new fraud patterns), very large datasets that don't fit in memory, systems where training data arrives continuously.)
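The k-fold procedure described in question 4 can be sketched with sklearn (assumptions: a small synthetic dataset and logistic regression as the model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small dataset (500 examples), where a fixed test split would give
# a high-variance performance estimate
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold CV: train on 4/5 of the data, score the held-out 1/5, rotate
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Average the 5 held-out scores for a lower-variance estimate
print(f"fold scores: {np.round(scores, 3)}")
print(f"mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Every example is used for evaluation exactly once and for training k-1 times, at the cost of k training runs.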

