
Supervised Learning

Learning from labeled examples.


Definition

Supervised learning is the most common ML paradigm, where a model learns from a dataset of input-output pairs (labeled examples). The model learns to map inputs to outputs by minimizing the difference between its predictions and the correct labels. Examples: classifying emails as spam/not-spam, predicting house prices from features, recognizing handwritten digits.

Classification vs regression

Supervised learning covers two fundamental task types, distinguished by the output type:

| Task | Output type | Loss function | Examples |
|---|---|---|---|
| Binary classification | One of two classes (0 or 1) | Binary cross-entropy | Spam detection, fraud detection |
| Multi-class classification | One of K classes | Categorical cross-entropy | Image recognition (1000 classes), digit recognition |
| Multi-label classification | Any subset of K classes | Binary cross-entropy per label | Emotion detection, topic tagging |
| Regression | Continuous numerical value | MSE or MAE | House price, stock prediction, age estimation |

Categorical cross-entropy loss for multi-class classification: L = -Σ_c y_c log(p_c), where y_c = 1 for the correct class and 0 otherwise, and p_c is the predicted probability for class c. Minimizing this maximizes the predicted probability of the correct class.

Mean Squared Error for regression: MSE = (1/n) Σ_i (ŷ_i - y_i)². It penalizes large errors quadratically: a 2× larger error gives 4× the penalty. Use MAE = (1/n) Σ_i |ŷ_i - y_i| instead when outliers should have less influence.
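As an illustrative sketch (the numbers below are made up for the example), both losses can be computed directly with NumPy:

```python
import numpy as np

# Categorical cross-entropy for one example:
# true class is index 2, model predicts probabilities over 4 classes
y_true = np.array([0, 0, 1, 0])          # one-hot label, y_c = 1 for correct class
p_pred = np.array([0.1, 0.2, 0.6, 0.1])  # predicted probabilities (sum to 1)
ce = -np.sum(y_true * np.log(p_pred))    # only the correct-class term survives
print(f"cross-entropy: {ce:.4f}")        # -log(0.6) ≈ 0.5108

# MSE vs MAE for a small regression batch
y     = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 340.0])  # errors: 10, -10, 40
mse = np.mean((y_hat - y) ** 2)          # (100 + 100 + 1600) / 3 = 600.0
mae = np.mean(np.abs(y_hat - y))         # (10 + 10 + 40) / 3 = 20.0
print(f"MSE: {mse:.1f}, MAE: {mae:.1f}") # the 40-unit outlier dominates MSE
```

Note how the single 40-unit error contributes 1600 of MSE's 1800 total but only 40 of MAE's 60, which is exactly the outlier sensitivity described above.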

Core algorithms

The workhorse supervised learning algorithms and their key math:

Logistic Regression: linear model + sigmoid output, p(y=1|x) = σ(w·x + b) with σ(z) = 1/(1 + e^(-z)). Despite the name, it is a binary classifier. The weights w and bias b are learned by maximizing the log-likelihood (equivalently, minimizing binary cross-entropy).

Ensemble methods (e.g. Random Forest, Gradient Boosting): the prediction is a weighted sum of K base learners, F(x) = Σ_k α_k h_k(x) (for Random Forest, a simple average). Each base learner h_k is typically a shallow decision tree.

Comparing key supervised learning algorithms on a classification task

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import numpy as np

X, y = make_classification(n_samples=2000, n_features=20,
                            n_informative=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest":       RandomForestClassifier(n_estimators=100),
    "Gradient Boosting":   GradientBoostingClassifier(n_estimators=100),
    "SVM (RBF kernel)":    SVC(kernel='rbf', C=1.0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:25s}: {scores.mean():.3f} ± {scores.std():.3f}")

# Logistic Regression    : 0.854 ± 0.018   ← fast, interpretable, linear
# Random Forest          : 0.907 ± 0.013   ← handles nonlinearity, robust
# Gradient Boosting      : 0.921 ± 0.009   ← usually best on tabular data
# SVM (RBF kernel)       : 0.893 ± 0.014   ← strong with proper tuning

Practical rule

For tabular data, always try XGBoost or LightGBM first — they're state-of-the-art and require minimal preprocessing. For text, images, or audio, use pretrained neural networks. For very small datasets (<500 samples), logistic regression or SVM often beats complex models.

The training / validation / test split

Properly evaluating models requires strict data separation. The golden rule: the test set must never influence any training or model selection decision.

Proper train/val/test split with sklearn

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# ── Step 1: split FIRST, preprocess second ──────────────
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp)
# 0.176 of 0.85 ≈ 0.15 of total → final split: 70% train, 15% val, 15% test

# ── Step 2: fit scaler on TRAINING data only ─────────────
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform
X_val_scaled   = scaler.transform(X_val)          # transform only (no fit!)
X_test_scaled  = scaler.transform(X_test)         # transform only (no fit!)

# WRONG (data leakage): scaler.fit_transform(X) before splitting
# This leaks test set statistics into training — optimistic estimates

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

| Split | Typical size | Purpose |
|---|---|---|
| Training | 70–80% | The model fits its parameters on this data. |
| Validation | 10–15% | Tune hyperparameters, select models, decide early stopping. |
| Test | 10–15% | Final unbiased evaluation. Touch only once, at the very end. |

Key metrics for evaluation

Accuracy alone is misleading for imbalanced datasets. On fraud data where 99.9% of transactions are legitimate, a model that always predicts 'no fraud' gets 99.9% accuracy yet has zero utility. Use task-appropriate metrics:

Precision = TP / (TP + FP): of all predicted positives, how many are correct? Recall = TP / (TP + FN): of all actual positives, how many did we find? Improving one typically degrades the other: the precision-recall tradeoff.

F1 score: harmonic mean of precision and recall, F1 = 2PR / (P + R). The F_β score generalizes this: F_β = (1 + β²)PR / (β²P + R), where higher β weights recall more (useful when false negatives are more costly).
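A minimal sketch (with made-up predictions) showing these definitions agree with sklearn's implementations:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary labels: 4 true positives, 1 false positive,
# 2 false negatives, 3 true negatives
y_true = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))   # 4
fp = np.sum((y_true == 0) & (y_pred == 1))   # 1
fn = np.sum((y_true == 1) & (y_pred == 0))   # 2

precision = tp / (tp + fp)                   # 4/5 = 0.80
recall    = tp / (tp + fn)                   # 4/6 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)

# Cross-check the hand computation against sklearn
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall,    recall_score(y_true, y_pred))
assert np.isclose(f1,        f1_score(y_true, y_pred))
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```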

Comprehensive classification report with all key metrics

from sklearn.metrics import (classification_report, roc_auc_score,
                              confusion_matrix, average_precision_score)
import numpy as np

# Assume y_test (true labels) and y_pred, y_prob (predicted labels + probabilities)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]   # probability of positive class

# Full classification report
print(classification_report(y_test, y_pred))
# Output:
#              precision  recall  f1-score  support
# class 0       0.93      0.95     0.94      150
# class 1       0.91      0.87     0.89      100
# accuracy                         0.92      250

# Additional metrics for imbalanced classes
print(f"AUC-ROC:              {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision:    {average_precision_score(y_test, y_prob):.4f}")
print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred)}")

| Metric | Best for | Interpretation |
|---|---|---|
| Accuracy | Balanced datasets | % of all predictions that are correct |
| Precision | When false positives are costly (spam filter) | % of predicted positives that are truly positive |
| Recall | When false negatives are costly (cancer screening) | % of actual positives that are detected |
| F1 | Imbalanced classes, roughly equal FP/FN cost | Harmonic mean of precision and recall |
| AUC-ROC | Ranking quality, any imbalance | Probability that the model ranks a random positive above a random negative (1.0 = perfect) |
| RMSE | Regression | Error in original units; penalizes outliers heavily |

Data quality is more important than algorithm choice

A common misconception: the algorithm drives model quality. In practice, data quality dominates by a wide margin. Clean, representative, correctly-labeled data with a 'good' algorithm consistently beats complex algorithms trained on poor data.

| Data quality issue | Effect | Detection |
|---|---|---|
| Label noise (5% mislabeled) | Significant performance drop, high variance | Cleanlab, manual review of confidently wrong predictions |
| Distribution shift (train ≠ test) | Model fails silently in production | Compare feature distributions with KS test or MMD |
| Class imbalance (99:1) | Model ignores minority class | Check per-class metrics, not just accuracy |
| Selection bias | Model learns spurious correlations | Audit the data collection process; hold out data from a different source |
| Target leakage | Artificially inflated test scores | Feature-importance audit; temporal split for time data |
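As an illustrative sketch of the distribution-shift check in the table, a two-sample Kolmogorov-Smirnov test can be run per feature with scipy.stats.ks_2samp (the "training" and "production" samples below are synthetic):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic samples of one feature: production data has drifted
# to a higher mean than the training data
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
prod_feature  = rng.normal(loc=0.5, scale=1.0, size=1000)

stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")

# A tiny p-value means the two samples are unlikely to come from
# the same distribution, i.e. this feature has shifted
if p_value < 0.01:
    print("Distribution shift detected for this feature")
```

In practice you would run this per feature and correct for multiple comparisons; a single significant feature is a signal to investigate, not proof the model is broken.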

The 80/20 rule of ML

In production ML projects, ~80% of time is spent on data — collection, cleaning, labeling, feature engineering. The model itself is 20%. Invest heavily in data quality before experimenting with model complexity.

Practice questions

  1. What is the difference between classification and regression as supervised learning tasks? (Answer: Classification: target Y is categorical — predict which class an example belongs to. Binary (spam/not spam) or multi-class (10 digits). Loss: cross-entropy. Output layer: softmax (multi-class) or sigmoid (binary). Regression: target Y is continuous — predict a real-valued quantity (house price, temperature). Loss: MSE or MAE. Output layer: linear (no activation). Some tasks are borderline: predicting a score (1–5 stars) can be classification or ordinal regression depending on the modelling assumption.)
  2. What is empirical risk minimisation (ERM) and what are its limitations? (Answer: ERM: minimise the average loss on the training set: θ* = argmin_θ (1/n)Σℓ(f_θ(xᵢ), yᵢ). Simple and computationally tractable. Limitations: (1) Overfitting: minimising training loss ≠ minimising test loss. (2) Distribution shift: training and test distributions may differ. (3) Label noise: ERM directly fits noisy labels. (4) Memorisation: on small datasets, ERM may memorise rather than generalise. Regularisation (add ||θ||² to loss) extends ERM to penalise complexity.)
  3. What is the train/validation/test split and why is it critical not to use the test set for model selection? (Answer: Train (60–70%): fit model parameters. Validation (15–20%): select hyperparameters, compare models, choose architecture. Test (15–20%): final unbiased estimate of generalisation. If you use the test set to select models (compare multiple models, choose the best), you are leaking test information into model selection — the test set is no longer held-out. The selected model will appear better than it truly is. The test set should be used EXACTLY ONCE — after all development decisions are made.)
  4. What is k-fold cross-validation and when should you use it instead of a fixed train/test split? (Answer: K-fold CV: split data into k equal parts; train on k-1 parts, test on the remaining part; rotate k times; average k test scores. Use when: dataset is small (<5000 examples) and a fixed test set would have high variance estimates. Provides: lower-variance performance estimate, uses all data for both training and evaluation. Computationally expensive (k× training runs). For large datasets (>100k examples): fixed split is sufficient and cross-validation adds unnecessary compute cost.)
  5. What is the difference between online learning and batch learning in supervised settings? (Answer: Batch learning: train on the entire training set simultaneously. Model is static after deployment. Common for most ML. Online learning: update the model one example (or mini-batch) at a time as data arrives. Handles non-stationary distributions (model adapts as the world changes). Examples: online gradient descent, Vowpal Wabbit, streaming perceptron. Required for: real-time systems (fraud detection adapting to new fraud patterns), very large datasets that don't fit in memory, systems where training data arrives continuously.)
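The k-fold procedure described in question 4 can be sketched with sklearn (assumptions: a small synthetic dataset and logistic regression as the model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Small dataset (500 examples), where a fixed test split would give
# a high-variance performance estimate
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold CV: train on 4/5 of the data, score the held-out 1/5, rotate
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Average the 5 held-out scores for a lower-variance estimate
print(f"fold scores: {np.round(scores, 3)}")
print(f"mean accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Every example is used for evaluation exactly once and for training k-1 times, at the cost of k training runs.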

