
Cross-Validation

The gold standard for honest model evaluation.


Definition

Cross-validation is a technique for evaluating ML model performance that uses the training data more efficiently than a single train/test split. By repeatedly training and evaluating on different subsets of the data, it provides a more reliable estimate of generalization performance — especially important when training data is limited.

K-fold cross-validation

K-fold CV provides a robust generalization estimate by using every data point exactly once as a validation example. The final metric is the mean ± standard deviation across k folds:

CV score = (1/k) · Σᵢ₌₁ᵏ score(f₋ᵢ, Dᵢ)

where f₋ᵢ is the model trained on all folds except fold i, and Dᵢ is validation fold i. The standard deviation across the k scores gives a rough confidence interval on performance.
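The formula above can be sketched manually; the logistic regression model and toy dataset below are just illustrative stand-ins:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # f_{-i}: model trained on every fold except fold i
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # evaluated on the held-out fold D_i
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(f"CV score: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

In practice `cross_val_score` wraps exactly this loop; writing it out once makes clear that each data point is validated exactly once.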

Proper 5-fold stratified cross-validation with multiple metrics

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=1000, n_features=20,
                            n_informative=10, weights=[0.7, 0.3],  # imbalanced!
                            random_state=42)

model = GradientBoostingClassifier(n_estimators=100, max_depth=3)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Stratified: preserves class ratio in each fold — essential for imbalanced data

results = cross_validate(
    model, X, y,
    cv=cv,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True
)

for metric in ['accuracy', 'f1', 'roc_auc']:
    val = results[f'test_{metric}']
    print(f"{metric:12s}: {val.mean():.4f} ± {val.std():.4f}  "
          f"(train: {results[f'train_{metric}'].mean():.4f})")

# accuracy    : 0.8710 ± 0.0142  (train: 0.9982)  ← overfitting!
# f1          : 0.7893 ± 0.0231  (train: 0.9964)
# roc_auc     : 0.9251 ± 0.0189  (train: 0.9999)

Why stratified CV matters

If your dataset has 90% class 0 and 10% class 1, random folds might have 0 or very few class 1 examples — leading to meaningless metrics. StratifiedKFold guarantees each fold has the same class ratio as the full dataset. Always use it for classification.
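A quick way to see the effect, using a deliberately imbalanced toy dataset (the 90/10 split is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 10% positives
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

for name, splitter in [
    ("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
]:
    # Fraction of positives in each validation fold
    ratios = [y[val].mean() for _, val in splitter.split(X, y)]
    print(f"{name:16s} positive ratio per fold: {np.round(ratios, 2)}")
```

The stratified folds all land within a hair of the overall positive rate, while plain KFold drifts from fold to fold.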

Leave-one-out CV (LOOCV)

LOOCV is the extreme case k=n: each single example is the entire validation set for one training run. This maximizes training data per fold but has high computational cost and high-variance estimates:

| Method | k value | Training data per fold | Compute cost | Estimate variance | Best use case |
|---|---|---|---|---|---|
| Hold-out | (single split) | 70-80% | 1× (cheapest) | High | Large datasets (n > 100K) |
| 5-fold CV | k=5 | 80% | 5× | Medium | Standard, good balance |
| 10-fold CV | k=10 | 90% | 10× | Lower | When evaluation precision matters |
| LOOCV | k=n | (n-1)/n | n× (expensive) | High (single-example test) | Very small datasets (n < 100) |

LOOCV — only practical for small datasets or fast models

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# LOOCV is practical only for small n or very fast models
X_small = X[:100]   # 100 examples
y_small = y[:100]

loo = LeaveOneOut()
model = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(model, X_small, y_small,
                         cv=loo, scoring='accuracy')

print(f"LOOCV: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Number of folds: {loo.get_n_splits(X_small)}")  # = 100

# Each test set = 1 example → high variance in individual scores (0 or 1)
# but averaged over 100 folds gives a low-bias estimate

Time-series cross-validation

Standard k-fold randomly shuffles data — this causes data leakage in time series, where future information would incorrectly be used to predict the past. Time-series CV must respect temporal order:

Walk-forward (expanding window) cross-validation for time series

from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

# Simulated time series data: predict next day's value
n = 500
time_steps = np.arange(n)
y_ts = np.sin(time_steps / 20) + np.random.normal(0, 0.1, n)

# Create lag features (past 5 values as features)
X_ts = np.column_stack([y_ts[i:n-5+i] for i in range(5)])
y_ts_target = y_ts[5:]

tscv = TimeSeriesSplit(n_splits=5, gap=0)
model = GradientBoostingRegressor(n_estimators=100)

scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_ts)):
    X_tr, X_va = X_ts[train_idx], X_ts[val_idx]
    y_tr, y_va = y_ts_target[train_idx], y_ts_target[val_idx]

    model.fit(X_tr, y_tr)
    score = model.score(X_va, y_va)   # R²
    scores.append(score)
    print(f"Fold {fold+1}: train={len(train_idx)}, val={len(val_idx)}, R²={score:.4f}")

print(f"Mean R²: {np.mean(scores):.4f} ± {np.std(scores):.4f}")

Look-ahead bias kills time-series models

Using a future value to predict the past is the worst form of data leakage — the model appears to have 95% accuracy in validation but fails completely in production. Always: (1) sort by time, (2) split so validation is strictly after training, (3) respect any embargo/gap between train end and val start.
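Point (3) can be enforced with TimeSeriesSplit's gap parameter; the embargo of 5 samples below is an arbitrary illustration, size it to match your feature lookback:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)   # data already sorted by time

# gap=5 drops 5 samples between the end of each training window and the
# start of validation, so lag features built from the last training rows
# cannot overlap the validation period
tscv = TimeSeriesSplit(n_splits=4, gap=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    assert train_idx.max() + 5 < val_idx.min()   # embargo respected
    print(f"Fold {fold+1}: train ends t={train_idx.max()}, val starts t={val_idx.min()}")
```

The earlier walk-forward example used gap=0, which is fine when features at time t use only values strictly before t.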

Common validation mistakes (data leakage)

Data leakage is when information from the test/validation set illegally flows into training — producing artificially optimistic performance estimates that evaporate in production.

| Leakage type | How it happens | Fix |
|---|---|---|
| Preprocessing leakage | Fit scaler/encoder on full dataset before splitting | Always split first, then fit preprocessing on train only |
| Target leakage | Feature that is only available after the target is known | Audit features; remove any that wouldn't be available at prediction time |
| Look-ahead (time series) | Training on data from the future, testing on the past | Always use walk-forward or expanding-window CV |
| Train-test contamination | Test examples accidentally appear in training data | Deduplicate datasets, audit data pipelines |
| Proxy leakage | Feature is a proxy for the label (not the label itself) | Domain-knowledge audit; check feature-target correlations carefully |

Detecting preprocessing leakage with a pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)

# ✅ CORRECT: scaler inside pipeline — fit only on train fold each time
correct_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
correct_scores = cross_val_score(correct_pipeline, X, y, cv=5)

# ❌ WRONG: scaler fit on full X before CV — leaks test statistics
X_scaled_wrong = StandardScaler().fit_transform(X)   # sees ALL data!
wrong_scores = cross_val_score(LogisticRegression(), X_scaled_wrong, y, cv=5)

print(f"Correct (no leakage): {correct_scores.mean():.4f}")
print(f"Wrong (leakage):      {wrong_scores.mean():.4f}")
# The gap reveals how much the leakage inflated scores

Model selection with cross-validation

CV enables principled hyperparameter tuning and model selection without touching the test set. Nested CV is the gold standard for unbiased model selection:

Nested cross-validation for unbiased model selection

from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# ── Nested CV: outer loop evaluates, inner loop selects ─────────
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 5],
    'learning_rate': [0.05, 0.1, 0.2],
}

# Inner CV selects best hyperparameters; outer CV evaluates the selection process
model = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid=param_grid,
    cv=inner_cv,
    scoring='accuracy',
    n_jobs=-1
)

# Outer loop gives unbiased estimate of selected model's true performance
nested_scores = cross_val_score(model, X, y, cv=outer_cv)
print(f"Nested CV: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

# Now fit on ALL data with inner CV to get final model
model.fit(X, y)
print(f"Best params: {model.best_params_}")

Nested CV vs simple CV

Simple CV (select params by CV score, report that same CV score) is optimistically biased — you've implicitly used the validation folds to choose your model. Nested CV keeps the outer folds blind to hyperparameter selection, giving a truly unbiased generalization estimate. It's 5–10× more expensive but essential for rigorous reporting.
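The optimism gap can be measured directly by comparing the inner search's best_score_ against the nested outer score. The small decision-tree grid below is an illustrative stand-in and exact numbers will vary:

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {'max_depth': [1, 2, 3, 5, None]},
                    cv=KFold(n_splits=3, shuffle=True, random_state=0))

# Simple CV: the same folds both select max_depth and report the score
grid.fit(X, y)
simple = grid.best_score_

# Nested CV: outer folds never influence the hyperparameter choice
nested = cross_val_score(grid, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0)).mean()

print(f"simple CV (biased): {simple:.4f}")
print(f"nested CV:          {nested:.4f}")
```

On most datasets the nested number comes out lower; the difference is the optimism introduced by selecting hyperparameters on the folds you report.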

Practice questions

  1. What is the difference between k-fold and stratified k-fold cross-validation? (Answer: K-fold: randomly split data into k folds. May result in class imbalance within folds if the dataset is imbalanced. Stratified k-fold: ensure each fold has the same class proportion as the full dataset. Critical for imbalanced datasets: if 5% positive class, each fold has ~5% positives. Without stratification: some folds may have 0% or 10% positives, giving highly variable and unreliable estimates. Sklearn default: StratifiedKFold for classification tasks.)
  2. What is leave-one-out cross-validation (LOOCV) and when is it appropriate? (Answer: LOOCV: k = n (number of examples). Train on n-1 examples, test on the left-out one, repeat n times. Advantages: uses maximum training data, nearly unbiased estimate of generalisation. Disadvantages: computationally expensive (n training runs). High variance: each model differs by only one example. Use when: very small dataset (<50 examples) where leave-out test sets would be too small to estimate performance. For larger datasets: k=5 or k=10 fold CV is preferred.)
  3. What is nested cross-validation and why is it needed for unbiased hyperparameter tuning? (Answer: Standard CV: use validation folds for hyperparameter selection, report the best model's CV score. Problem: that estimate is optimistically biased — you selected the hyperparameters that happened to work best on those folds. Nested CV: outer loop (k folds) estimates generalisation; inner loop (k' folds within each outer training set) selects hyperparameters. Each outer fold gives an unbiased estimate of the generalisation of the full training procedure, including hyperparameter selection. More expensive but honest.)
  4. What is the 'test set contamination' risk in cross-validation? (Answer: Contamination: information from the test fold leaks into preprocessing steps — normalisation, feature selection, dimensionality reduction. Example: compute PCA on the full dataset, THEN split for CV. The PCA transformation has seen test data — test fold is not truly held-out. Correct approach: embed preprocessing INSIDE the CV pipeline (sklearn Pipeline). PCA is refit on each training fold only. The test fold is never seen by any preprocessing step. This is the most common CV implementation mistake.)
  5. A dataset has 1000 examples. Compare 5-fold CV, 10-fold CV, and LOOCV on training set size, evaluation stability, and compute cost. (Answer: 5-fold CV: 800 training / 200 test per fold. 5 training runs. Moderate variance in estimates. Fastest. 10-fold CV: 900 training / 100 test per fold. 10 training runs. Lower variance than 5-fold (larger training, smaller test variability). 2× cost of 5-fold. LOOCV: 999 training / 1 test per fold. 1000 training runs. Nearly unbiased but high variance (single-example test sets are noisy). 200× cost of 5-fold. Practical choice: 5-fold or 10-fold for most use cases.)
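As a footnote to question 4, the leakage-free version of the PCA example is a one-line change with Pipeline (the component count here is illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# PCA lives inside the pipeline, so it is refit on each training fold only;
# the held-out fold is never seen by the dimensionality-reduction step
pipe = Pipeline([('pca', PCA(n_components=10)),
                 ('clf', LogisticRegression(max_iter=1000))])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leakage-free CV accuracy: {scores.mean():.4f}")
```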

