Machine learning (ML) is a branch of AI where systems learn patterns from data to make predictions or decisions, without being explicitly programmed with rules for each case. Instead of writing 'if-then' logic for every scenario, ML algorithms find statistical patterns in training data and use those patterns to generalize to new, unseen examples.
## ML vs traditional programming
The paradigm shift from traditional programming to machine learning:
| Paradigm | Input | Process | Output |
|---|---|---|---|
| Traditional programming | Data + Rules (code) | Deterministic execution | Output/answers |
| Machine learning | Data + Desired outputs | Optimization algorithm finds rules | A model (the learned rules) |
Traditional programming excels when rules are well-defined and enumerable — sorting algorithms, tax calculation, physics simulations. ML excels when rules are too complex or numerous to write explicitly: recognizing handwriting (the rules for what makes an 'A' vs a 'B' are nearly impossible to hand-code for all fonts, sizes, and styles), understanding natural language, predicting customer churn. The rule of thumb: if you'd need to write thousands of if-statements, use ML instead.
### Traditional programming vs ML: spam detection

```python
# ── Traditional: hand-coded rules ─────────────────────────
def is_spam_traditional(email: str) -> bool:
    spam_words = ["buy now", "click here", "free money", "limited offer"]
    return any(word in email.lower() for word in spam_words)
# Problem: misses novel spam phrases, has false positives, needs constant updates

# ── ML: learned rules from labeled data ────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

spam_detector = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ('clf', LogisticRegression(C=1.0, max_iter=1000))
])

# Training data: (email_text, is_spam_label) pairs — model learns rules automatically
spam_detector.fit(train_emails, train_labels)  # learns from thousands of examples

# At inference:
prediction = spam_detector.predict_proba(["Buy Bitcoin now! Click here!"])
print(f"Spam probability: {prediction[0][1]:.3f}")
```

## The four types of ML
| Paradigm | Training signal | Core question | Examples |
|---|---|---|---|
| Supervised | Labeled (input, output) pairs | What output does this input map to? | Classification, regression, detection |
| Unsupervised | Unlabeled data only | What structure exists in this data? | Clustering, dimensionality reduction, anomaly detection |
| Reinforcement | Reward signals from environment | What sequence of actions maximizes reward? | Games, robotics, LLM alignment (RLHF) |
| Self-supervised | Labels derived from data itself | Predict masked/future parts of input | LLMs (next-token), BERT (masked tokens), MAE (image patches) |
### Self-supervised learning is the foundation of LLMs
GPT, BERT, and all modern LLMs use self-supervised learning: the training "labels" come from the data itself (predict the next word, predict a masked word). No human labeling needed — the entire internet is a self-labeled training set. This is why such massive scale was achievable.
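A toy sketch of how the "labels" fall out of the data itself in next-token prediction: training pairs are built directly from a raw token sequence, with no annotation step (the helper name is illustrative):

```python
def next_token_pairs(tokens):
    """Derive (context, target) training pairs from raw text alone:
    the 'label' at each position is simply the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)
# e.g. the second pair is (["the", "cat"], "sat"): no human labeling involved
```

Real LLM pretraining does exactly this at scale, over token IDs rather than words, which is why any large text corpus doubles as a training set.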
## The ML workflow
A production ML project is rarely straightforward. Here is the standard iterative workflow:
- Problem definition — what are you predicting? What metric matters? What data do you have? What's the cost of false positives vs false negatives?
- Data collection & cleaning — remove duplicates, handle missing values, correct label errors. This step typically consumes 50–80% of total project time.
- Exploratory data analysis (EDA) — visualize distributions, identify correlations, spot outliers and anomalies, understand class balance.
- Feature engineering — transform raw data into informative numerical inputs. Often the highest-leverage step for classical ML.
- Model selection & training — start simple (logistic regression, gradient boosting) before trying complex models. Use cross-validation to compare.
- Evaluation — measure on held-out test set with task-appropriate metrics. Check performance on important data slices, not just overall.
- Deployment & monitoring — monitor for data drift, model degradation, and feedback loops. Retrain when performance drops.
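The model selection and evaluation steps above can be sketched end to end with scikit-learn; synthetic data stands in for a real, cleaned dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a collected and cleaned dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set BEFORE any modeling decisions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Model selection: start simple, compare candidates via cross-validation
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final evaluation on the untouched test set
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
```

The key discipline is that the test set is split off first and touched only once, at the end; all tuning happens inside cross-validation on the training portion.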
### Don't skip EDA
Jumping straight to modeling without thorough EDA is the most common beginner mistake. EDA reveals data quality issues, appropriate model families, and useful features — often making the difference between a model that works and one that doesn't.
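A minimal EDA pass along these lines, assuming a pandas DataFrame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative toy frame; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 31, 52],
    "income": [40_000, 55_000, 61_000, 61_000, 55_000, 120_000],
    "churned": [0, 0, 1, 0, 0, 1],
})

missing = df.isna().sum()                                   # missing values per column
dupes = int(df.duplicated().sum())                          # exact duplicate rows
class_balance = df["churned"].value_counts(normalize=True)  # class balance
corr = df[["age", "income"]].corr()                         # pairwise correlations
summary = df.describe()                                     # distributions, outlier hints
```

Even this handful of one-liners surfaces a missing value, a duplicate row, and class imbalance before any model is trained.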
## Classical ML vs deep learning
| Property | Classical ML (XGBoost, SVM, RF) | Deep Learning (Neural Nets) |
|---|---|---|
| Best data type | Tabular / structured | Images, text, audio, video |
| Data requirements | 100–10K labeled examples often enough | Usually needs 10K–1M+ examples |
| Feature engineering | Manual — critical to success | Automatic — learns from raw input |
| Training time | Seconds to minutes on CPU | Hours to weeks on GPU/TPU |
| Interpretability | High (SHAP values, feature importance) | Low (black box by default) |
| Inference speed | Very fast, runs on CPU | Can be slow; requires GPU for large models |
| Hyperparameter tuning | Moderate | Extensive (architecture, lr, regularization) |
### Practical decision rule: when to use each approach

```python
def choose_ml_approach(data_type: str, n_samples: int, need_interpretability: bool) -> str:
    """Rule of thumb for ML approach selection."""
    if data_type == "tabular":
        if need_interpretability:
            return "Logistic Regression or shallow Decision Tree"
        elif n_samples < 1000:
            return "XGBoost or Random Forest (few samples — no DL)"
        else:
            return "XGBoost / LightGBM — almost always best on tabular"
    elif data_type in ("text", "code"):
        return "Pretrained Transformer (fine-tune BERT/RoBERTa/LLaMA)"
    elif data_type == "image":
        if n_samples < 5000:
            return "Pretrained CNN or ViT (transfer learning)"
        else:
            return "Fine-tuned ViT or ResNet-50"
    elif data_type == "time_series":
        if n_samples < 10000:
            return "XGBoost with time-lag features"
        else:
            return "Temporal Fusion Transformer or N-BEATS"
    return "Start with XGBoost, escalate to DL if needed"
```

## Generalization: the core challenge
Generalization — performing well on unseen data — is the entire point of ML. A model that only works on its training data is useless. The generalization gap is defined as:

generalization gap = L_test − L_train

where L_train is the average loss on the training set (the empirical risk) and L_test is the loss on held-out data. A large positive gap means overfitting. A small gap with both losses high means underfitting. The ideal is a small gap with both losses low.
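The gap, test loss minus training loss, can be measured directly by comparing the two losses across model capacities. A sketch with polynomial regression on noisy data (degrees and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)  # noisy sine
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def fit_and_score(degree):
    """Return (train MSE, test MSE); their difference is the generalization gap."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    return (mean_squared_error(y_tr, model.predict(X_tr)),
            mean_squared_error(y_te, model.predict(X_te)))

tr_simple, te_simple = fit_and_score(1)    # low capacity: both losses tend to be high
tr_complex, te_complex = fit_and_score(15) # high capacity: train loss drops, gap typically widens
```

Because the degree-15 feature set contains the degree-1 features, the higher-capacity model is guaranteed to fit the training set at least as well; whether its test loss also improves is exactly what the gap measures.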
PAC learning theory (Valiant, 1984) provides theoretical bounds on generalization. For a finite hypothesis class H, with probability at least 1 − δ, the test error of an ERM model f satisfies:

L_test(f) ≤ L_train(f) + √((ln|H| + ln(1/δ)) / (2n))

Here n is the training set size and |H| is the hypothesis class complexity. The gap term shrinks as n grows: more data always tightens the bound, which is the theoretical foundation for why big datasets matter.
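The PAC gap term √((ln|H| + ln(1/δ)) / (2n)) can be evaluated numerically to see how it scales with data (the values plugged in are illustrative):

```python
import math

def pac_gap_bound(n: int, hypothesis_count: int, delta: float = 0.05) -> float:
    """Hoeffding-style PAC bound on the generalization gap for a
    finite hypothesis class: sqrt((ln|H| + ln(1/delta)) / (2n))."""
    return math.sqrt((math.log(hypothesis_count) + math.log(1 / delta)) / (2 * n))

# For fixed |H| and delta, the bound shrinks as 1/sqrt(n)
b_small = pac_gap_bound(n=1_000, hypothesis_count=10**6)
b_large = pac_gap_bound(n=100_000, hypothesis_count=10**6)
```

Since the bound scales as 1/√n, quadrupling the training set halves the guaranteed gap, an exact illustration of "more data always improves the bound".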
### Distribution shift in deployment
Generalization in practice means more than test set performance. Distribution shift — when production data differs from training data — is the most common cause of ML system failures. Examples: a fraud model trained on 2020 spending patterns fails in 2024. A medical model trained on one hospital's data fails on another's. Always evaluate on data from the actual deployment distribution.
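One common way to catch such shift in production is to compare feature distributions between a training reference window and live data. A sketch using the two-sample Kolmogorov–Smirnov test from SciPy (the data and alert threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen in training
live_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)   # mean-shifted production data

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = ks_2samp(train_feature, live_feature)
drift_detected = p_value < 0.01  # illustrative alert threshold
```

In a real monitoring pipeline this check would run per feature on a schedule, with shifted features triggering investigation or retraining.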
## Practice questions
- What is the fundamental goal of machine learning formally stated? (Answer: A computer program learns from experience E with respect to task T, measured by performance P, if P on T improves with E (Mitchell 1997). Formal objective: find a function f: X → Y (hypothesis) from hypothesis space H that minimises expected loss L = E[ℓ(f(x), y)] over the true data distribution P(X,Y). Since we can't access P directly, we minimise empirical risk: (1/n)Σℓ(f(xᵢ), yᵢ) over training data and generalise via regularisation.)
- What is the difference between a model, a hypothesis, and a parameter in ML? (Answer: Model (architecture): the family of functions the learning algorithm can produce (e.g., linear models, neural networks, decision trees). Defines the hypothesis space H. Hypothesis: a specific function f_θ ∈ H selected by training — the model with specific parameter values. Parameter θ: the numerical values learned during training (weights, biases, split thresholds). Training finds θ* = argmin_θ L(θ). The model is the structure; the hypothesis is the structure + specific learned parameters.)
- What is the no-free-lunch theorem and what does it mean practically? (Answer: No-free-lunch theorem (Wolpert & Macready): no ML algorithm performs better than random on ALL possible problems — averaged over all possible data distributions. Any algorithm that outperforms on some problems must underperform on others. Practical meaning: there is no universally best ML algorithm. You must choose algorithms based on domain knowledge and assumptions about your data. This is why practitioners try multiple models rather than always using one; domain-specific inductive biases (CNNs for images, transformers for text) exploit assumptions about the problem structure.)
- What is the bias-variance trade-off and how does it manifest in model selection? (Answer: Total error = Bias² + Variance + Irreducible noise. Bias: systematic error from wrong assumptions in the model (underfitting). Variance: sensitivity to training data noise (overfitting). High bias (simple model, e.g., linear regression on non-linear data): consistent but systematically wrong predictions. High variance (complex model, e.g., depth-20 decision tree): accurate on training data but wildly different on test data. Model selection navigates this: polynomial degree, regularisation strength, tree depth, neural network size all trade off bias vs variance.)
- What is the difference between a generative model and a discriminative model? (Answer: Discriminative: models P(Y|X) directly — the conditional probability of class given features. Directly optimised for the classification/regression task. Examples: logistic regression, SVM, neural classifiers. Generative: models P(X,Y) = P(X|Y)P(Y) — the joint distribution. Can generate new examples and handle missing features. Examples: Naive Bayes, GMM, HMM, VAE, GAN. Discriminative models usually achieve better classification accuracy (optimised for the task); generative models offer broader capabilities (sampling, density estimation, imputation).)
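To make the last distinction concrete, a sketch that trains one model of each kind on the same synthetic data (scikit-learn; the dataset parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # discriminative: models P(Y|X) directly
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB           # generative: models P(X|Y)P(Y)

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

disc_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
gen_acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
# Both solve the same classification task; only the generative model also
# estimates per-class feature distributions (means/variances) it could sample from.
```

Both reach reasonable accuracy here; the practical difference shows up in what else the model can do (sampling, handling missing features) rather than in the fit/predict interface.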