Machine learning (ML) is a branch of AI where systems learn patterns from data to make predictions or decisions, without being explicitly programmed with rules for each case. Instead of writing 'if-then' logic for every scenario, ML algorithms find statistical patterns in training data and use those patterns to generalize to new, unseen examples.
## ML vs traditional programming
The paradigm shift from traditional programming to machine learning:
| Paradigm | Input | Process | Output |
|---|---|---|---|
| Traditional programming | Data + Rules (code) | Deterministic execution | Output/answers |
| Machine learning | Data + Desired outputs | Optimization algorithm finds rules | A model (the learned rules) |
Traditional programming excels when rules are well-defined and enumerable — sorting algorithms, tax calculation, physics simulations. ML excels when rules are too complex or numerous to write explicitly: recognizing handwriting (the rules for what makes an 'A' vs a 'B' are nearly impossible to hand-code for all fonts, sizes, and styles), understanding natural language, predicting customer churn. The rule of thumb: if you'd need to write thousands of if-statements, use ML instead.
### Traditional programming vs ML: spam detection

```python
# ── Traditional: hand-coded rules ─────────────────────────
def is_spam_traditional(email: str) -> bool:
    spam_words = ["buy now", "click here", "free money", "limited offer"]
    return any(word in email.lower() for word in spam_words)
# Problem: misses novel spam phrases, has false positives, needs constant updates

# ── ML: learned rules from labeled data ────────────────────
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

spam_detector = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ('clf', LogisticRegression(C=1.0, max_iter=1000))
])

# Training data: (email_text, is_spam_label) pairs — model learns rules automatically
spam_detector.fit(train_emails, train_labels)  # learns from thousands of examples

# At inference:
prediction = spam_detector.predict_proba(["Buy Bitcoin now! Click here!"])
print(f"Spam probability: {prediction[0][1]:.3f}")
```

## The four types of ML
| Paradigm | Training signal | Core question | Examples |
|---|---|---|---|
| Supervised | Labeled (input, output) pairs | What output does this input map to? | Classification, regression, detection |
| Unsupervised | Unlabeled data only | What structure exists in this data? | Clustering, dimensionality reduction, anomaly detection |
| Reinforcement | Reward signals from environment | What sequence of actions maximizes reward? | Games, robotics, LLM alignment (RLHF) |
| Self-supervised | Labels derived from data itself | Predict masked/future parts of input | LLMs (next-token), BERT (masked tokens), MAE (image patches) |
### Self-supervised learning is the foundation of LLMs
GPT, BERT, and all modern LLMs use self-supervised learning: the training "labels" come from the data itself (predict the next word, predict a masked word). No human labeling needed — the entire internet is a self-labeled training set. This is why such massive scale was achievable.
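A toy sketch of how the "labels" fall out of the data itself in next-token prediction: training pairs are built directly from a raw token sequence, with no annotation step (the helper name is illustrative):

```python
def next_token_pairs(tokens):
    """Derive (context, target) training pairs from raw text alone:
    the 'label' at each position is simply the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = next_token_pairs(sentence)
# e.g. the second pair is (["the", "cat"], "sat"): no human labeling involved
```

Real LLM pretraining does exactly this at scale, over token IDs rather than words, which is why any large text corpus doubles as a training set.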
## The ML workflow
A production ML project is rarely straightforward. Here is the standard iterative workflow:
- Problem definition — what are you predicting? What metric matters? What data do you have? What's the cost of false positives vs false negatives?
- Data collection & cleaning — remove duplicates, handle missing values, correct label errors. This step typically consumes 50–80% of total project time.
- Exploratory data analysis (EDA) — visualize distributions, identify correlations, spot outliers and anomalies, understand class balance.
- Feature engineering — transform raw data into informative numerical inputs. Often the highest-leverage step for classical ML.
- Model selection & training — start simple (logistic regression, gradient boosting) before trying complex models. Use cross-validation to compare.
- Evaluation — measure on held-out test set with task-appropriate metrics. Check performance on important data slices, not just overall.
- Deployment & monitoring — monitor for data drift, model degradation, and feedback loops. Retrain when performance drops.
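The model selection and evaluation steps above can be sketched end to end with scikit-learn; synthetic data stands in for a real, cleaned dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a collected and cleaned dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set BEFORE any modeling decisions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Model selection: start simple, compare candidates via cross-validation
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final evaluation on the untouched test set
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
```

The key discipline is that the test set is split off first and touched only once, at the end; all tuning happens inside cross-validation on the training portion.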
### Don't skip EDA
Jumping straight to modeling without thorough EDA is the most common beginner mistake. EDA reveals data quality issues, appropriate model families, and useful features — often making the difference between a model that works and one that doesn't.
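A minimal EDA pass along these lines, assuming a pandas DataFrame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative toy frame; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 31, 52],
    "income": [40_000, 55_000, 61_000, 61_000, 55_000, 120_000],
    "churned": [0, 0, 1, 0, 0, 1],
})

missing = df.isna().sum()                                   # missing values per column
dupes = int(df.duplicated().sum())                          # exact duplicate rows
class_balance = df["churned"].value_counts(normalize=True)  # class balance
corr = df[["age", "income"]].corr()                         # pairwise correlations
summary = df.describe()                                     # distributions, outlier hints
```

Even this handful of one-liners surfaces a missing value, a duplicate row, and class imbalance before any model is trained.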
## Classical ML vs deep learning
| Property | Classical ML (XGBoost, SVM, RF) | Deep Learning (Neural Nets) |
|---|---|---|
| Best data type | Tabular / structured | Images, text, audio, video |
| Data requirements | 100–10K labeled examples often enough | Usually needs 10K–1M+ examples |
| Feature engineering | Manual — critical to success | Automatic — learns from raw input |
| Training time | Seconds to minutes on CPU | Hours to weeks on GPU/TPU |
| Interpretability | High (SHAP values, feature importance) | Low (black box by default) |
| Inference speed | Very fast, runs on CPU | Can be slow; requires GPU for large models |
| Hyperparameter tuning | Moderate | Extensive (architecture, lr, regularization) |
### Practical decision rule: when to use each approach

```python
def choose_ml_approach(data_type: str, n_samples: int, need_interpretability: bool) -> str:
    """Rule of thumb for ML approach selection."""
    if data_type == "tabular":
        if need_interpretability:
            return "Logistic Regression or shallow Decision Tree"
        elif n_samples < 1000:
            return "XGBoost or Random Forest (few samples — no DL)"
        else:
            return "XGBoost / LightGBM — almost always best on tabular"
    elif data_type in ("text", "code"):
        return "Pretrained Transformer (fine-tune BERT/RoBERTa/LLaMA)"
    elif data_type == "image":
        if n_samples < 5000:
            return "Pretrained CNN or ViT (transfer learning)"
        else:
            return "Fine-tuned ViT or ResNet-50"
    elif data_type == "time_series":
        if n_samples < 10000:
            return "XGBoost with time-lag features"
        else:
            return "Temporal Fusion Transformer or N-BEATS"
    return "Start with XGBoost, escalate to DL if needed"
```

## Generalization: the core challenge
Generalization — performing well on unseen data — is the entire point of ML. A model that only works on its training data is useless. The generalization gap is defined as:

generalization gap = L_test − L_train

where L_train is the average loss on the training set (the empirical risk) and L_test is the loss on held-out data. A large positive gap means overfitting. A small gap with both losses high means underfitting. The ideal is a small gap with both losses low.
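The gap, test loss minus training loss, can be measured directly by comparing the two losses across model capacities. A sketch with polynomial regression on noisy data (degrees and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80)  # noisy sine
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def fit_and_score(degree):
    """Return (train MSE, test MSE); their difference is the generalization gap."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    return (mean_squared_error(y_tr, model.predict(X_tr)),
            mean_squared_error(y_te, model.predict(X_te)))

tr_simple, te_simple = fit_and_score(1)    # low capacity: both losses tend to be high
tr_complex, te_complex = fit_and_score(15) # high capacity: train loss drops, gap typically widens
```

Because the degree-15 feature set contains the degree-1 features, the higher-capacity model is guaranteed to fit the training set at least as well; whether its test loss also improves is exactly what the gap measures.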
PAC learning theory (Valiant, 1984) provides theoretical bounds on generalization. For a finite hypothesis class H, with probability at least 1 − δ, the test error of an ERM model f satisfies:

L_test(f) ≤ L_train(f) + √((ln|H| + ln(1/δ)) / (2n))

Here n is the training set size and |H| is the hypothesis class complexity. The gap term shrinks as n grows: more data always tightens the bound, which is the theoretical foundation for why big datasets matter.
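The PAC gap term √((ln|H| + ln(1/δ)) / (2n)) can be evaluated numerically to see how it scales with data (the values plugged in are illustrative):

```python
import math

def pac_gap_bound(n: int, hypothesis_count: int, delta: float = 0.05) -> float:
    """Hoeffding-style PAC bound on the generalization gap for a
    finite hypothesis class: sqrt((ln|H| + ln(1/delta)) / (2n))."""
    return math.sqrt((math.log(hypothesis_count) + math.log(1 / delta)) / (2 * n))

# For fixed |H| and delta, the bound shrinks as 1/sqrt(n)
b_small = pac_gap_bound(n=1_000, hypothesis_count=10**6)
b_large = pac_gap_bound(n=100_000, hypothesis_count=10**6)
```

Since the bound scales as 1/√n, quadrupling the training set halves the guaranteed gap, an exact illustration of "more data always improves the bound".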
### Distribution shift in deployment
Generalization in practice means more than test set performance. Distribution shift — when production data differs from training data — is the most common cause of ML system failures. Examples: a fraud model trained on 2020 spending patterns fails in 2024. A medical model trained on one hospital's data fails on another's. Always evaluate on data from the actual deployment distribution.
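One common way to catch such shift in production is to compare feature distributions between a training reference window and live data. A sketch using the two-sample Kolmogorov–Smirnov test from SciPy (the data and alert threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature as seen in training
live_feature = rng.normal(loc=0.5, scale=1.0, size=5_000)   # mean-shifted production data

# Two-sample KS test: a small p-value suggests the distributions differ
stat, p_value = ks_2samp(train_feature, live_feature)
drift_detected = p_value < 0.01  # illustrative alert threshold
```

In a real monitoring pipeline this check would run per feature on a schedule, with shifted features triggering investigation or retraining.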
## Practice questions
- What is the fundamental goal of machine learning formally stated? (Answer: A computer program learns from experience E with respect to task T, measured by performance P, if P on T improves with E (Mitchell 1997). Formal objective: find a function f: X → Y (hypothesis) from hypothesis space H that minimises expected loss L = E[ℓ(f(x), y)] over the true data distribution P(X,Y). Since we can't access P directly, we minimise empirical risk: (1/n)Σℓ(f(xᵢ), yᵢ) over training data and generalise via regularisation.)
- What is the difference between a model, a hypothesis, and a parameter in ML? (Answer: Model (architecture): the family of functions the learning algorithm can produce (e.g., linear models, neural networks, decision trees). Defines the hypothesis space H. Hypothesis: a specific function f_θ ∈ H selected by training — the model with specific parameter values. Parameter θ: the numerical values learned during training (weights, biases, split thresholds). Training finds θ* = argmin_θ L(θ). The model is the structure; the hypothesis is the structure + specific learned parameters.)
- What is the no-free-lunch theorem and what does it mean practically? (Answer: No-free-lunch theorem (Wolpert & Macready): no ML algorithm performs better than random on ALL possible problems — averaged over all possible data distributions. Any algorithm that outperforms on some problems must underperform on others. Practical meaning: there is no universally best ML algorithm. You must choose algorithms based on domain knowledge and assumptions about your data. This is why practitioners try multiple models rather than always using one; domain-specific inductive biases (CNNs for images, transformers for text) exploit assumptions about the problem structure.)
- What is the bias-variance trade-off and how does it manifest in model selection? (Answer: Total error = Bias² + Variance + Irreducible noise. Bias: systematic error from wrong assumptions in the model (underfitting). Variance: sensitivity to training data noise (overfitting). High bias (simple model, e.g., linear regression on non-linear data): consistent but systematically wrong predictions. High variance (complex model, e.g., depth-20 decision tree): accurate on training data but wildly different on test data. Model selection navigates this: polynomial degree, regularisation strength, tree depth, neural network size all trade off bias vs variance.)
- What is the difference between a generative model and a discriminative model? (Answer: Discriminative: models P(Y|X) directly — the conditional probability of class given features. Directly optimised for the classification/regression task. Examples: logistic regression, SVM, neural classifiers. Generative: models P(X,Y) = P(X|Y)P(Y) — the joint distribution. Can generate new examples and handle missing features. Examples: Naive Bayes, GMM, HMM, VAE, GAN. Discriminative models usually achieve better classification accuracy (optimised for the task); generative models offer broader capabilities (sampling, density estimation, imputation).)
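To make the last distinction concrete, a sketch that trains one model of each kind on the same synthetic data (scikit-learn; the dataset parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression  # discriminative: models P(Y|X) directly
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB           # generative: models P(X|Y)P(Y)

X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

disc_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
gen_acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
# Both solve the same classification task; only the generative model also
# estimates per-class feature distributions (means/variances) it could sample from.
```

Both reach reasonable accuracy here; the practical difference shows up in what else the model can do (sampling, handling missing features) rather than in the fit/predict interface.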