
Unsupervised Learning

Finding hidden structure in unlabeled data.


Definition

Unsupervised learning finds patterns, structures, and relationships in data without using labeled examples. Without being told what the 'right answer' is, algorithms discover natural groupings, reduce dimensionality, or identify anomalies purely from the statistical properties of the data itself.

Clustering

Clustering assigns data points to groups (clusters) based on similarity — without using labels. The goal is high similarity within each cluster and low similarity between clusters. Different algorithms encode different notions of 'cluster':

K-Means objective: minimize total within-cluster sum of squared distances to cluster centroids μ_k. This is NP-hard in general — K-Means uses Lloyd's algorithm (greedy iterative refinement) which converges to a local optimum.
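Lloyd's algorithm is short enough to sketch directly in NumPy. This is a minimal illustration of the alternate-assign-and-update loop, not a replacement for sklearn.cluster.KMeans (which adds k-means++ initialization and multiple restarts):

```python
import numpy as np

def lloyds_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid's cluster
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # fixed point reached: a local optimum of the WCSS objective
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only decrease the within-cluster sum of squares, which is why the loop converges — but only to a local optimum, so production implementations restart from several initializations.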

Comparing K-Means, DBSCAN, and Hierarchical Clustering

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Dataset 1: spherical clusters → K-Means works well
X_blobs, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Dataset 2: arbitrary shapes → K-Means fails, DBSCAN wins
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

# ── K-Means ───────────────────────────────────────────────────────
km = KMeans(n_clusters=4, n_init=10, random_state=42)
km.fit(X_blobs)
print(f"K-Means silhouette (blobs): {silhouette_score(X_blobs, km.labels_):.3f}")

# ── DBSCAN ────────────────────────────────────────────────────────
# eps = neighborhood radius; min_samples = min points to form a core point
db = DBSCAN(eps=0.3, min_samples=10)
db.fit(X_moons)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise    = list(db.labels_).count(-1)
print(f"DBSCAN clusters found: {n_clusters}, noise points: {n_noise}")

# ── Hierarchical (Agglomerative) ──────────────────────────────────
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
hc.fit(X_blobs)
print(f"Hierarchical silhouette: {silhouette_score(X_blobs, hc.labels_):.3f}")

| Algorithm | Cluster shape | Must specify k? | Handles outliers? | Scale |
|---|---|---|---|---|
| K-Means | Spherical, equal-size | Yes | No (outliers distort centroids) | O(n·k·iter) — fast |
| DBSCAN | Arbitrary | No (uses eps/min_samples) | Yes (marks as noise, label=-1) | O(n·log n) with index |
| Hierarchical | Arbitrary | Yes (post-hoc from dendrogram) | No | O(n²) — slow for large n |
| Gaussian Mixture (GMM) | Ellipsoidal | Yes | Soft assignments | O(n·k·d²) per EM step |

Dimensionality reduction

High-dimensional data is difficult to visualize, store, and model — the 'curse of dimensionality' means that as dimension grows, pairwise distances concentrate and nearest-neighbor comparisons become uninformative. Dimensionality reduction projects data into fewer dimensions while preserving important structure:
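The distance-concentration effect is easy to demonstrate. This toy experiment draws uniform random points and measures the relative contrast between the nearest and farthest neighbor of one point; the exact values vary with the sample, but the contrast shrinks sharply as dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d, n=500):
    """Relative gap between farthest and nearest neighbor of point 0."""
    X = rng.uniform(size=(n, d))
    dist = np.linalg.norm(X[1:] - X[0], axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative distance contrast = {distance_contrast(d):.2f}")
```

In 2D the nearest neighbor is many times closer than the farthest; by d=1000 all points sit at nearly the same distance, which is what makes raw distance-based methods struggle in high dimensions.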

PCA: linear projection W is chosen so that the k principal components (columns of W) maximize the variance of the projected data. Equivalent to taking the top-k eigenvectors of the covariance matrix of the mean-centered data, Σ = X^T X / n.
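That eigenvector equivalence can be checked directly in NumPy. A small sketch on synthetic correlated data (note that np.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xc = X - X.mean(axis=0)                                  # PCA requires centering

# Covariance matrix of the centered data
cov = Xc.T @ Xc / len(Xc)

# Top-k eigenvectors of the covariance are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                     # top-2 components

Z = Xc @ W                                    # data projected to 2D
print(Z.shape)                                # (200, 2)
```

The variance of the first projected coordinate equals the largest eigenvalue, the second equals the next largest — exactly the 'maximize variance' characterization above.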

PCA and t-SNE for visualizing high-dimensional embedding spaces

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Example: visualize 1536-dim text embeddings in 2D
# (Replace with your actual embeddings)
embeddings = np.random.randn(500, 1536)   # 500 embeddings, dim=1536
labels = np.random.randint(0, 5, 500)    # 5 topic clusters

# ── Step 1: Reduce to 50D with PCA (fast pre-processing for t-SNE) ──
pca = PCA(n_components=50)
embeddings_50d = pca.fit_transform(embeddings)
print(f"Variance explained by 50 PCs: {pca.explained_variance_ratio_.sum():.1%}")

# ── Step 2: t-SNE for 2D visualization ──────────────────────────────
tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='pca')
# (the iteration-count kwarg was renamed n_iter → max_iter in scikit-learn 1.5;
#  the default of 1000 iterations is used here to stay version-safe)
embeddings_2d = tsne.fit_transform(embeddings_50d)

# ── Alternative: UMAP (faster, preserves global structure better) ────
# import umap
# reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
# embeddings_2d_umap = reducer.fit_transform(embeddings)

print(f"t-SNE output shape: {embeddings_2d.shape}")  # (500, 2)

| Method | Type | Preserves | Best for | Speed |
|---|---|---|---|---|
| PCA | Linear | Global variance, distances (approx) | Pre-processing, interpretable features | Very fast O(min(n,d)²·max(n,d)) |
| t-SNE | Nonlinear | Local structure (neighborhoods) | 2D/3D visualization | Slow O(n²) or O(n log n) |
| UMAP | Nonlinear | Local + global structure | Visualization + general purpose | Fast O(n^1.14) |
| Autoencoder | Nonlinear (learned) | Task-relevant features | Learned representations, generation | GPU needed for large models |

Anomaly detection

Anomaly detection identifies data points that deviate significantly from the norm — typically without labeled examples of anomalies (since anomalies are rare and hard to collect labels for). The key challenge is defining 'normal' purely from normal data:

Three anomaly detection algorithms compared

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report

# Generate data: mostly normal with 5% anomalies
X_normal, _ = make_blobs(n_samples=950, centers=1, cluster_std=0.5, random_state=42)
X_anomaly = np.random.uniform(-4, 4, size=(50, 2))  # outliers
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*950 + [-1]*50)   # 1=normal, -1=anomaly

# ── Isolation Forest ──────────────────────────────────────────────
# Anomalies are easier to isolate → shorter average path length in random trees
iso = IsolationForest(contamination=0.05, random_state=42)
y_iso = iso.fit_predict(X)
print("Isolation Forest:")
print(classification_report(y_true, y_iso, target_names=["anomaly","normal"]))

# ── One-Class SVM ──────────────────────────────────────────────────
oc_svm = OneClassSVM(kernel='rbf', nu=0.05)  # nu ≈ fraction of outliers
y_svm = oc_svm.fit_predict(X)

# ── Local Outlier Factor ──────────────────────────────────────────
# Compares local density of a point to its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_lof = lof.fit_predict(X)

| Algorithm | Key idea | Best for | Limitation |
|---|---|---|---|
| Isolation Forest | Anomalies are isolated by fewer random splits | High-dimensional tabular data; fast | Struggles with very small anomaly clusters |
| One-Class SVM | Learn boundary around normal data | Low-dimensional, well-defined normal region | Slow at scale; sensitive to hyperparameters |
| Local Outlier Factor (LOF) | Compare local density to neighbors | Clustered data with varying density | Not scalable for >100K points |
| Autoencoder | Normal data reconstructs well; anomalies don't | High-dimensional, complex data (images, sequences) | Needs training data; threshold selection hard |

Autoencoder reconstruction error

Train an autoencoder on normal data only. At inference, compute the reconstruction error ||x - decode(encode(x))||². Anomalies, being out-of-distribution, reconstruct poorly and have high error. Set a threshold on the 95th–99th percentile of training reconstruction errors.
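A minimal sketch of this recipe, using scikit-learn's MLPRegressor as a stand-in bottleneck autoencoder trained to reproduce its own input. The data, the 8-unit bottleneck, and the 99th-percentile threshold are all illustrative assumptions, not a production setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(1000, 20))              # normal data only
X_test = np.vstack([rng.normal(0, 1, size=(95, 20)),     # 95 normal points
                    rng.uniform(-6, 6, size=(5, 20))])   # 5 out-of-distribution

# MLP trained to map x -> x through a small (8-unit) bottleneck
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
ae.fit(X_train, X_train)

def recon_error(model, X):
    """Per-sample mean squared reconstruction error ||x - x_hat||^2 / d."""
    return np.mean((X - model.predict(X)) ** 2, axis=1)

# Threshold set from the 99th percentile of TRAINING reconstruction error
threshold = np.percentile(recon_error(ae, X_train), 99)
is_anomaly = recon_error(ae, X_test) > threshold
print(f"Flagged {is_anomaly.sum()} of {len(X_test)} test points as anomalous")
```

Because the model only ever saw standard-normal data, the uniform(-6, 6) points reconstruct poorly and land above the threshold, while most normal test points land below it.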

Self-supervised learning — bridging supervised and unsupervised

Self-supervised learning generates labels automatically from the structure of unlabeled data — providing the scalability of unsupervised learning with the performance of supervised learning. All modern LLMs are built on self-supervised pretraining objectives:

| Objective | How labels are generated | Model | Domain |
|---|---|---|---|
| Next-token prediction | Predict next word from preceding words | GPT, LLaMA, Claude (base) | Text generation, LLMs |
| Masked Language Modeling | Predict randomly masked tokens from context | BERT, RoBERTa, DeBERTa | Text understanding, embeddings |
| Masked Image Modeling | Reconstruct randomly masked image patches | MAE (Masked Autoencoder), BEiT | Vision, image features |
| Contrastive learning | Same image under different augmentations should be similar | SimCLR, MoCo, CLIP | Vision, multimodal alignment |
| Next frame prediction | Predict next video frame from prior frames | VideoGPT, Sora (partial) | Video understanding |
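For next-token prediction, the 'labels' really are free: every prefix of a document yields one (context, target) training pair. A toy illustration with a hypothetical six-token document:

```python
# Next-token prediction: labels come "for free" from the sequence itself.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Every position yields one (context, target) training pair at zero labeling cost
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"{' '.join(context):>22}  ->  predict {target!r}")
```

A six-token document produces five supervised examples; a trillion-token corpus produces roughly a trillion, which is the scaling argument made below.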

Why self-supervised learning dominates

Human-labeled datasets are expensive to create and limited in scale. The entire internet provides billions of self-labeled examples for next-token prediction (every document predicts every word). This is why LLMs can scale to 15 trillion training tokens — something impossible with human annotation.

Generative unsupervised models

Generative models learn the data distribution p(x) and can sample new examples from it. Three major families dominate in 2025:

| Model family | Training objective | Sampling | Quality | Stability | Examples |
|---|---|---|---|---|---|
| VAE | Maximize ELBO = reconstruction − KL(q‖p) | Single forward pass (fast) | Blurry for images | Stable | First generative DL models |
| GAN | Adversarial min-max game | Single forward pass (fast) | High for specific domains | Unstable (mode collapse) | StyleGAN3, BigGAN |
| Diffusion | Predict noise at random timestep t | 20–1000 denoising steps | State-of-the-art | Very stable | Stable Diffusion, DALL-E 3, Midjourney |

VAE Evidence Lower Bound (ELBO): maximize reconstruction likelihood while keeping the learned latent distribution q(z|x) close to the prior p(z) = N(0,I). The KL term forces a smooth, continuous latent space.
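Written out, the bound being maximized is:

```latex
\log p_\theta(x) \;\ge\; \mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
  \;-\;
  \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x)\,\middle\|\,p(z)\right)}_{\text{regularization toward } p(z)=\mathcal{N}(0, I)}
```

The first term rewards faithful reconstruction through the decoder; the second pulls every encoded distribution toward the standard normal prior, which is what makes the latent space smooth enough to sample from.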

Why diffusion models won

Diffusion models (DDPM, 2020) achieved stable training and diverse, high-quality generation — solving GAN's mode collapse and VAE's blurriness problems simultaneously. By 2023, diffusion models (Stable Diffusion, Midjourney, DALL-E 3) had largely displaced GANs for state-of-the-art image generation. The tradeoff: sampling requires many denoising steps (slower than GAN's single pass).

Practice questions

  1. What is the difference between clustering, density estimation, and dimensionality reduction as unsupervised learning tasks? (Answer: Clustering: assign each point to a group (cluster) — hard or soft assignment. Reveals grouping structure. K-means, DBSCAN, hierarchical clustering. Density estimation: model the probability distribution P(x) — generate new samples, detect outliers. GMM, KDE, VAE, normalising flows. Dimensionality reduction: find a low-dimensional representation that preserves structure. PCA, t-SNE, UMAP, autoencoders. All are unsupervised: no labels. They address different questions: who belongs together? what is likely? what is the compact representation?)
  2. What is the elbow method for choosing k in k-means and what are its limitations? (Answer: Elbow method: run k-means for k=1,...,K, plot within-cluster sum of squares (WCSS) vs k. Look for the 'elbow' — where WCSS stops decreasing sharply and starts diminishing gradually. The elbow suggests the optimal k. Limitations: (1) Many real datasets show no sharp elbow — the curve is smooth. (2) WCSS decreases monotonically as k grows, so the metric alone never favors a smaller k. (3) k-means assumes spherical clusters of similar size. Better alternatives: silhouette score (measures cluster cohesion and separation); BIC/AIC for model selection; gap statistic for comparison against a reference distribution.)
  3. What is the difference between PCA and autoencoders for dimensionality reduction? (Answer: PCA: linear dimensionality reduction. Finds orthogonal directions of maximum variance. Closed-form solution (SVD). Reconstruction = linear transformation. Best for Gaussian-distributed data, when linear structure captures most variance. Autoencoder: non-linear dimensionality reduction via neural encoder-decoder. Can learn complex, curved manifolds that PCA cannot represent. Requires training. Better for complex high-dimensional data (images, text). Trade-off: PCA is interpretable, fast, exact; autoencoders are powerful but black-box, require more data.)
  4. What is the inductive bias of k-means and why does it fail for non-spherical clusters? (Answer: K-means minimises within-cluster L2 distance — implicitly assumes clusters are spherical (Voronoi cells are hyperplane boundaries — equal covariance). Fails for: elongated clusters (distance to centroid misleadingly large along minor axis), varying-density clusters, non-convex clusters. Alternatives: GMM (allows elliptical clusters via full covariance matrices), DBSCAN (density-based — no shape assumption), spectral clustering (graph Laplacian eigenvectors — handles non-convex shapes).)
  5. What is contrastive learning and how has it enabled self-supervised pretraining? (Answer: Contrastive learning: learn representations by attracting augmented views of the same input ('positives') and repelling different inputs ('negatives') in embedding space. No labels needed — augmentation (crop, flip, colour jitter) provides positive pairs. Loss: InfoNCE = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ)). SimCLR, MoCo, CLIP use this principle. Achieves near-supervised ImageNet performance. CLIP: uses image-text pairs as positives — one of the most impactful self-supervised methods, enabling zero-shot classification.)
