
Unsupervised Learning

Finding hidden structure in unlabeled data.


Definition

Unsupervised learning finds patterns, structures, and relationships in data without using labeled examples. Without being told what the 'right answer' is, algorithms discover natural groupings, reduce dimensionality, or identify anomalies purely from the statistical properties of the data itself.

Clustering

Clustering assigns data points to groups (clusters) based on similarity — without using labels. The goal is high similarity within each cluster and low similarity between clusters. Different algorithms encode different notions of 'cluster':

K-Means objective: minimize total within-cluster sum of squared distances to cluster centroids μ_k. This is NP-hard in general — K-Means uses Lloyd's algorithm (greedy iterative refinement) which converges to a local optimum.
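Lloyd's algorithm is short enough to sketch directly in NumPy. This is a minimal illustration of the alternate-assign-and-update loop, not a replacement for sklearn.cluster.KMeans (which adds k-means++ initialization and multiple restarts):

```python
import numpy as np

def lloyds_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid's cluster
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # fixed point reached: a local optimum of the WCSS objective
        centroids = new_centroids
    return centroids, labels
```

Each iteration can only decrease the within-cluster sum of squares, which is why the loop converges — but only to a local optimum, so production implementations restart from several initializations.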

Comparing K-Means, DBSCAN, and Hierarchical Clustering

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs, make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Dataset 1: spherical clusters → K-Means works well
X_blobs, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Dataset 2: arbitrary shapes → K-Means fails, DBSCAN wins
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

# ── K-Means ───────────────────────────────────────────────────────
km = KMeans(n_clusters=4, n_init=10, random_state=42)
km.fit(X_blobs)
print(f"K-Means silhouette (blobs): {silhouette_score(X_blobs, km.labels_):.3f}")

# ── DBSCAN ────────────────────────────────────────────────────────
# eps = neighborhood radius; min_samples = min points to form a core point
db = DBSCAN(eps=0.3, min_samples=10)
db.fit(X_moons)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise    = list(db.labels_).count(-1)
print(f"DBSCAN clusters found: {n_clusters}, noise points: {n_noise}")

# ── Hierarchical (Agglomerative) ──────────────────────────────────
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
hc.fit(X_blobs)
print(f"Hierarchical silhouette: {silhouette_score(X_blobs, hc.labels_):.3f}")

| Algorithm | Cluster shape | Must specify k? | Handles outliers? | Scale |
|---|---|---|---|---|
| K-Means | Spherical, equal-size | Yes | No (outliers distort centroids) | O(n·k·iter) — fast |
| DBSCAN | Arbitrary | No (uses eps/min_samples) | Yes (marks as noise, label=-1) | O(n·log n) with index |
| Hierarchical | Arbitrary | Yes (post-hoc from dendrogram) | No | O(n²) — slow for large n |
| Gaussian Mixture (GMM) | Ellipsoidal | Yes | Soft assignments | O(n·k·d²) per EM step |

Dimensionality reduction

High-dimensional data is difficult to visualize, store, and model — the 'curse of dimensionality' means that as dimension grows, pairwise distances concentrate and nearest-neighbor comparisons become uninformative. Dimensionality reduction projects data into fewer dimensions while preserving important structure:
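The distance-concentration effect is easy to demonstrate. This toy experiment draws uniform random points and measures the relative contrast between the nearest and farthest neighbor of one point; the exact values vary with the sample, but the contrast shrinks sharply as dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(d, n=500):
    """Relative gap between farthest and nearest neighbor of point 0."""
    X = rng.uniform(size=(n, d))
    dist = np.linalg.norm(X[1:] - X[0], axis=1)
    return (dist.max() - dist.min()) / dist.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative distance contrast = {distance_contrast(d):.2f}")
```

In 2D the nearest neighbor is many times closer than the farthest; by d=1000 all points sit at nearly the same distance, which is what makes raw distance-based methods struggle in high dimensions.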

PCA: linear projection W is chosen so that the k principal components (columns of W) maximize the variance of the projected data. Equivalent to taking the top-k eigenvectors of the covariance matrix of the mean-centered data, Σ = X^T X / n.
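That eigenvector equivalence can be checked directly in NumPy. A small sketch on synthetic correlated data (note that np.linalg.eigh returns eigenvalues in ascending order, hence the reordering):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xc = X - X.mean(axis=0)                                  # PCA requires centering

# Covariance matrix of the centered data
cov = Xc.T @ Xc / len(Xc)

# Top-k eigenvectors of the covariance are the principal components
eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalue order
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]                     # top-2 components

Z = Xc @ W                                    # data projected to 2D
print(Z.shape)                                # (200, 2)
```

The variance of the first projected coordinate equals the largest eigenvalue, the second equals the next largest — exactly the 'maximize variance' characterization above.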

PCA and t-SNE for visualizing high-dimensional embedding spaces

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Example: visualize 1536-dim text embeddings in 2D
# (Replace with your actual embeddings)
embeddings = np.random.randn(500, 1536)   # 500 embeddings, dim=1536
labels = np.random.randint(0, 5, 500)    # 5 topic clusters

# ── Step 1: Reduce to 50D with PCA (fast pre-processing for t-SNE) ──
pca = PCA(n_components=50)
embeddings_50d = pca.fit_transform(embeddings)
print(f"Variance explained by 50 PCs: {pca.explained_variance_ratio_.sum():.1%}")

# ── Step 2: t-SNE for 2D visualization ──────────────────────────────
tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='pca')
# (the iteration-count kwarg was renamed n_iter → max_iter in scikit-learn 1.5;
#  the default of 1000 iterations is used here to stay version-safe)
embeddings_2d = tsne.fit_transform(embeddings_50d)

# ── Alternative: UMAP (faster, preserves global structure better) ────
# import umap
# reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
# embeddings_2d_umap = reducer.fit_transform(embeddings)

print(f"t-SNE output shape: {embeddings_2d.shape}")  # (500, 2)

| Method | Type | Preserves | Best for | Speed |
|---|---|---|---|---|
| PCA | Linear | Global variance, distances (approx) | Pre-processing, interpretable features | Very fast O(min(n,d)²·max(n,d)) |
| t-SNE | Nonlinear | Local structure (neighborhoods) | 2D/3D visualization | Slow O(n²) or O(n log n) |
| UMAP | Nonlinear | Local + global structure | Visualization + general purpose | Fast O(n^1.14) |
| Autoencoder | Nonlinear (learned) | Task-relevant features | Learned representations, generation | GPU needed for large models |

Anomaly detection

Anomaly detection identifies data points that deviate significantly from the norm — typically without labeled examples of anomalies (since anomalies are rare and hard to collect labels for). The key challenge is defining 'normal' purely from normal data:

Three anomaly detection algorithms compared

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report

# Generate data: mostly normal with 5% anomalies
X_normal, _ = make_blobs(n_samples=950, centers=1, cluster_std=0.5, random_state=42)
X_anomaly = np.random.uniform(-4, 4, size=(50, 2))  # outliers
X = np.vstack([X_normal, X_anomaly])
y_true = np.array([1]*950 + [-1]*50)   # 1=normal, -1=anomaly

# ── Isolation Forest ──────────────────────────────────────────────
# Anomalies are easier to isolate → shorter average path length in random trees
iso = IsolationForest(contamination=0.05, random_state=42)
y_iso = iso.fit_predict(X)
print("Isolation Forest:")
print(classification_report(y_true, y_iso, target_names=["anomaly","normal"]))

# ── One-Class SVM ──────────────────────────────────────────────────
oc_svm = OneClassSVM(kernel='rbf', nu=0.05)  # nu ≈ fraction of outliers
y_svm = oc_svm.fit_predict(X)

# ── Local Outlier Factor ──────────────────────────────────────────
# Compares local density of a point to its neighbors
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_lof = lof.fit_predict(X)

| Algorithm | Key idea | Best for | Limitation |
|---|---|---|---|
| Isolation Forest | Anomalies are isolated by fewer random splits | High-dimensional tabular data; fast | Struggles with very small anomaly clusters |
| One-Class SVM | Learn boundary around normal data | Low-dimensional, well-defined normal region | Slow at scale; sensitive to hyperparameters |
| Local Outlier Factor (LOF) | Compare local density to neighbors | Clustered data with varying density | Not scalable for >100K points |
| Autoencoder | Normal data reconstructs well; anomalies don't | High-dimensional, complex data (images, sequences) | Needs training data; threshold selection hard |

Autoencoder reconstruction error

Train an autoencoder on normal data only. At inference, compute the reconstruction error ||x - decode(encode(x))||². Anomalies, being out-of-distribution, reconstruct poorly and have high error. Set a threshold on the 95th–99th percentile of training reconstruction errors.
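A minimal sketch of this recipe, using scikit-learn's MLPRegressor as a stand-in bottleneck autoencoder trained to reproduce its own input. The data, the 8-unit bottleneck, and the 99th-percentile threshold are all illustrative assumptions, not a production setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(1000, 20))              # normal data only
X_test = np.vstack([rng.normal(0, 1, size=(95, 20)),     # 95 normal points
                    rng.uniform(-6, 6, size=(5, 20))])   # 5 out-of-distribution

# MLP trained to map x -> x through a small (8-unit) bottleneck
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
ae.fit(X_train, X_train)

def recon_error(model, X):
    """Per-sample mean squared reconstruction error ||x - x_hat||^2 / d."""
    return np.mean((X - model.predict(X)) ** 2, axis=1)

# Threshold set from the 99th percentile of TRAINING reconstruction error
threshold = np.percentile(recon_error(ae, X_train), 99)
is_anomaly = recon_error(ae, X_test) > threshold
print(f"Flagged {is_anomaly.sum()} of {len(X_test)} test points as anomalous")
```

Because the model only ever saw standard-normal data, the uniform(-6, 6) points reconstruct poorly and land above the threshold, while most normal test points land below it.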

Self-supervised learning — bridging supervised and unsupervised

Self-supervised learning generates labels automatically from the structure of unlabeled data — providing the scalability of unsupervised learning with the performance of supervised learning. All modern LLMs are built on self-supervised pretraining objectives:

| Objective | How labels are generated | Model | Domain |
|---|---|---|---|
| Next-token prediction | Predict next word from preceding words | GPT, LLaMA, Claude (base) | Text generation, LLMs |
| Masked Language Modeling | Predict randomly masked tokens from context | BERT, RoBERTa, DeBERTa | Text understanding, embeddings |
| Masked Image Modeling | Reconstruct randomly masked image patches | MAE (Masked Autoencoder), BEiT | Vision, image features |
| Contrastive learning | Same image under different augmentations should be similar | SimCLR, MoCo, CLIP | Vision, multimodal alignment |
| Next frame prediction | Predict next video frame from prior frames | VideoGPT, Sora (partial) | Video understanding |
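For next-token prediction, the 'labels' really are free: every prefix of a document yields one (context, target) training pair. A toy illustration with a hypothetical six-token document:

```python
# Next-token prediction: labels come "for free" from the sequence itself.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Every position yields one (context, target) training pair at zero labeling cost
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(f"{' '.join(context):>22}  ->  predict {target!r}")
```

A six-token document produces five supervised examples; a trillion-token corpus produces roughly a trillion, which is the scaling argument made below.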

Why self-supervised learning dominates

Human-labeled datasets are expensive to create and limited in scale. The entire internet provides billions of self-labeled examples for next-token prediction (every document predicts every word). This is why LLMs can scale to 15 trillion training tokens — something impossible with human annotation.

Generative unsupervised models

Generative models learn the data distribution p(x) and can sample new examples from it. Three major families dominate in 2025:

| Model family | Training objective | Sampling | Quality | Stability | Examples |
|---|---|---|---|---|---|
| VAE | Maximize ELBO = reconstruction − KL(q‖p) | Single forward pass (fast) | Blurry for images | Stable | First generative DL models |
| GAN | Adversarial min-max game | Single forward pass (fast) | High for specific domains | Unstable (mode collapse) | StyleGAN3, BigGAN |
| Diffusion | Predict noise at random timestep t | 20–1000 denoising steps | State-of-the-art | Very stable | Stable Diffusion, DALL-E 3, Midjourney |

VAE Evidence Lower Bound (ELBO): maximize reconstruction likelihood while keeping the learned latent distribution q(z|x) close to the prior p(z) = N(0,I). The KL term forces a smooth, continuous latent space.
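Written out, the bound being maximized is:

```latex
\log p_\theta(x) \;\ge\; \mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
  \;-\;
  \underbrace{\mathrm{KL}\!\left(q_\phi(z \mid x)\,\middle\|\,p(z)\right)}_{\text{regularization toward } p(z)=\mathcal{N}(0, I)}
```

The first term rewards faithful reconstruction through the decoder; the second pulls every encoded distribution toward the standard normal prior, which is what makes the latent space smooth enough to sample from.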

Why diffusion models won

Diffusion models (DDPM, 2020) achieved stable training and diverse, high-quality generation — solving GAN's mode collapse and VAE's blurriness problems simultaneously. By 2023, diffusion models (Stable Diffusion, Midjourney, DALL-E 3) had largely displaced GANs for state-of-the-art image generation. The tradeoff: sampling requires many denoising steps (slower than GAN's single pass).

Practice questions

  1. What is the difference between clustering, density estimation, and dimensionality reduction as unsupervised learning tasks? (Answer: Clustering: assign each point to a group (cluster) — hard or soft assignment. Reveals grouping structure. K-means, DBSCAN, hierarchical clustering. Density estimation: model the probability distribution P(x) — generate new samples, detect outliers. GMM, KDE, VAE, normalising flows. Dimensionality reduction: find a low-dimensional representation that preserves structure. PCA, t-SNE, UMAP, autoencoders. All are unsupervised: no labels. They address different questions: who belongs together? what is likely? what is the compact representation?)
  2. What is the elbow method for choosing k in k-means and what are its limitations? (Answer: Elbow method: run k-means for k=1,...,K, plot within-cluster sum of squares (WCSS) vs k. Look for the 'elbow' — where WCSS stops decreasing sharply and starts diminishing gradually. The elbow suggests the optimal k. Limitations: (1) Many real datasets show no sharp elbow — the curve is smooth. (2) WCSS decreases monotonically as k grows, so the metric alone never favors a smaller k. (3) k-means assumes spherical clusters of similar size. Better alternatives: silhouette score (measures cluster cohesion and separation); BIC/AIC for model selection; gap statistic for comparison against a reference distribution.)
  3. What is the difference between PCA and autoencoders for dimensionality reduction? (Answer: PCA: linear dimensionality reduction. Finds orthogonal directions of maximum variance. Closed-form solution (SVD). Reconstruction = linear transformation. Best for Gaussian-distributed data, when linear structure captures most variance. Autoencoder: non-linear dimensionality reduction via neural encoder-decoder. Can learn complex, curved manifolds that PCA cannot represent. Requires training. Better for complex high-dimensional data (images, text). Trade-off: PCA is interpretable, fast, exact; autoencoders are powerful but black-box, require more data.)
  4. What is the inductive bias of k-means and why does it fail for non-spherical clusters? (Answer: K-means minimises within-cluster L2 distance — implicitly assumes clusters are spherical (Voronoi cells are hyperplane boundaries — equal covariance). Fails for: elongated clusters (distance to centroid misleadingly large along minor axis), varying-density clusters, non-convex clusters. Alternatives: GMM (allows elliptical clusters via full covariance matrices), DBSCAN (density-based — no shape assumption), spectral clustering (graph Laplacian eigenvectors — handles non-convex shapes).)
  5. What is contrastive learning and how has it enabled self-supervised pretraining? (Answer: Contrastive learning: learn representations by attracting augmented views of the same input ('positives') and repelling different inputs ('negatives') in embedding space. No labels needed — augmentation (crop, flip, colour jitter) provides positive pairs. Loss: InfoNCE = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ)). SimCLR, MoCo, CLIP use this principle. Achieves near-supervised ImageNet performance. CLIP: uses image-text pairs as positives — one of the most impactful self-supervised methods, enabling zero-shot classification.)
