Hierarchical clustering builds a nested hierarchy of clusters, represented as a dendrogram (tree). Agglomerative (bottom-up): start with each point as its own cluster and repeatedly merge the two most similar clusters until only one remains. Divisive (top-down): start with all points in one cluster and recursively split into smaller clusters. The key design decision is the linkage criterion — how to measure the distance between two clusters. Unlike K-Means, hierarchical clustering does not require specifying K beforehand: you can cut the dendrogram at any level afterwards. GATE tests linkage types and complexity.
Real-life analogy: The family tree
Imagine organising people into a family tree. At the bottom are individual people. Moving up, siblings merge into a family unit. Families merge into extended families. Extended families merge into clans. Clans merge into a single human race at the top. Agglomerative clustering builds this bottom-up. A dendrogram visualises the entire tree — you can cut at any height to get any number of clusters you want.
Agglomerative clustering algorithm
- Start: n clusters, one per data point.
- Compute distance matrix: D(i,j) = distance between cluster i and cluster j.
- Merge: Find the pair of clusters with minimum distance. Merge them into one cluster.
- Update: Recompute distances from the new merged cluster to all others using the linkage criterion.
- Repeat the merge and update steps until one cluster remains. Record the merge history as a dendrogram.
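The loop above can be sketched from scratch. This is a minimal naive O(n³) implementation (function and variable names here are illustrative, not from any library) supporting single and complete linkage:

```python
import numpy as np

def naive_agglomerative(X, n_clusters, linkage="single"):
    """Naive agglomerative clustering; returns one integer label per point."""
    clusters = [[i] for i in range(len(X))]                   # start: one cluster per point
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)      # pairwise point distances

    def cluster_dist(a, b):
        d = D[np.ix_(a, b)]                                   # distances between the two groups
        return d.min() if linkage == "single" else d.max()

    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
labels = naive_agglomerative(X, 3)
print(labels)  # → [0 0 1 1 2]
```

The two close pairs merge first, leaving the isolated point as its own cluster. The inner `min` over all cluster pairs is what makes the naive version cubic.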
| Linkage | Distance formula | Properties | When to use |
|---|---|---|---|
| Single linkage | min distance between any two points in clusters | Chaining effect — elongated clusters | Data with chains or non-spherical shapes |
| Complete linkage | max distance between any two points in clusters | Compact, spherical clusters | Well-separated, similar-sized clusters |
| Average linkage (UPGMA) | mean distance between all pairs | Compromise between single and complete | General purpose — often best |
| Ward linkage | increase in total WCSS from merging | Minimises variance — spherical clusters | Similar to K-Means; most commonly used |
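The first three formulas in the table can be checked by hand on two tiny clusters — a short sketch using NumPy (the cluster coordinates are made up for illustration):

```python
import numpy as np

A = np.array([[0., 0.], [0., 1.]])   # cluster A
B = np.array([[3., 0.], [4., 0.]])   # cluster B

# all pairwise distances between a point in A and a point in B
pair = np.linalg.norm(A[:, None] - B[None, :], axis=-1)

print("single  :", pair.min())    # min over all pairs → 3.0
print("complete:", pair.max())    # max over all pairs → sqrt(17) ≈ 4.123
print("average :", pair.mean())   # mean over all pairs (UPGMA) ≈ 3.571
```

Ward linkage is different in kind: it is not a min/max/mean over pairwise distances but the increase in total WCSS that the merge would cause.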
Agglomerative clustering with dendrogram
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=30, centers=3,
                       cluster_std=0.8, random_state=42)

# ── scipy dendrogram (full hierarchy) ──
# Linkage types: 'single', 'complete', 'average', 'ward'
Z_ward = linkage(X, method='ward')
Z_single = linkage(X, method='single')
Z_complete = linkage(X, method='complete')
print("Ward linkage merging sequence (last 5):")
print(Z_ward[-5:])  # Columns: [cluster_i, cluster_j, distance, new_size]

# ── sklearn AgglomerativeClustering ──
for linkage_type in ['ward', 'complete', 'average', 'single']:
    # metric defaults to 'euclidean' — the only metric Ward supports
    agg = AgglomerativeClustering(n_clusters=3, linkage=linkage_type)
    labels = agg.fit_predict(X)
    # Purity (when true labels are known): for each cluster, count its
    # majority true label; np.bincount is robust across scipy versions,
    # unlike the changing scipy.stats.mode return format
    purity = sum(np.bincount(y_true[labels == k]).max() for k in range(3)) / len(X)
    print(f"{linkage_type:<10} purity: {purity:.3f}")
Cutting the dendrogram and comparing with K-Means
The dendrogram's y-axis shows the merge distance. Cutting horizontally at height h gives clusters — the number of vertical lines crossed equals the number of clusters at that cut. This is the key advantage: no need to specify K. You decide the resolution after seeing the full hierarchy.
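Cutting at a chosen height — or asking directly for a given number of clusters — is done with scipy's fcluster. A short sketch on the same make_blobs data as above (the cut height 10.0 is an arbitrary illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=42)
Z = linkage(X, method='ward')

labels_h = fcluster(Z, t=10.0, criterion='distance')  # cut at height h = 10.0
labels_k = fcluster(Z, t=3, criterion='maxclust')     # or ask for exactly 3 clusters

print("clusters at h=10.0:", len(np.unique(labels_h)))
print("clusters at k=3   :", len(np.unique(labels_k)))
```

Note that fcluster labels are 1-based, unlike sklearn's 0-based labels. The same fitted hierarchy serves every cut, so exploring different resolutions costs nothing extra.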
| Property | K-Means | Agglomerative (Ward) | DBSCAN |
|---|---|---|---|
| K required upfront? | Yes | No (post-hoc cut) | No |
| Cluster shape | Spherical only | Spherical (Ward) | Arbitrary |
| Handles outliers | No | Partially | Yes (noise class) |
| Time complexity | O(nKdT) | O(n² log n) Ward; O(n³) naive | O(n log n) with spatial index |
| Interpretability | Centroid description | Dendrogram | Core/border/noise distinction |
GATE key: Single linkage chaining
Single linkage suffers from the chaining effect: if two clusters have just one pair of close points, they merge regardless of how far apart the rest are. This creates long, thin, snake-like clusters that are often not meaningful. Complete linkage avoids this by requiring ALL points in the two clusters to be close before merging.
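The chaining effect is easy to reproduce: two tight blobs joined by a thin bridge of points. Single linkage walks along the bridge, so when asked for two clusters it cuts at a blob–bridge gap and keeps the bridge attached to one blob; complete linkage, by contrast, splits within the bridge (a small sketch — the data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
left   = rng.normal([0, 0], 0.1, size=(20, 2))    # tight blob on the left
right  = rng.normal([10, 0], 0.1, size=(20, 2))   # tight blob on the right
bridge = np.column_stack([np.linspace(1, 9, 15),  # thin chain of 15 points
                          np.zeros(15)])          # connecting the two blobs
X = np.vstack([left, bridge, right])              # 55 points total

sizes = {}
for link in ['single', 'complete']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    sizes[link] = sorted(np.bincount(labels).tolist())
    print(link, sizes[link])
```

Single linkage yields sizes [20, 35]: one blob alone versus the other blob plus the whole bridge, because the largest nearest-neighbour gap is at a blob–bridge boundary. Complete linkage produces a more balanced split since it penalises the large max-distance of the chained cluster.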
Practice questions (GATE-style)
- What is the time complexity of naive agglomerative hierarchical clustering? (Answer: O(n³) — at each of n steps, compute all O(n²) pairwise distances. Optimised implementations (Ward with heap) achieve O(n² log n).)
- In complete linkage, the distance between two clusters is: (Answer: The maximum distance between any point in cluster A and any point in cluster B — the "furthest neighbour" approach.)
- Agglomerative clustering with Ward linkage minimises: (Answer: The increase in total within-cluster sum of squares (WCSS) at each merge — it minimises variance and tends to produce compact, spherical clusters.)
- A dendrogram is cut at height h=5.0, revealing 4 vertical line crossings. How many clusters result? (Answer: 4 clusters — each crossing corresponds to one cluster at that cut level.)
- When would you choose hierarchical over K-Means clustering? (Answer: When K is unknown in advance, when you need interpretable cluster merging history, when data is small (n<10000), or when the cluster shape is non-spherical and you use single-linkage.)
On LumiChats
Hierarchical clustering is used in NLP for concept taxonomy building: clustering word embeddings hierarchically creates a tree of related concepts. LumiChats uses similar ideas to organise topics within long documents for better RAG retrieval.