K-Means is the most widely used unsupervised clustering algorithm. It partitions n data points into K clusters by iteratively assigning each point to its nearest cluster centroid and recomputing centroids. K-Medoid (PAM — Partitioning Around Medoids) is a more robust variant that uses actual data points as cluster centres (medoids) instead of means, making it resistant to outliers. Both are core GATE DS&AI topics — expect questions on convergence, the elbow method for choosing K, and complexity.
Real-life analogy: Organising files into folders
Imagine you have 1000 documents with no labels. You want to group them into 5 topic folders. K-Means does this: randomly place 5 folder labels across the documents. Assign every document to its nearest folder. Move each folder label to the centre of its documents. Repeat until folders stop moving. The algorithm converges to a stable grouping where each document is in its most similar folder.
K-Means algorithm step by step
- Initialise: Randomly select K data points as initial centroids (or use K-Means++ for smarter initialisation).
- Assignment step: Assign each data point xᵢ to the nearest centroid: cluster(i) = argmin_k ||xᵢ − μₖ||².
- Update step: Recompute each centroid as the mean of assigned points: μₖ = (1/|Cₖ|) Σ_{i∈Cₖ} xᵢ.
- Repeat the assignment and update steps until assignments no longer change (convergence) or the maximum number of iterations is reached.
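The assignment and update steps can be traced by hand on a toy 1-D dataset (the values and starting centroids below are illustrative, not from the source):

```python
import numpy as np

# Toy 1-D data and two illustrative starting centroids
X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([2.0, 11.0])

# Assignment step: each point goes to its nearest centroid
labels = np.argmin(np.abs(X[:, None] - centroids[None, :]), axis=1)
print(labels)  # [0 0 0 1 1 1]

# Update step: each centroid becomes the mean of its assigned points
centroids = np.array([X[labels == k].mean() for k in range(2)])
print(centroids)  # cluster means: 2.0 and 11.0
```

Here the centroids are already at the cluster means, so a second pass would leave the assignments unchanged and the algorithm would stop.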
K-Means objective: minimise the Within-Cluster Sum of Squares (WCSS). Each iteration either decreases WCSS or leaves it unchanged, and there are only finitely many possible partitions, so K-Means is guaranteed to converge. However, it may converge to a local minimum rather than the global one; running it multiple times with different initialisations and keeping the best result helps.
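The local-minimum behaviour can be observed directly. In this sketch (synthetic blobs, sklearn's KMeans with a single random initialisation per seed), the final WCSS varies with the starting centroids, and `n_init` restarts keep the best run:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# One run per seed with random initialisation: final WCSS depends on the start
inertias = [KMeans(n_clusters=4, init='random', n_init=1, random_state=s)
            .fit(X).inertia_ for s in range(10)]

# n_init=10 restarts internally and keeps the lowest-WCSS run
best = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X)
print(sorted(round(i, 1) for i in inertias))
print(round(best.inertia_, 1))
```

Seeds that land two centroids in the same blob finish with visibly higher WCSS, which is exactly the local-minimum trap the restarts mitigate.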
K-Means from scratch and sklearn with elbow method
```python
import numpy as np
import matplotlib
matplotlib.use('Agg')
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate 4-cluster data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.8, random_state=42)

# ── K-Means from scratch ──
class KMeansScratch:
    def __init__(self, k, max_iter=100):
        self.k, self.max_iter = k, max_iter

    def fit(self, X):
        # Random initialisation
        idx = np.random.choice(len(X), self.k, replace=False)
        self.centroids = X[idx].copy()
        for _ in range(self.max_iter):
            # Assignment
            dists = np.linalg.norm(X[:, None] - self.centroids[None], axis=2)
            labels = np.argmin(dists, axis=1)
            # Update
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(self.k)])
            if np.allclose(self.centroids, new_centroids):
                break  # Converged
            self.centroids = new_centroids
        self.labels_ = labels
        self.inertia_ = sum(np.sum((X[labels == k] - self.centroids[k])**2) for k in range(self.k))
        return self

# Elbow method to find optimal K
wcss = []
sil_scores = []
K_range = range(2, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X, km.labels_))

# Best K by silhouette (closest to 1.0 is best)
best_k = K_range[np.argmax(sil_scores)]
print(f"WCSS values: {[int(w) for w in wcss]}")
print(f"Silhouette scores: {[round(s, 3) for s in sil_scores]}")
print(f"Best K (silhouette): {best_k}")
```

Choosing K — Elbow method and Silhouette
Elbow method: Plot WCSS vs K. WCSS decreases sharply initially, then levels off. The 'elbow' — where the curve bends — is the optimal K. Subjective, but practical.
Silhouette score: For each point, s = (b−a)/max(a,b) where a = mean intra-cluster distance, b = mean nearest-cluster distance. s ∈ [−1, 1]. Higher is better. Average silhouette over all points gives a rigorous measure to compare different K values.
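The formula can be checked by hand against sklearn's `silhouette_samples` on a tiny illustrative dataset (two well-separated 1-D clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two well-separated 1-D clusters (illustrative data)
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Hand computation for the first point, x = 0:
# a = mean distance to own cluster      = (1 + 2) / 2       = 1.5
# b = mean distance to nearest cluster  = (10 + 11 + 12) / 3 = 11
a, b = 1.5, 11.0
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[0]
print(round(s_manual, 3), round(s_sklearn, 3))  # both ≈ 0.864
```

Note that `a` excludes the point itself (the mean is over the other members of its cluster), which is why the divisor is 2 rather than 3.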
| Property | K-Means | K-Medoid (PAM) |
|---|---|---|
| Cluster centre | Mean of cluster (may not exist in data) | Actual data point (medoid) |
| Robustness to outliers | Low — mean shifts toward outliers | High — medoid is unaffected by outliers |
| Distance metric | Squared Euclidean (the mean is its minimiser) | Any distance metric works |
| Computational cost | O(nKdT) — fast | O(K(n−K)²) — slower |
| Best for | Spherical clusters, no outliers | Data with outliers, non-Euclidean distances |
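sklearn itself does not ship K-Medoids (a `KMedoids` estimator lives in the separate scikit-learn-extra package), so here is a minimal from-scratch sketch. Note it uses the simpler alternating assignment/medoid-update scheme rather than PAM's full swap search, and Manhattan distance as the illustrative metric:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Simplified K-Medoids: alternate assignment and medoid update.
    (Not PAM's full swap search.) Any precomputed distance works;
    Manhattan distance is used here as an illustration."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise Manhattan
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        # New medoid of each cluster: the member with minimal total
        # distance to the other members
        new = np.array([np.flatnonzero(labels == j)[
                            np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=1))]
                        for j in range(k)])
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Two tight groups plus one extreme outlier (all points distinct)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 10], [11, 10], [10, 11], [11, 11],
              [100, 100]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids])  # medoids are actual rows of X
```

Because a medoid must be one of the cluster's own points, the `[100, 100]` outlier cannot drag a centre towards itself the way it would shift a K-Means centroid.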
K-Means limitations (GATE tested)
K-Means assumes: (1) Spherical cluster shapes — fails on elongated or non-convex clusters. (2) Similar cluster sizes — large clusters split, small clusters merge. (3) Similar cluster densities — dense clusters absorb sparse ones. (4) Euclidean distance — fails for categorical data or non-Euclidean spaces. Alternatives: DBSCAN for arbitrary shapes, GMM for probabilistic clusters, K-Medoid for robustness.
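The spherical-shape assumption can be demonstrated with sklearn's `make_moons`: K-Means cuts the two crescents with a straight boundary, while DBSCAN follows their density (the `eps` value below is hand-tuned for this particular dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: non-convex, so the spherical assumption fails
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
db = DBSCAN(eps=0.25, min_samples=5).fit(X)  # eps hand-tuned for this data

# Agreement with the true crescents (adjusted Rand index, 1.0 = perfect)
print("K-Means ARI:", round(adjusted_rand_score(y, km.labels_), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y, db.labels_), 3))
```

DBSCAN's score should come out well above K-Means' on this shape, illustrating why algorithm choice must follow cluster geometry.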
Practice questions (GATE-style)
- K-Means converges when: (Answer: The cluster assignments do not change between iterations (centroids stop moving). WCSS is guaranteed to be non-increasing, so the algorithm always terminates.)
- Why does K-Means++ give better results than random initialisation? (Answer: K-Means++ selects initial centroids probabilistically — each new centroid is chosen with probability proportional to its distance from the nearest existing centroid. This spreads centroids better, reducing risk of poor local minima.)
- If all points in a dataset are equidistant from each other, K-Means with K=2 will: (Answer: Assign points arbitrarily — any partition is equally valid. Distance-based clustering fails when all pairwise distances are equal.)
- What is the time complexity of one iteration of K-Means? (Answer: O(n × K × d) — for each of n points, compute distance to each of K centroids in d dimensions.)
- A silhouette score of −0.3 for a point means: (Answer: The point is closer to a different cluster than its own — it is likely misclassified. Values near 0 = on cluster boundary; near +1 = well inside own cluster; near −1 = in wrong cluster.)
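The initialisation question above can also be checked empirically. In this sketch (synthetic blobs, single-initialisation runs over several seeds), k-means++ starts tend to finish with lower average WCSS than purely random starts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=8, cluster_std=0.5, random_state=1)

# Single-init runs over several seeds: compare final WCSS for each scheme
wcss = {init: [KMeans(n_clusters=8, init=init, n_init=1, random_state=s)
               .fit(X).inertia_ for s in range(20)]
        for init in ('random', 'k-means++')}

print("random    mean WCSS:", round(np.mean(wcss['random']), 1))
print("k-means++ mean WCSS:", round(np.mean(wcss['k-means++']), 1))
```

With eight tight clusters, random initialisation frequently places two centroids in one blob and none in another, which is precisely the poor local minimum k-means++ seeding is designed to avoid.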
On LumiChats
K-Means is used in LLM research for clustering sentence embeddings — understanding what topics a model knows about, finding which document chunks are similar, and compressing token vocabularies (K-Means on character n-gram embeddings was used in early subword tokenisation methods).