K-Means is the most widely used unsupervised clustering algorithm. It partitions n data points into K clusters by iteratively assigning each point to its nearest cluster centroid and recomputing centroids. K-Medoid (PAM — Partitioning Around Medoids) is a more robust variant that uses actual data points as cluster centres (medoids) instead of means, making it resistant to outliers. Both are core GATE DS&AI topics — expect questions on convergence, the elbow method for choosing K, and complexity.
Real-life analogy: Organising files into folders
Imagine you have 1000 documents with no labels. You want to group them into 5 topic folders. K-Means does this: randomly place 5 folder labels across the documents. Assign every document to its nearest folder. Move each folder label to the centre of its documents. Repeat until folders stop moving. The algorithm converges to a stable grouping where each document is in its most similar folder.
K-Means algorithm step by step
- Initialise: Randomly select K data points as initial centroids (or use K-Means++ for smarter initialisation).
- Assignment step: Assign each data point xᵢ to the nearest centroid: cluster(i) = argmin_k ||xᵢ − μₖ||².
- Update step: Recompute each centroid as the mean of assigned points: μₖ = (1/|Cₖ|) Σ_{i∈Cₖ} xᵢ.
- Repeat the assignment and update steps until assignments no longer change (convergence) or the maximum number of iterations is reached.
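The assignment and update steps can be traced by hand on a toy 1-D dataset (the values and starting centroids below are illustrative, not from the source):

```python
import numpy as np

# Toy 1-D data and two illustrative starting centroids
X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([2.0, 11.0])

# Assignment step: each point goes to its nearest centroid
labels = np.argmin(np.abs(X[:, None] - centroids[None, :]), axis=1)
print(labels)  # [0 0 0 1 1 1]

# Update step: each centroid becomes the mean of its assigned points
centroids = np.array([X[labels == k].mean() for k in range(2)])
print(centroids)  # cluster means: 2.0 and 11.0
```

Here the centroids are already at the cluster means, so a second pass would leave the assignments unchanged and the algorithm would stop.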
K-Means objective: minimise the Within-Cluster Sum of Squares (WCSS). Each iteration either decreases WCSS or leaves it unchanged, and there are only finitely many possible partitions, so K-Means is guaranteed to converge. However, it may converge to a local minimum rather than the global one; running it multiple times with different initialisations and keeping the best result helps.
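The local-minimum behaviour can be observed directly. In this sketch (synthetic blobs, sklearn's KMeans with a single random initialisation per seed), the final WCSS varies with the starting centroids, and `n_init` restarts keep the best run:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# One run per seed with random initialisation: final WCSS depends on the start
inertias = [KMeans(n_clusters=4, init='random', n_init=1, random_state=s)
            .fit(X).inertia_ for s in range(10)]

# n_init=10 restarts internally and keeps the lowest-WCSS run
best = KMeans(n_clusters=4, init='random', n_init=10, random_state=0).fit(X)
print(sorted(round(i, 1) for i in inertias))
print(round(best.inertia_, 1))
```

Seeds that land two centroids in the same blob finish with visibly higher WCSS, which is exactly the local-minimum trap the restarts mitigate.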
K-Means from scratch and sklearn with elbow method
```python
import numpy as np
import matplotlib
matplotlib.use('Agg')
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate 4-cluster data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.8, random_state=42)

# ── K-Means from scratch ──
class KMeansScratch:
    def __init__(self, k, max_iter=100):
        self.k, self.max_iter = k, max_iter

    def fit(self, X):
        # Random initialisation
        idx = np.random.choice(len(X), self.k, replace=False)
        self.centroids = X[idx].copy()
        for _ in range(self.max_iter):
            # Assignment
            dists = np.linalg.norm(X[:, None] - self.centroids[None], axis=2)
            labels = np.argmin(dists, axis=1)
            # Update
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(self.k)])
            if np.allclose(self.centroids, new_centroids):
                break  # Converged
            self.centroids = new_centroids
        self.labels_ = labels
        self.inertia_ = sum(np.sum((X[labels == k] - self.centroids[k])**2) for k in range(self.k))
        return self

# Elbow method to find optimal K
wcss = []
sil_scores = []
K_range = range(2, 10)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
    sil_scores.append(silhouette_score(X, km.labels_))

# Best K by silhouette (closest to 1.0 is best)
best_k = K_range[np.argmax(sil_scores)]
print(f"WCSS values: {[int(w) for w in wcss]}")
print(f"Silhouette scores: {[round(s, 3) for s in sil_scores]}")
print(f"Best K (silhouette): {best_k}")
```

Choosing K — Elbow method and Silhouette
Elbow method: Plot WCSS vs K. WCSS decreases sharply initially, then levels off. The 'elbow' — where the curve bends — is the optimal K. Subjective, but practical.
Silhouette score: For each point, s = (b−a)/max(a,b) where a = mean intra-cluster distance, b = mean nearest-cluster distance. s ∈ [−1, 1]. Higher is better. Average silhouette over all points gives a rigorous measure to compare different K values.
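The formula can be checked by hand against sklearn's `silhouette_samples` on a tiny illustrative dataset (two well-separated 1-D clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two well-separated 1-D clusters (illustrative data)
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Hand computation for the first point, x = 0:
# a = mean distance to own cluster      = (1 + 2) / 2       = 1.5
# b = mean distance to nearest cluster  = (10 + 11 + 12) / 3 = 11
a, b = 1.5, 11.0
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[0]
print(round(s_manual, 3), round(s_sklearn, 3))  # both ≈ 0.864
```

Note that `a` excludes the point itself (the mean is over the other members of its cluster), which is why the divisor is 2 rather than 3.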
| Property | K-Means | K-Medoid (PAM) |
|---|---|---|
| Cluster centre | Mean of cluster (may not exist in data) | Actual data point (medoid) |
| Robustness to outliers | Low — mean shifts toward outliers | High — medoid is unaffected by outliers |
| Distance metric | Squared Euclidean (the mean is its minimiser) | Any distance metric works |
| Computational cost | O(nKdT) — fast | O(K(n−K)²) — slower |
| Best for | Spherical clusters, no outliers | Data with outliers, non-Euclidean distances |
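sklearn itself does not ship K-Medoids (a `KMedoids` estimator lives in the separate scikit-learn-extra package), so here is a minimal from-scratch sketch. Note it uses the simpler alternating assignment/medoid-update scheme rather than PAM's full swap search, and Manhattan distance as the illustrative metric:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Simplified K-Medoids: alternate assignment and medoid update.
    (Not PAM's full swap search.) Any precomputed distance works;
    Manhattan distance is used here as an illustration."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise Manhattan
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        # New medoid of each cluster: the member with minimal total
        # distance to the other members
        new = np.array([np.flatnonzero(labels == j)[
                            np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=1))]
                        for j in range(k)])
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Two tight groups plus one extreme outlier (all points distinct)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [10, 10], [11, 10], [10, 11], [11, 11],
              [100, 100]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids])  # medoids are actual rows of X
```

Because a medoid must be one of the cluster's own points, the `[100, 100]` outlier cannot drag a centre towards itself the way it would shift a K-Means centroid.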
K-Means limitations (GATE tested)
K-Means assumes: (1) Spherical cluster shapes — fails on elongated or non-convex clusters. (2) Similar cluster sizes — large clusters split, small clusters merge. (3) Similar cluster densities — dense clusters absorb sparse ones. (4) Euclidean distance — fails for categorical data or non-Euclidean spaces. Alternatives: DBSCAN for arbitrary shapes, GMM for probabilistic clusters, K-Medoid for robustness.
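The spherical-shape assumption can be demonstrated with sklearn's `make_moons`: K-Means cuts the two crescents with a straight boundary, while DBSCAN follows their density (the `eps` value below is hand-tuned for this particular dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: non-convex, so the spherical assumption fails
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
db = DBSCAN(eps=0.25, min_samples=5).fit(X)  # eps hand-tuned for this data

# Agreement with the true crescents (adjusted Rand index, 1.0 = perfect)
print("K-Means ARI:", round(adjusted_rand_score(y, km.labels_), 3))
print("DBSCAN  ARI:", round(adjusted_rand_score(y, db.labels_), 3))
```

DBSCAN's score should come out well above K-Means' on this shape, illustrating why algorithm choice must follow cluster geometry.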
Practice questions (GATE-style)
- K-Means converges when: (Answer: The cluster assignments do not change between iterations (centroids stop moving). WCSS is guaranteed to be non-increasing, so the algorithm always terminates.)
- Why does K-Means++ give better results than random initialisation? (Answer: K-Means++ selects initial centroids probabilistically — each new centroid is chosen with probability proportional to its distance from the nearest existing centroid. This spreads centroids better, reducing risk of poor local minima.)
- If all points in a dataset are equidistant from each other, K-Means with K=2 will: (Answer: Assign points arbitrarily — any partition is equally valid. Distance-based clustering fails when all pairwise distances are equal.)
- What is the time complexity of one iteration of K-Means? (Answer: O(n × K × d) — for each of n points, compute distance to each of K centroids in d dimensions.)
- A silhouette score of −0.3 for a point means: (Answer: The point is closer to a different cluster than its own — it is likely misclassified. Values near 0 = on cluster boundary; near +1 = well inside own cluster; near −1 = in wrong cluster.)
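The initialisation question above can also be checked empirically. In this sketch (synthetic blobs, single-initialisation runs over several seeds), k-means++ starts tend to finish with lower average WCSS than purely random starts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=8, cluster_std=0.5, random_state=1)

# Single-init runs over several seeds: compare final WCSS for each scheme
wcss = {init: [KMeans(n_clusters=8, init=init, n_init=1, random_state=s)
               .fit(X).inertia_ for s in range(20)]
        for init in ('random', 'k-means++')}

print("random    mean WCSS:", round(np.mean(wcss['random']), 1))
print("k-means++ mean WCSS:", round(np.mean(wcss['k-means++']), 1))
```

With eight tight clusters, random initialisation frequently places two centroids in one blob and none in another, which is precisely the poor local minimum k-means++ seeding is designed to avoid.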
On LumiChats
K-Means is used in LLM research for clustering sentence embeddings — understanding what topics a model knows about, finding which document chunks are similar, and compressing token vocabularies (K-Means on character n-gram embeddings was used in early subword tokenisation methods).