Maximum Likelihood Estimation (MLE) finds the model parameters that make the observed training data most probable. Maximum A Posteriori (MAP) estimation extends MLE by incorporating a prior belief about the parameters (Bayesian inference). MLE maximises P(data|θ); MAP maximises the posterior P(θ|data) ∝ P(data|θ) × P(θ). Probabilistic ML provides a principled framework for uncertainty quantification, model comparison, and regularisation. Gaussian Mixture Models (GMMs) are a key probabilistic model for soft clustering and density estimation.
MLE — making observed data most likely
MLE: find the θ that maximises the log-likelihood (a sum of logs is numerically stabler than a product of many small probabilities). For a Gaussian, MLE gives the sample mean and variance. For logistic regression, MLE is equivalent to minimising cross-entropy. For linear regression under Gaussian noise, MLE is equivalent to OLS.
MLE for Gaussian parameters and logistic regression
import numpy as np
from scipy import stats
from scipy.optimize import minimize
# MLE for Gaussian distribution parameters
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=1000) # True μ=5, σ=2
# Analytical MLE for Gaussian:
mu_mle = np.mean(data) # MLE estimate of mean
sigma_mle = np.std(data, ddof=0) # MLE estimate of std (biased — divides by n, not n-1)
sigma_unb = np.std(data, ddof=1) # Bessel-corrected estimate (divides by n-1; unbiased for the variance)
print(f"True μ=5.0, MLE μ = {mu_mle:.4f}")
print(f"True σ=2.0, MLE σ = {sigma_mle:.4f} Unbiased σ = {sigma_unb:.4f}")
# Numerical MLE via log-likelihood maximisation
def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
result = minimize(neg_log_likelihood, x0=[0, 1], args=(data,), method='Nelder-Mead')
print(f"Numerical MLE: μ = {result.x[0]:.4f}, σ = {result.x[1]:.4f}")
# MLE for coin flip (Bernoulli distribution)
# P(k heads out of n flips | p) = C(n,k) p^k (1-p)^(n-k)
# MLE: p_hat = k/n (fraction of heads)
n_flips, n_heads = 100, 65
p_mle = n_heads / n_flips
print(f"\nCoin MLE: p = {p_mle:.2f}") # 0.65
# MLE = OLS for linear regression (under Gaussian noise assumption)
# MLE = cross-entropy minimisation for logistic regression
# → Both are maximising the likelihood of the observed labels
MAP — incorporating prior knowledge
MAP adds the log prior log P(θ) to the log-likelihood. With a Gaussian prior N(0, σ²) on each weight: log P(θ) = -λΣθⱼ² + const (λ = 1/(2σ²)), i.e. L2 regularisation. With a Laplace prior: log P(θ) = -λΣ|θⱼ| + const, i.e. L1 regularisation. MAP = MLE + regularisation. Ridge regression IS MAP with a Gaussian prior. Lasso IS MAP with a Laplace prior.
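The MAP = MLE + prior recipe can be sketched numerically for the simplest case: estimating a Gaussian mean with known noise scale and a Gaussian prior on the mean. The data, prior parameters, and function names below are illustrative choices, not from the text above; the point is that the numerical optimum of the penalised objective matches the closed-form precision-weighted average, and sits between the prior mean and the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=20)  # small sample, so the prior matters
sigma = 2.0                  # noise scale, assumed known
mu0, tau = 0.0, 1.0          # prior belief: mu ~ N(0, 1)

def neg_log_posterior(params):
    mu = params[0]
    nll = -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))  # -log P(data|mu)
    nlp = -stats.norm.logpdf(mu, loc=mu0, scale=tau)             # -log P(mu)
    return nll + nlp

mu_map = minimize(neg_log_posterior, x0=[0.0], method='Nelder-Mead').x[0]

# Closed form: precision-weighted average of the sample mean and the prior mean
n = len(data)
mu_closed = (n / sigma**2 * data.mean() + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)
print(f"MAP (numerical) = {mu_map:.4f}, MAP (closed form) = {mu_closed:.4f}")
# MAP is shrunk away from the sample mean toward mu0 — the prior acts as regularisation
```

With only 20 points the MAP estimate is pulled noticeably toward the prior mean; as n grows, the likelihood term dominates and MAP converges to the MLE.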
| Method | Objective | Prior | Equivalent ML method |
|---|---|---|---|
| MLE | max P(D|θ) | None (uniform) | OLS linear regression, cross-entropy |
| MAP with Gaussian prior | max P(D|θ) P(θ) | N(0, σ²) | Ridge regression (L2) |
| MAP with Laplace prior | max P(D|θ) P(θ) | Laplace(0, b) | Lasso regression (L1) |
| Full Bayesian | Posterior P(θ|D) | Any prior | Bayesian Neural Networks, GPs |
Gaussian Mixture Models (GMM) — soft probabilistic clustering
GMM models the data as a mixture of K Gaussian distributions. Unlike K-Means (hard assignment), GMM gives each point a probability of belonging to each cluster — soft assignment. Parameters: K mixture weights (πₖ), K means (μₖ), K covariance matrices (Σₖ). Estimated via the EM algorithm (Expectation-Maximisation).
Gaussian Mixture Models for soft clustering and density estimation
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import numpy as np
# Create dataset with 3 Gaussian clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 0.3], random_state=42)
# GMM fitting
gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # Each cluster has its own full covariance matrix
                             # 'tied': all clusters share the same covariance
                             # 'diag': diagonal covariance (faster, assumes independent features)
                             # 'spherical': single variance per cluster (like K-Means)
    max_iter=100,
    n_init=5,                # Multiple random restarts; the best fit is kept
    random_state=42
)
gmm.fit(X)
# Hard assignment (most likely cluster)
y_pred_hard = gmm.predict(X)
# SOFT assignment (probability of each cluster) — key difference from K-Means!
y_proba = gmm.predict_proba(X)
print("Soft assignments (first 5 rows):")
print(y_proba[:5].round(3))
# [[0.99, 0.01, 0.00], ← 99% cluster 0, almost certainly there
#  [0.45, 0.50, 0.05], ← Ambiguous: 45% cluster 0, 50% cluster 1
#  [0.00, 0.00, 1.00]] ← Definitely in cluster 2
# Density estimation: GMM is a generative model
# Can compute log-likelihood of ANY data point
log_likelihoods = gmm.score_samples(X) # Log-likelihood per sample
anomaly_threshold = np.percentile(log_likelihoods, 5) # 5th percentile = anomalies
anomalies = X[log_likelihoods < anomaly_threshold]
print(f"\nAnomalies detected: {len(anomalies)} (lowest 5% log-likelihood)")
# Generate new synthetic data from the learned distribution
X_new = gmm.sample(50)[0] # 50 new points sampled from GMM
print(f"Generated {len(X_new)} synthetic points")
# Model selection: choose K using BIC (lower is better)
bic_scores = []
for k in range(1, 8):
    gmm_k = GaussianMixture(n_components=k, n_init=3, random_state=42)
    gmm_k.fit(X)
    bic_scores.append(gmm_k.bic(X))
    print(f"K={k}: BIC = {gmm_k.bic(X):.1f}")
best_k = np.argmin(bic_scores) + 1
print(f"Optimal K (BIC): {best_k}")
| Property | K-Means | GMM |
|---|---|---|
| Cluster shape | Spherical (Euclidean distance) | Elliptical (covariance matrix) |
| Assignment | Hard (0 or 1 per cluster) | Soft (probability per cluster) |
| Model type | Discriminative / distance-based | Generative / probabilistic |
| Can generate new data? | No | Yes — sample from mixture |
| Outlier detection | No | Yes — low likelihood = outlier |
| Convergence | Fast (Lloyd algorithm) | Slower (EM algorithm) |
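The EM loop that sklearn runs internally can be sketched by hand for a 1D, two-component mixture. The synthetic data, seed, and variable names below are illustrative assumptions; the sketch alternates the E-step (responsibilities, i.e. the soft assignments from the table above) and the M-step (responsibility-weighted updates of weights, means, and variances).

```python
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated 1D clusters at roughly 0 and 6
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initial guesses for the two components
pi = np.array([0.5, 0.5])       # mixture weights
mu = np.array([-1.0, 1.0])      # means
sigma = np.array([1.0, 1.0])    # standard deviations

for _ in range(50):
    # E-step: responsibilities — P(cluster k | point), shape (n, 2)
    dens = pi * np.column_stack([gauss_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted re-estimates of all parameters
    Nk = resp.sum(axis=0)                                      # effective cluster sizes
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print("weights:", pi.round(2), "means:", mu.round(2), "stds:", sigma.round(2))
```

The recovered means converge near the true cluster centres (≈0 and ≈6). This is the same alternation sklearn's `GaussianMixture.fit` performs, generalised to multivariate Gaussians and full covariance matrices.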
Practice questions
- MLE for a fair coin (n=100 flips, k=55 heads). What is p_hat_MLE? (Answer: p_hat = k/n = 55/100 = 0.55. MLE simply maximises the probability of observing the data, which for a Bernoulli is the sample mean.)
- How does Ridge regression relate to MAP estimation? (Answer: Ridge = MAP with a Gaussian prior on the weights, θ ~ N(0, σ²I). The L2 penalty λΣθⱼ² equals -log P(θ) up to a constant, with λ ∝ 1/σ². Minimising MSE + λΣθⱼ² = maximising log P(data|θ) + log P(θ) = MAP.)
- What does the EM algorithm do in GMM training? (Answer: E-step (Expectation): compute soft assignments — probability each point belongs to each cluster given current parameters. M-step (Maximisation): update cluster means, covariances, and weights to maximise expected likelihood given soft assignments. Repeat until convergence.)
- GMM soft assignment gives [0.45, 0.55, 0.00] for a data point. What does this mean? (Answer: The point is 45% likely to belong to cluster 0 and 55% to cluster 1 — it is on the boundary between the two clusters. K-Means would force a hard assignment; GMM quantifies this ambiguity.)
- Why is BIC used to choose K in GMM? (Answer: More clusters = higher likelihood (more parameters can fit the data better). BIC penalises model complexity: BIC = -2 log(likelihood) + k × log(n), where k = number of parameters. Penalises adding unnecessary clusters. Choose K that minimises BIC — balances fit and complexity.)
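The BIC bookkeeping in the last answer can be checked against sklearn's `gmm.bic`: for a full-covariance GMM with K components in d dimensions, the free-parameter count is (K−1) mixture weights + K·d means + K·d(d+1)/2 covariance entries. The dataset and seed below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
n, d = X.shape
K = 2

gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=0).fit(X)

# Free parameters: (K-1) weights + K*d means + K*d*(d+1)/2 covariance entries
n_params = (K - 1) + K * d + K * (d * (d + 1) // 2)
log_lik = gmm.score(X) * n          # score() returns the MEAN log-likelihood per sample
bic_manual = -2 * log_lik + n_params * np.log(n)
print(f"manual BIC = {bic_manual:.2f}, sklearn BIC = {gmm.bic(X):.2f}")  # the two agree
```

This makes the complexity penalty concrete: every extra component adds 1 + d + d(d+1)/2 parameters, each costing log(n) in BIC, so K only grows when the likelihood gain outweighs that cost.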
On LumiChats
GMM is used in speech recognition (acoustic models), image segmentation (pixel clustering), and anomaly detection. MLE and MAP underpin virtually every ML algorithm. LumiChats can derive MLE estimators for any distribution you specify and explain the probabilistic interpretation of your ML model.
Try it free