Maximum Likelihood Estimation (MLE) finds the model parameters that make the observed training data most probable. Maximum A Posteriori (MAP) estimation extends MLE by incorporating a prior belief about the parameters (Bayesian inference). MLE maximises P(data|θ); MAP maximises the posterior P(θ|data) ∝ P(data|θ) × P(θ). Probabilistic ML provides a principled framework for uncertainty quantification, model comparison, and regularisation. Gaussian Mixture Models (GMMs) are a key probabilistic model for soft clustering and density estimation.
MLE — making observed data most likely
MLE: find the θ that maximises the log-likelihood (a sum of logs is numerically stabler than a product of many small probabilities). For a Gaussian, MLE gives the sample mean and variance. For logistic regression, MLE is equivalent to minimising cross-entropy. For linear regression under Gaussian noise, MLE is equivalent to OLS.
MLE for Gaussian parameters and logistic regression
import numpy as np
from scipy import stats
from scipy.optimize import minimize
# MLE for Gaussian distribution parameters
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=1000) # True μ=5, σ=2
# Analytical MLE for Gaussian:
mu_mle = np.mean(data) # MLE estimate of mean
sigma_mle = np.std(data, ddof=0) # MLE estimate of std (biased — divides by n, not n-1)
sigma_unb = np.std(data, ddof=1) # Bessel-corrected estimate (divides by n-1; unbiased for the variance)
print(f"True μ=5.0, MLE μ = {mu_mle:.4f}")
print(f"True σ=2.0, MLE σ = {sigma_mle:.4f} Unbiased σ = {sigma_unb:.4f}")
# Numerical MLE via log-likelihood maximisation
def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
result = minimize(neg_log_likelihood, x0=[0, 1], args=(data,), method='Nelder-Mead')
print(f"Numerical MLE: μ = {result.x[0]:.4f}, σ = {result.x[1]:.4f}")
# MLE for coin flip (Bernoulli distribution)
# P(k heads out of n flips | p) = C(n,k) p^k (1-p)^(n-k)
# MLE: p_hat = k/n (fraction of heads)
n_flips, n_heads = 100, 65
p_mle = n_heads / n_flips
print(f"\nCoin MLE: p = {p_mle:.2f}") # 0.65
# MLE = OLS for linear regression (under Gaussian noise assumption)
# MLE = cross-entropy minimisation for logistic regression
# → Both are maximising the likelihood of the observed labels
MAP — incorporating prior knowledge
MAP adds the log prior log P(θ) to the log-likelihood. With a Gaussian prior N(0, σ²) on each weight: log P(θ) = -λΣθⱼ² + const (λ = 1/(2σ²)), i.e. L2 regularisation. With a Laplace prior: log P(θ) = -λΣ|θⱼ| + const, i.e. L1 regularisation. MAP = MLE + regularisation. Ridge regression IS MAP with a Gaussian prior. Lasso IS MAP with a Laplace prior.
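The MAP = MLE + prior recipe can be sketched numerically for the simplest case: estimating a Gaussian mean with known noise scale and a Gaussian prior on the mean. The data, prior parameters, and function names below are illustrative choices, not from the text above; the point is that the numerical optimum of the penalised objective matches the closed-form precision-weighted average, and sits between the prior mean and the sample mean.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=20)  # small sample, so the prior matters
sigma = 2.0                  # noise scale, assumed known
mu0, tau = 0.0, 1.0          # prior belief: mu ~ N(0, 1)

def neg_log_posterior(params):
    mu = params[0]
    nll = -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))  # -log P(data|mu)
    nlp = -stats.norm.logpdf(mu, loc=mu0, scale=tau)             # -log P(mu)
    return nll + nlp

mu_map = minimize(neg_log_posterior, x0=[0.0], method='Nelder-Mead').x[0]

# Closed form: precision-weighted average of the sample mean and the prior mean
n = len(data)
mu_closed = (n / sigma**2 * data.mean() + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)
print(f"MAP (numerical) = {mu_map:.4f}, MAP (closed form) = {mu_closed:.4f}")
# MAP is shrunk away from the sample mean toward mu0 — the prior acts as regularisation
```

With only 20 points the MAP estimate is pulled noticeably toward the prior mean; as n grows, the likelihood term dominates and MAP converges to the MLE.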
| Method | Objective | Prior | Equivalent ML method |
|---|---|---|---|
| MLE | max P(D|θ) | None (uniform) | OLS linear regression, cross-entropy |
| MAP with Gaussian prior | max P(D|θ) P(θ) | N(0, σ²) | Ridge regression (L2) |
| MAP with Laplace prior | max P(D|θ) P(θ) | Laplace(0, b) | Lasso regression (L1) |
| Full Bayesian | Posterior P(θ|D) | Any prior | Bayesian Neural Networks, GPs |
Gaussian Mixture Models (GMM) — soft probabilistic clustering
GMM models the data as a mixture of K Gaussian distributions. Unlike K-Means (hard assignment), GMM gives each point a probability of belonging to each cluster — soft assignment. Parameters: K mixture weights (πₖ), K means (μₖ), K covariance matrices (Σₖ). Estimated via the EM algorithm (Expectation-Maximisation).
Gaussian Mixture Models for soft clustering and density estimation
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
import numpy as np
# Create dataset with 3 Gaussian clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 0.3], random_state=42)
# GMM fitting
gmm = GaussianMixture(
    n_components=3,
    covariance_type='full',  # Each cluster has its own full covariance matrix
                             # 'tied': all clusters share the same covariance
                             # 'diag': diagonal covariance (faster, assumes independent features)
                             # 'spherical': single variance per cluster (like K-Means)
    max_iter=100,
    n_init=5,                # Multiple random restarts; the best fit is kept
    random_state=42
)
gmm.fit(X)
# Hard assignment (most likely cluster)
y_pred_hard = gmm.predict(X)
# SOFT assignment (probability of each cluster) — key difference from K-Means!
y_proba = gmm.predict_proba(X)
print("Soft assignments (first 5 rows):")
print(y_proba[:5].round(3))
# [[0.99, 0.01, 0.00], ← 99% cluster 0, almost certainly there
#  [0.45, 0.50, 0.05], ← Ambiguous: 45% cluster 0, 50% cluster 1
#  [0.00, 0.00, 1.00]] ← Definitely in cluster 2
# Density estimation: GMM is a generative model
# Can compute log-likelihood of ANY data point
log_likelihoods = gmm.score_samples(X) # Log-likelihood per sample
anomaly_threshold = np.percentile(log_likelihoods, 5) # 5th percentile = anomalies
anomalies = X[log_likelihoods < anomaly_threshold]
print(f"\nAnomalies detected: {len(anomalies)} (lowest 5% log-likelihood)")
# Generate new synthetic data from the learned distribution
X_new = gmm.sample(50)[0] # 50 new points sampled from GMM
print(f"Generated {len(X_new)} synthetic points")
# Model selection: choose K using BIC (lower is better)
bic_scores = []
for k in range(1, 8):
    gmm_k = GaussianMixture(n_components=k, n_init=3, random_state=42)
    gmm_k.fit(X)
    bic_scores.append(gmm_k.bic(X))
    print(f"K={k}: BIC = {gmm_k.bic(X):.1f}")
best_k = np.argmin(bic_scores) + 1
print(f"Optimal K (BIC): {best_k}")
| Property | K-Means | GMM |
|---|---|---|
| Cluster shape | Spherical (Euclidean distance) | Elliptical (covariance matrix) |
| Assignment | Hard (0 or 1 per cluster) | Soft (probability per cluster) |
| Model type | Discriminative / distance-based | Generative / probabilistic |
| Can generate new data? | No | Yes — sample from mixture |
| Outlier detection | No | Yes — low likelihood = outlier |
| Convergence | Fast (Lloyd algorithm) | Slower (EM algorithm) |
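The EM loop that sklearn runs internally can be sketched by hand for a 1D, two-component mixture. The synthetic data, seed, and variable names below are illustrative assumptions; the sketch alternates the E-step (responsibilities, i.e. the soft assignments from the table above) and the M-step (responsibility-weighted updates of weights, means, and variances).

```python
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated 1D clusters at roughly 0 and 6
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initial guesses for the two components
pi = np.array([0.5, 0.5])       # mixture weights
mu = np.array([-1.0, 1.0])      # means
sigma = np.array([1.0, 1.0])    # standard deviations

for _ in range(50):
    # E-step: responsibilities — P(cluster k | point), shape (n, 2)
    dens = pi * np.column_stack([gauss_pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted re-estimates of all parameters
    Nk = resp.sum(axis=0)                                      # effective cluster sizes
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print("weights:", pi.round(2), "means:", mu.round(2), "stds:", sigma.round(2))
```

The recovered means converge near the true cluster centres (≈0 and ≈6). This is the same alternation sklearn's `GaussianMixture.fit` performs, generalised to multivariate Gaussians and full covariance matrices.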
Practice questions
- MLE for a fair coin (n=100 flips, k=55 heads). What is p_hat_MLE? (Answer: p_hat = k/n = 55/100 = 0.55. MLE simply maximises the probability of observing the data, which for a Bernoulli is the sample mean.)
- How does Ridge regression relate to MAP estimation? (Answer: Ridge = MAP with a Gaussian prior on the weights, θ ~ N(0, σ²I). The L2 penalty λΣθⱼ² equals -log P(θ) up to a constant, with λ ∝ 1/σ². Minimising MSE + λΣθⱼ² = maximising log P(data|θ) + log P(θ) = MAP.)
- What does the EM algorithm do in GMM training? (Answer: E-step (Expectation): compute soft assignments — probability each point belongs to each cluster given current parameters. M-step (Maximisation): update cluster means, covariances, and weights to maximise expected likelihood given soft assignments. Repeat until convergence.)
- GMM soft assignment gives [0.45, 0.55, 0.00] for a data point. What does this mean? (Answer: The point is 45% likely to belong to cluster 0 and 55% to cluster 1 — it is on the boundary between the two clusters. K-Means would force a hard assignment; GMM quantifies this ambiguity.)
- Why is BIC used to choose K in GMM? (Answer: More clusters = higher likelihood (more parameters can fit the data better). BIC penalises model complexity: BIC = -2 log(likelihood) + k × log(n), where k = number of parameters. Penalises adding unnecessary clusters. Choose K that minimises BIC — balances fit and complexity.)
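The BIC bookkeeping in the last answer can be checked against sklearn's `gmm.bic`: for a full-covariance GMM with K components in d dimensions, the free-parameter count is (K−1) mixture weights + K·d means + K·d(d+1)/2 covariance entries. The dataset and seed below are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
n, d = X.shape
K = 2

gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=0).fit(X)

# Free parameters: (K-1) weights + K*d means + K*d*(d+1)/2 covariance entries
n_params = (K - 1) + K * d + K * (d * (d + 1) // 2)
log_lik = gmm.score(X) * n          # score() returns the MEAN log-likelihood per sample
bic_manual = -2 * log_lik + n_params * np.log(n)
print(f"manual BIC = {bic_manual:.2f}, sklearn BIC = {gmm.bic(X):.2f}")  # the two agree
```

This makes the complexity penalty concrete: every extra component adds 1 + d + d(d+1)/2 parameters, each costing log(n) in BIC, so K only grows when the likelihood gain outweighs that cost.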
On LumiChats
GMM is used in speech recognition (acoustic models), image segmentation (pixel clustering), and anomaly detection. MLE and MAP underpin virtually every ML algorithm. LumiChats can derive MLE estimators for any distribution you specify and explain the probabilistic interpretation of your ML model.
Try it free