Approximate inference methods estimate posterior probabilities in Bayesian networks when exact inference (e.g., variable elimination) is intractable due to large treewidth or network size. The main sampling-based approaches are: Direct (Prior) Sampling, Rejection Sampling, Likelihood Weighting (LW), and Markov Chain Monte Carlo (MCMC) methods including Gibbs Sampling. GATE DA tests the mechanics of each method, their relative efficiency, when each is used, and numerical computations involving likelihood weights.
Why approximate inference?
Exact inference is #P-hard in general (harder than NP) for arbitrary Bayesian networks. When the network has many variables, large domains, or dense connectivity, variable elimination becomes impractical. Sampling-based methods trade exact answers for scalability: they converge to the true distribution as the number of samples N grows.
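The O(1/√N) behaviour of Monte Carlo error can be seen in a tiny standalone sketch (not tied to any particular network): estimate a known probability by simulation and watch the error shrink as N grows.

```python
import random

def mc_estimate(p, n, seed=0):
    """Estimate a Bernoulli probability p from n simulated draws."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

# Monte Carlo error shrinks like 1/sqrt(N): a 100x larger sample
# buys roughly one extra decimal digit of accuracy.
for n in (100, 10_000, 1_000_000):
    print(n, abs(mc_estimate(0.3, n) - 0.3))
```

The same 1/√N scaling governs every sampling method in this section; the methods differ only in how they handle evidence.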
| Method | Handles evidence? | Efficiency | Sample independence | Error type |
|---|---|---|---|---|
| Direct sampling | No | High (fast samples) | i.i.d. | Monte Carlo estimation error O(1/√N) |
| Rejection sampling | Yes, by rejecting non-matching | Low for rare evidence | i.i.d. (conditioned) | Unbiased; high variance for low P(E) |
| Likelihood weighting | Yes, via importance weights | Better than rejection | i.i.d. | Unbiased; degrades with many evidence vars |
| Gibbs sampling (MCMC) | Yes, naturally incorporated | Best for complex evidence | Correlated (Markov chain) | Asymptotically unbiased; burn-in needed |
Direct (Prior) Sampling
Direct sampling generates complete assignments of all network variables by sampling each variable in topological order from its CPT given already-sampled parent values. No evidence is incorporated — this samples from the prior distribution P(X₁,…,Xₙ).
Direct sampling from a Bayesian network
```python
import random

def direct_sample(bn, topological_order):
    """Generate one complete sample from the prior distribution P(X)."""
    sample = {}
    for var in topological_order:
        parents = bn.parents[var]
        parent_vals = tuple(sample[p] for p in parents)
        # Sample from P(var | parent_vals)
        prob_true = bn.cpt[var][parent_vals][True]
        sample[var] = (random.random() < prob_true)
    return sample

def estimate_prior_probability(bn, query_var, query_val, N=10000):
    """Estimate P(query_var=query_val) from N direct samples."""
    count = sum(
        1 for _ in range(N)
        if direct_sample(bn, bn.topological_order)[query_var] == query_val
    )
    return count / N
```

Consistency: the estimated P(Q=q) converges to the true prior probability by the law of large numbers. However, direct sampling cannot handle evidence — to condition on E=e, a separate mechanism is needed.
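To make the interface concrete, here is a hypothetical two-node network (Rain → WetGrass, invented for illustration) in the same dict layout the code above assumes; the sampler is restated inline so the snippet runs on its own.

```python
import random

# Hypothetical Rain -> WetGrass network: parents[var] is a list,
# cpt[var][parent_values][value] is a probability.
parents = {"Rain": [], "WetGrass": ["Rain"]}
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {(True,): {True: 0.9, False: 0.1},
                 (False,): {True: 0.1, False: 0.9}},
}
order = ["Rain", "WetGrass"]

def direct_sample():
    """Ancestral sampling: draw each variable given its sampled parents."""
    s = {}
    for var in order:
        pv = tuple(s[p] for p in parents[var])
        s[var] = random.random() < cpt[var][pv][True]
    return s

random.seed(0)
N = 20_000
est = sum(direct_sample()["WetGrass"] for _ in range(N)) / N
# Exact prior: P(WetGrass) = 0.2*0.9 + 0.8*0.1 = 0.26; est should be close.
print(round(est, 3))
```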
Rejection Sampling
Rejection sampling handles evidence by generating samples from the prior and rejecting those that do not match the evidence E=e. P(Q=q|E=e) ≈ (# accepted samples where Q=q) / (# accepted samples).
Critical limitation: rare evidence
If P(E=e) = 0.001, only 1 in 1,000 prior samples will be accepted. To obtain 1,000 accepted samples, you need roughly 1,000,000 total samples — extremely wasteful. Rejection sampling is impractical when the evidence has low prior probability. Likelihood weighting solves this.
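A minimal sketch of rejection sampling on a made-up toy network (Rain → WetGrass, values invented) shows both the mechanics and the waste: the acceptance rate equals the evidence probability P(E=e).

```python
import random

parents = {"Rain": [], "WetGrass": ["Rain"]}
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {(True,): {True: 0.9, False: 0.1},
                 (False,): {True: 0.1, False: 0.9}},
}
order = ["Rain", "WetGrass"]

def direct_sample():
    s = {}
    for var in order:
        pv = tuple(s[p] for p in parents[var])
        s[var] = random.random() < cpt[var][pv][True]
    return s

def rejection_estimate(query_var, query_val, evidence, N):
    """Keep only samples matching the evidence; estimate from the survivors."""
    accepted = matching = 0
    for _ in range(N):
        s = direct_sample()
        if all(s[e] == v for e, v in evidence.items()):
            accepted += 1
            matching += (s[query_var] == query_val)
    return matching / accepted, accepted / N

random.seed(0)
est, accept_rate = rejection_estimate("Rain", True, {"WetGrass": True}, 50_000)
# Exact: P(Rain | WetGrass) = 0.18/0.26 ≈ 0.692; accept_rate ≈ P(WetGrass) = 0.26
print(round(est, 3), round(accept_rate, 3))
```

Here about 74% of the work is thrown away even though the evidence is not rare; with P(E=e) = 0.001 the waste would be a thousandfold.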
Likelihood Weighting
Likelihood Weighting (LW) fixes evidence variables to their observed values (no sampling) and assigns each sample a weight proportional to the probability of the evidence occurring given the sampled non-evidence variable values. Every sample contributes — no sample is wasted.
Likelihood weight: product of CPT probabilities for each evidence variable given its sampled parent values. High-weight samples are more consistent with the observed evidence.
Likelihood Weighting — the key GATE DA algorithm to know
```python
def lw_sample(bn, evidence, topological_order):
    """One likelihood-weighted sample.
    Evidence variables are FIXED to observed values; weight accumulates."""
    sample = {}
    weight = 1.0
    for var in topological_order:
        parent_vals = tuple(sample[p] for p in bn.parents[var])
        if var in evidence:
            # Fix to observed value; multiply weight by P(E_i=e_i | parents)
            e_val = evidence[var]
            sample[var] = e_val
            weight *= bn.cpt[var][parent_vals][e_val]
        else:
            # Sample non-evidence variable normally from its CPT
            prob_true = bn.cpt[var][parent_vals][True]
            sample[var] = (random.random() < prob_true)
    return sample, weight

def lw_estimate(bn, query_var, query_val, evidence, N=10000):
    """P(query_var=query_val | evidence) via likelihood weighting."""
    weighted_count = total_weight = 0.0
    for _ in range(N):
        s, w = lw_sample(bn, evidence, bn.topological_order)
        total_weight += w
        if s[query_var] == query_val:
            weighted_count += w
    return weighted_count / total_weight  # normalized weighted estimate
```

The LW estimate converges: lim_{N→∞} (Σᵢ w(xᵢ)·𝟙[Qᵢ=q]) / (Σᵢ w(xᵢ)) = P(Q=q|E=e). Key caveat: evidence variables downstream of the query are still handled consistently, because the weights adjust for the constraint they impose — but downstream evidence does not guide the sampling of upstream variables, so many samples may end up with tiny weights.
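Running the LW estimator on a hypothetical toy network (Rain → WetGrass, values invented; the sampler is restated inline so the snippet is self-contained) reproduces the exact posterior.

```python
import random

parents = {"Rain": [], "WetGrass": ["Rain"]}
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {(True,): {True: 0.9, False: 0.1},
                 (False,): {True: 0.1, False: 0.9}},
}
order = ["Rain", "WetGrass"]

def lw_sample(evidence):
    """Mirror of lw_sample above: fix evidence, accumulate its CPT weight."""
    s, w = {}, 1.0
    for var in order:
        pv = tuple(s[p] for p in parents[var])
        if var in evidence:
            s[var] = evidence[var]
            w *= cpt[var][pv][s[var]]
        else:
            s[var] = random.random() < cpt[var][pv][True]
    return s, w

random.seed(0)
num = den = 0.0
for _ in range(20_000):
    s, w = lw_sample({"WetGrass": True})
    den += w
    num += w * (s["Rain"] is True)
# Exact: P(Rain | WetGrass) = 0.2*0.9 / (0.2*0.9 + 0.8*0.1) ≈ 0.692
print(round(num / den, 3))
```

Note that every sample contributes: samples with Rain=True get weight 0.9, samples with Rain=False get weight 0.1, and the normalized ratio converges to the posterior.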
Weight degeneracy problem
When there are many evidence variables or the evidence has low probability, most samples receive near-zero weight and a few samples dominate the estimate. This is called weight degeneracy. Gibbs sampling is more robust in this situation.
Gibbs Sampling (MCMC)
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. It starts from a random complete assignment of all non-evidence variables and then iteratively picks a non-evidence variable Xᵢ and samples a new value from P(Xᵢ | MarkovBlanket(Xᵢ)) while keeping all other variables fixed. The chain's stationary distribution is P(X|E=e).
Gibbs full conditional: the probability of Xᵢ given its Markov blanket. mb(Xᵢ) = parents, children, and children's other parents. Only local CPTs are needed — the rest of the network is irrelevant given the blanket.
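A sketch of a Gibbs sampler on a made-up three-node chain A → B → C (all CPT values invented for illustration) shows that each update needs only the variable's own CPT row and its children's rows.

```python
import random

parents = {"A": [], "B": ["A"], "C": ["B"]}
cpt = {
    "A": {(): {True: 0.3, False: 0.7}},
    "B": {(True,): {True: 0.8, False: 0.2}, (False,): {True: 0.2, False: 0.8}},
    "C": {(True,): {True: 0.9, False: 0.1}, (False,): {True: 0.1, False: 0.9}},
}
children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}

def p_true_given_blanket(var, state):
    """P(var=True | Markov blanket): own CPT row times each child's CPT row."""
    w = {}
    for val in (True, False):
        s = dict(state, **{var: val})
        w[val] = cpt[var][tuple(s[p] for p in parents[var])][val]
        for ch in children[var]:
            w[val] *= cpt[ch][tuple(s[p] for p in parents[ch])][s[ch]]
    return w[True] / (w[True] + w[False])

def gibbs(evidence, n_samples, burn_in=500, seed=0):
    """Estimate P(A=True | evidence) from post-burn-in Gibbs sweeps."""
    rng = random.Random(seed)
    free = [v for v in parents if v not in evidence]
    state = dict(evidence, **{v: rng.random() < 0.5 for v in free})
    count = 0
    for t in range(burn_in + n_samples):
        for v in free:  # resample each non-evidence variable in turn
            state[v] = rng.random() < p_true_given_blanket(v, state)
        if t >= burn_in:
            count += state["A"]
    return count / n_samples

# Exact answer by enumeration: P(A=True | C=True) = 0.222/0.404 ≈ 0.5495
est = gibbs({"C": True}, n_samples=20_000)
print(round(est, 3))
```

Evidence variables are simply clamped and never resampled; the free variables' updates depend only on their Markov blankets, never on the full joint.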
| Algorithm | Sample type | Every sample used? | Key limitation |
|---|---|---|---|
| Direct sampling | i.i.d. from prior | Yes (all) | Cannot condition on evidence |
| Rejection sampling | i.i.d. from posterior | No (rejects non-matching) | Wastes samples when P(E) is small |
| Likelihood weighting | Weighted from prior | Yes (all, with weights) | Weight degeneracy for low P(E) |
| Gibbs sampling | Correlated Markov chain | Yes (after burn-in) | Correlated samples; needs burn-in; slow mixing |
Burn-in and mixing time
The Gibbs chain takes a burn-in period before converging to the target distribution P(X|E=e). Samples during burn-in are discarded. After burn-in, samples are correlated (not i.i.d.), increasing variance. Mixing time — how long until the chain "forgets" its starting state — depends on the network structure. Slowly mixing chains require many more samples for accurate estimates.
GATE PYQ Spotlight
GATE DA Sample — Likelihood weights calculation (NAT, 2-mark)
A Bayesian network has variables A, B, C, D (all Bernoulli). Evidence: C=¬c, D=¬d. Four samples are generated using LW, with the non-evidence variables assigned s₁=(¬a,¬b), s₂=(¬a,b), s₃=(¬a,¬b), s₄=(¬a,b); C and D are fixed to their evidence values in every sample. Compute w(sᵢ) = P(C=¬c | Parents(C) in sᵢ) × P(D=¬d | Parents(D) in sᵢ) for each sample. To estimate P(A=¬a|¬c,¬d): since all four samples have A=¬a, the weighted estimate is Σw(sᵢ)/Σw(sᵢ) = 1.0. This GATE question type requires careful lookup of CPT values and weight multiplication.
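The weight arithmetic can be sketched numerically. The question above elides the actual CPTs and structure, so the numbers and parent assignments below (C depends on A, D depends on B) are invented purely to show the mechanics; the real exam supplies these values.

```python
# Hypothetical CPT rows (NOT from the question): probability of the
# evidence value given the relevant parent's sampled value.
p_not_c = {False: 0.6, True: 0.2}  # P(C=¬c | A), keyed by A's value
p_not_d = {False: 0.7, True: 0.3}  # P(D=¬d | B), keyed by B's value

# The four sampled (A, B) assignments; False encodes ¬a / ¬b.
samples = [(False, False), (False, True), (False, False), (False, True)]

weights = [p_not_c[a] * p_not_d[b] for a, b in samples]
# weights ≈ [0.42, 0.18, 0.42, 0.18]

# Weighted estimate of P(A=¬a | ¬c, ¬d): numerator sums weights of
# samples with A=¬a, denominator sums all weights.
num = sum(w for (a, _), w in zip(samples, weights) if a is False)
est = num / sum(weights)
print(est)  # 1.0 — every sample has A=¬a, as in the question
```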
GATE DA — Comparing sampling methods (1-mark)
Which statement about Gibbs sampling compared to Likelihood Weighting is TRUE? (A) Gibbs generates independent samples (B) Gibbs requires the full joint distribution (C) Gibbs is more robust when likelihood weights become very small (D) Gibbs cannot be used when evidence variables have children Answer: (C). When LW suffers from weight degeneracy, Gibbs sampling is the preferred alternative. Gibbs samples are NOT independent — they form a Markov chain (A is false). Gibbs only needs local CPTs (B is false). (D) is false — evidence variables are simply fixed; their children are sampled normally.
Practice questions
- What is the core idea of MCMC (Markov Chain Monte Carlo) sampling? (Answer: MCMC: construct a Markov chain whose stationary distribution equals the target distribution P(x). Sample from the chain by running it long enough to 'mix' (reach stationarity). Then use samples to estimate expectations: E[f(X)] ≈ (1/N)Σf(xᵢ). The key: we don't need to compute P(x) normalised — only need to evaluate P(x) up to a constant (the Metropolis acceptance criterion uses ratios P(x')/P(x) where the constant cancels).)
- What is the difference between the Metropolis-Hastings algorithm and Gibbs sampling? (Answer: Metropolis-Hastings: propose x' from any proposal distribution Q(x'|x); accept with probability min(1, P(x')Q(x|x') / P(x)Q(x'|x)). General — works for any distribution. Gibbs sampling: a special case of MH where each variable is sampled from its full conditional P(xᵢ|x_{-i}). Acceptance rate = 1 always (no rejection). Efficient when full conditionals are easy to sample (conjugate priors in Bayesian models). Requires knowing all full conditionals.)
- What is the 'burn-in' period in MCMC and how do you determine its length? (Answer: Burn-in: the initial samples before the chain reaches stationarity — discarded because they depend on the initialisation rather than the target distribution. How long: run multiple chains from different starting points; burn-in ends when chains converge to the same distribution (Gelman-Rubin diagnostic R̂ < 1.1). Alternatively: monitor trace plots for when autocorrelation drops and the chain mixes. Typical burn-in: 10–50% of total samples for well-mixed chains.)
- What is variational inference and how does it differ from MCMC? (Answer: Variational inference (VI): approximate P(z|x) with a simpler distribution Q(z; λ) from a tractable family (e.g., mean-field Gaussian). Optimise λ to minimise KL(Q||P) — turns inference into an optimisation problem. MCMC: generates exact (asymptotically) samples from P(z|x) but can be slow to converge. VI: faster (gradient-based optimisation), scalable to large datasets, gives a closed-form approximation. MCMC: exact in the limit but slow. Modern deep learning (VAEs, normalising flows) uses VI for scalable approximate inference.)
- What is the effective sample size (ESS) in MCMC and why is it important? (Answer: MCMC samples are autocorrelated — consecutive samples are not independent. ESS = N / (1 + 2Σ ρ_k) where ρ_k is autocorrelation at lag k. ESS << N means you have fewer independent samples than actual samples. Example: 10,000 MCMC samples with ESS=500 provides the same statistical power as 500 independent samples. Report ESS, not raw N. Improve ESS: better proposal distributions, HMC (Hamiltonian Monte Carlo) uses gradient information to take large steps and reduces autocorrelation dramatically.)
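The ESS formula from the last answer can be sketched as a small estimator, truncating the autocorrelation sum once it dies out (a common heuristic; the 0.05 cutoff here is an arbitrary choice for illustration).

```python
import random

def ess(samples):
    """Effective sample size: N / (1 + 2 * sum of autocorrelations rho_k)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0:
        return float(n)
    tau = 1.0
    for k in range(1, n):
        rho = sum((samples[i] - mean) * (samples[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho <= 0.05:  # truncate once autocorrelation has died out
            break
        tau += 2 * rho
    return n / tau

# Correlated AR(1) chain: x_{t+1} = 0.9*x_t + noise -> ESS far below N.
rng = random.Random(1)
x, chain = 0.0, []
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0, 1)
    chain.append(x)

# Independent draws for contrast -> ESS close to N.
rng2 = random.Random(2)
iid = [rng2.gauss(0, 1) for _ in range(2000)]

print(round(ess(chain)), round(ess(iid)))
```

For the AR(1) chain the theoretical factor is (1+0.9)/(1-0.9) = 19, so 2,000 correlated samples carry roughly the statistical weight of 100 independent ones.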