
Approximate Inference: Sampling Methods

When exact answers are too expensive, estimate probabilities by generating random samples from the network.


Definition

Approximate inference methods estimate posterior probabilities in Bayesian networks when exact inference (e.g., variable elimination) is intractable due to large treewidth or network size. The main sampling-based approaches are: Direct (Prior) Sampling, Rejection Sampling, Likelihood Weighting (LW), and Markov Chain Monte Carlo (MCMC) methods including Gibbs Sampling. GATE DA tests the mechanics of each method, their relative efficiency, when each is used, and numerical computations involving likelihood weights.

Why approximate inference?

Exact inference is #P-hard in general (harder than NP) for arbitrary Bayesian networks. When the network has many variables, large domains, or dense connectivity, variable elimination becomes impractical. Sampling-based methods trade exact answers for scalability: they converge to the true distribution as the number of samples N grows.
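As a toy illustration of that convergence, here is a minimal Monte Carlo sketch; the Bernoulli target and the helper name are invented for illustration and belong to no particular network:

```python
import random

def mc_estimate(p_true, n, seed=0):
    """Estimate P(X=1) for a Bernoulli(p_true) variable from n samples."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < p_true)
    return hits / n

# Monte Carlo error typically shrinks like O(1/sqrt(N)):
# quadrupling the sample count roughly halves the error.
for n in (100, 10_000, 1_000_000):
    print(n, abs(mc_estimate(0.3, n) - 0.3))
```

The same 1/√N behaviour governs every sampling method below; they differ in how (and whether) they handle evidence.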

| Method | Handles evidence? | Efficiency | Sample independence | Error type |
| --- | --- | --- | --- | --- |
| Direct sampling | No | High (fast samples) | i.i.d. | Monte Carlo estimation error O(1/√N) |
| Rejection sampling | Yes, by rejecting non-matching samples | Low for rare evidence | i.i.d. (conditioned) | Unbiased; high variance for low P(E) |
| Likelihood weighting | Yes, via importance weights | Better than rejection | i.i.d. | Consistent; degrades with many evidence vars |
| Gibbs sampling (MCMC) | Yes, naturally incorporated | Best for complex evidence | Correlated (Markov chain) | Asymptotically unbiased; burn-in needed |

Direct (Prior) Sampling

Direct sampling generates complete assignments of all network variables by sampling each variable in topological order from its CPT given already-sampled parent values. No evidence is incorporated — this samples from the prior distribution P(X₁,…,Xₙ).

Direct sampling from a Bayesian network

import random

def direct_sample(bn, topological_order):
    """Generate one complete sample from the prior distribution P(X)."""
    sample = {}
    for var in topological_order:
        parents = bn.parents[var]
        parent_vals = tuple(sample[p] for p in parents)
        # Sample from P(var | parent_vals)
        prob_true = bn.cpt[var][parent_vals][True]
        sample[var] = (random.random() < prob_true)
    return sample

def estimate_prior_probability(bn, query_var, query_val, N=10000):
    """Estimate P(query_var=query_val) from N direct samples."""
    count = sum(
        1 for _ in range(N)
        if direct_sample(bn, bn.topological_order)[query_var] == query_val
    )
    return count / N

Consistency: the estimated P(Q=q) converges to the true prior probability by the law of large numbers. However, direct sampling cannot handle evidence — to condition on E=e, a separate mechanism is needed.

Rejection Sampling

Rejection sampling handles evidence by generating samples from the prior and rejecting those that do not match the evidence E=e. P(Q=q|E=e) ≈ (# accepted samples where Q=q) / (# accepted samples).
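The accept/reject rule can be sketched in the same style as the direct-sampling code. Rather than hard-wiring the hypothetical `bn` object, this version takes any prior-sampler function (for example, a closure over `direct_sample`); that parameterisation is a design choice of this sketch, not part of the standard algorithm:

```python
def rejection_estimate(sample_prior, query_var, query_val, evidence, N=10000):
    """Estimate P(query_var=query_val | evidence) by rejection sampling.
    sample_prior() must return one complete assignment as a dict."""
    matched = accepted = 0
    for _ in range(N):
        s = sample_prior()
        # Reject any sample inconsistent with the evidence E=e.
        if any(s[e] != v for e, v in evidence.items()):
            continue
        accepted += 1
        matched += (s[query_var] == query_val)
    # Undefined if nothing matched the evidence (common when P(E) is tiny).
    return matched / accepted if accepted else float("nan")
```

Note that the estimate is built only from the accepted samples; everything else is thrown away, which is exactly the inefficiency discussed next.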

Critical limitation: rare evidence

If P(E=e) = 0.001, only about 1 in 1,000 prior samples is accepted. To collect 1,000 accepted samples you need roughly 1,000,000 total samples, which is extremely wasteful. Rejection sampling is therefore impractical when the evidence has low prior probability. Likelihood weighting solves this.

Likelihood Weighting

Likelihood Weighting (LW) fixes evidence variables to their observed values (no sampling) and assigns each sample a weight proportional to the probability of the evidence occurring given the sampled non-evidence variable values. Every sample contributes — no sample is wasted.

Likelihood weight: product of CPT probabilities for each evidence variable given its sampled parent values. High-weight samples are more consistent with the observed evidence.

Likelihood Weighting — the key GATE DA algorithm to know

def lw_sample(bn, evidence, topological_order):
    """One likelihood-weighted sample.
    Evidence variables are FIXED to observed values; weight accumulates."""
    sample = {}
    weight = 1.0

    for var in topological_order:
        parent_vals = tuple(sample[p] for p in bn.parents[var])

        if var in evidence:
            # Fix to observed value; multiply weight by P(E_i=e_i | parents)
            e_val = evidence[var]
            sample[var] = e_val
            weight *= bn.cpt[var][parent_vals][e_val]
        else:
            # Sample non-evidence variable normally from its CPT
            prob_true = bn.cpt[var][parent_vals][True]
            sample[var] = (random.random() < prob_true)

    return sample, weight

def lw_estimate(bn, query_var, query_val, evidence, N=10000):
    """P(query_var=query_val | evidence) via likelihood weighting."""
    weighted_count = total_weight = 0.0
    for _ in range(N):
        s, w = lw_sample(bn, evidence, bn.topological_order)
        total_weight += w
        if s[query_var] == query_val:
            weighted_count += w
    return weighted_count / total_weight   # normalized weighted estimate

The LW estimate is consistent: lim_{N→∞} (Σᵢ w(xᵢ)·𝟙[Qᵢ=q]) / (Σᵢ w(xᵢ)) = P(Q=q|E=e). One subtlety: evidence that lies downstream of a sampled variable influences the estimate only through the weights (the sampled values themselves ignore it), so the estimator remains consistent but its variance grows when much of the evidence sits deep in the network.

Weight degeneracy problem

When there are many evidence variables or the evidence has low probability, most samples receive near-zero weight and a few samples dominate the estimate. This is called weight degeneracy. Gibbs sampling is more robust in this situation.
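A standard importance-sampling diagnostic, not specific to this article, quantifies degeneracy as the effective sample size of the weights, ESS = (Σw)² / Σw²; the helper name below is invented:

```python
def weight_ess(weights):
    """Effective sample size of importance weights: (sum w)^2 / sum(w^2).
    Near len(weights) -> healthy; near 1 -> one sample dominates."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0

# Uniform weights: every sample counts fully.
print(weight_ess([0.2] * 100))           # 100.0
# Degenerate weights: one sample carries almost all the mass.
print(weight_ess([1.0] + [1e-6] * 99))   # close to 1
```

When this ESS collapses to a small fraction of N, the LW estimate is effectively based on only a handful of samples, which is the cue to switch to Gibbs sampling.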

Gibbs Sampling (MCMC)

Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method. It starts from a random complete assignment of all non-evidence variables and then iteratively: picks a non-evidence variable Xᵢ, samples a new value from P(Xᵢ | Markov_blanket(Xᵢ)) while keeping all other variables fixed. The chain's stationary distribution is P(X|E=e).

Gibbs full conditional: the probability of Xᵢ given its Markov blanket. mb(Xᵢ) = parents, children, and children's other parents. Only local CPTs are needed — the rest of the network is irrelevant given the blanket.
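A minimal sketch of the update for boolean variables, reusing the hypothetical bn interface from the earlier code (bn.parents, bn.cpt, bn.topological_order) and additionally assuming a bn.children mapping; cpt_prob is a helper invented here:

```python
import random

def cpt_prob(bn, var, state):
    """P(var = state[var] | parents as assigned in state), boolean vars."""
    pv = tuple(state[p] for p in bn.parents[var])
    p_true = bn.cpt[var][pv][True]
    return p_true if state[var] else 1.0 - p_true

def gibbs_estimate(bn, query_var, query_val, evidence, N=10000, burn_in=1000):
    """Estimate P(query_var=query_val | evidence) by Gibbs sampling."""
    # Random start; evidence variables are fixed and never resampled.
    state = {v: evidence.get(v, random.random() < 0.5)
             for v in bn.topological_order}
    hidden = [v for v in bn.topological_order if v not in evidence]
    count = 0
    for t in range(burn_in + N):
        for var in hidden:
            # Unnormalised P(var=x | Markov blanket): the variable's own
            # CPT entry times the CPT entries of each of its children.
            scores = {}
            for x in (True, False):
                state[var] = x
                p = cpt_prob(bn, var, state)
                for child in bn.children[var]:
                    p *= cpt_prob(bn, child, state)
                scores[x] = p
            z = scores[True] + scores[False]
            state[var] = random.random() < scores[True] / z
        if t >= burn_in:                      # discard burn-in samples
            count += (state[query_var] == query_val)
    return count / N
```

Only the CPT of the variable and of its children appear in the update, which is the Markov-blanket locality mentioned above.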

| Algorithm | Sample type | Every sample used? | Key limitation |
| --- | --- | --- | --- |
| Direct sampling | i.i.d. from prior | Yes (all) | Cannot condition on evidence |
| Rejection sampling | i.i.d. from posterior | No (rejects non-matching) | Wastes samples when P(E) is small |
| Likelihood weighting | Weighted from prior | Yes (all, with weights) | Weight degeneracy for low P(E) |
| Gibbs sampling | Correlated Markov chain | Yes (after burn-in) | Correlated samples; needs burn-in; slow mixing |

Burn-in and mixing time

The Gibbs chain takes a burn-in period before converging to the target distribution P(X|E=e). Samples during burn-in are discarded. After burn-in, samples are correlated (not i.i.d.), increasing variance. Mixing time — how long until the chain "forgets" its starting state — depends on the network structure. Slowly mixing chains require many more samples for accurate estimates.

GATE PYQ Spotlight

GATE DA Sample — Likelihood weights calculation (NAT, 2-mark)

A Bayesian network has variables A, B, C, D (all Bernoulli). Evidence: C=¬c, D=¬d. Four samples are generated by LW, with C and D fixed to their evidence values and the following assignments for the non-evidence variables: s₁=(¬a,¬b), s₂=(¬a,b), s₃=(¬a,¬b), s₄=(¬a,b). For each sample, compute w(sᵢ) = P(C=¬c | Parents(C) in sᵢ) × P(D=¬d | Parents(D) in sᵢ). To estimate P(A=¬a|¬c,¬d): since all four samples have A=¬a, the weighted estimate is Σw(sᵢ)/Σw(sᵢ) = 1.0. Questions of this type require careful lookup of CPT values and weight multiplication.
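To make the arithmetic concrete, here is a sketch with invented CPT values (NOT taken from any actual GATE paper), assuming the structure A→C and B→D, so Parents(C)={A} and Parents(D)={B}:

```python
# Hypothetical CPT entries for the evidence values:
p_not_c = {True: 0.2, False: 0.6}   # P(C=not-c | A=a) and P(C=not-c | A=not-a)
p_not_d = {True: 0.3, False: 0.8}   # P(D=not-d | B=b) and P(D=not-d | B=not-b)

samples = [          # (A, B) values; C, D are fixed to the evidence
    (False, False),  # s1 = (not-a, not-b)
    (False, True),   # s2 = (not-a, b)
    (False, False),  # s3 = (not-a, not-b)
    (False, True),   # s4 = (not-a, b)
]

weights = [p_not_c[a] * p_not_d[b] for a, b in samples]
print([round(w, 2) for w in weights])   # [0.48, 0.18, 0.48, 0.18]

# All four samples have A=not-a, so the normalised estimate of
# P(A=not-a | not-c, not-d) is sum(weights)/sum(weights) = 1.0,
# regardless of the particular CPT numbers.
est = sum(w for (a, _), w in zip(samples, weights) if not a) / sum(weights)
print(est)   # 1.0
```

With actual exam CPTs the weights change, but the mechanics (one CPT lookup per evidence variable, multiplied together, then normalised) are identical.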

GATE DA — Comparing sampling methods (1-mark)

Which statement about Gibbs sampling compared to Likelihood Weighting is TRUE?
(A) Gibbs generates independent samples
(B) Gibbs requires the full joint distribution
(C) Gibbs is more robust when likelihood weights become very small
(D) Gibbs cannot be used when evidence variables have children
Answer: (C). When LW suffers from weight degeneracy, Gibbs sampling is the preferred alternative. Gibbs samples are NOT independent; they form a Markov chain (A is false). Gibbs only needs local CPTs (B is false). (D) is false: evidence variables are simply fixed, and their children are sampled normally.

Practice questions

  1. What is the core idea of MCMC (Markov Chain Monte Carlo) sampling? (Answer: MCMC: construct a Markov chain whose stationary distribution equals the target distribution P(x). Sample from the chain by running it long enough to 'mix' (reach stationarity). Then use samples to estimate expectations: E[f(X)] ≈ (1/N)Σf(xᵢ). The key: we don't need to compute P(x) normalised — only need to evaluate P(x) up to a constant (the Metropolis acceptance criterion uses ratios P(x')/P(x) where the constant cancels).)
  2. What is the difference between the Metropolis-Hastings algorithm and Gibbs sampling? (Answer: Metropolis-Hastings: propose x' from any proposal distribution Q(x'|x); accept with probability min(1, P(x')Q(x|x') / P(x)Q(x'|x)). General — works for any distribution. Gibbs sampling: a special case of MH where each variable is sampled from its full conditional P(xᵢ|x_{-i}). Acceptance rate = 1 always (no rejection). Efficient when full conditionals are easy to sample (conjugate priors in Bayesian models). Requires knowing all full conditionals.)
  3. What is the 'burn-in' period in MCMC and how do you determine its length? (Answer: Burn-in: the initial samples before the chain reaches stationarity — discarded because they depend on the initialisation rather than the target distribution. How long: run multiple chains from different starting points; burn-in ends when chains converge to the same distribution (Gelman-Rubin diagnostic R̂ < 1.1). Alternatively: monitor trace plots for when autocorrelation drops and the chain mixes. Typical burn-in: 10–50% of total samples for well-mixed chains.)
  4. What is variational inference and how does it differ from MCMC? (Answer: Variational inference (VI): approximate P(z|x) with a simpler distribution Q(z; λ) from a tractable family (e.g., mean-field Gaussian). Optimise λ to minimise KL(Q||P) — turns inference into an optimisation problem. MCMC: generates exact (asymptotically) samples from P(z|x) but can be slow to converge. VI: faster (gradient-based optimisation), scalable to large datasets, gives a closed-form approximation. MCMC: exact in the limit but slow. Modern deep learning (VAEs, normalising flows) uses VI for scalable approximate inference.)
  5. What is the effective sample size (ESS) in MCMC and why is it important? (Answer: MCMC samples are autocorrelated — consecutive samples are not independent. ESS = N / (1 + 2Σ ρ_k) where ρ_k is autocorrelation at lag k. ESS << N means you have fewer independent samples than actual samples. Example: 10,000 MCMC samples with ESS=500 provides the same statistical power as 500 independent samples. Report ESS, not raw N. Improve ESS: better proposal distributions, HMC (Hamiltonian Monte Carlo) uses gradient information to take large steps and reduces autocorrelation dramatically.)
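The ESS formula from question 5 can be sketched directly, truncating the autocorrelation sum at the first non-positive lag (one common practical rule; both helper names are invented):

```python
def autocorr(xs, k):
    """Lag-k sample autocorrelation of the sequence xs."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + k] - mean) for i in range(n - k))
    return cov / var

def effective_sample_size(xs, max_lag=100):
    """ESS = N / (1 + 2 * sum of rho_k), truncated at the first
    non-positive autocorrelation."""
    n = len(xs)
    s = 0.0
    for k in range(1, min(max_lag, n - 1)):
        rho = autocorr(xs, k)
        if rho <= 0:
            break
        s += rho
    return n / (1 + 2 * s)
```

An uncorrelated sequence gives ESS close to N, while a strongly trending or slowly mixing chain gives ESS far below N, matching the "10,000 samples worth only 500" example in the answer above.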
