Practical Bayesian Statistics for Real-World Decisions: From Intuition to Implementation


Computation in Plain Language: MCMC and Variational Inference Concepts

Chapter 22


Why Computation Matters: When Posteriors Don’t Have a Closed Form

In many real-world Bayesian models, the posterior distribution cannot be written in a neat formula you can compute directly. The model might include multiple parameters, nonlinear relationships, hierarchical structure, or nonconjugate likelihoods. In those cases, “doing Bayesian inference” becomes a computational task: we need a practical way to approximate the posterior well enough to answer decision questions (for example, “What is the probability this parameter exceeds a threshold?” or “What outcomes should we expect next?”). Two workhorse approaches dominate modern practice: Markov chain Monte Carlo (MCMC), which approximates the posterior by drawing samples, and variational inference (VI), which approximates the posterior by optimization. This chapter explains both in plain language, focusing on what they produce, what can go wrong, and how to use them responsibly.

The Core Computational Goal: Expectations Under the Posterior

Most decision-relevant quantities can be written as an expectation under the posterior. Examples include posterior means, probabilities of events, and predictive averages. If θ represents parameters and y represents data, the posterior is p(θ|y). A decision quantity often looks like E[g(θ)] where g is some function (a threshold indicator, a cost function, a predicted outcome, and so on). When we cannot compute E[g(θ)] analytically, we approximate it. MCMC approximates E[g(θ)] using averages over samples from p(θ|y). VI approximates p(θ|y) with a simpler distribution q(θ) and then computes expectations under q(θ). The rest of the chapter is about how these approximations are constructed and how to judge whether they are good enough for decisions.

MCMC in Plain Language: Sampling to Approximate the Posterior

MCMC is a family of algorithms that produce a sequence of parameter values (samples) whose long-run distribution matches the posterior. The key idea is surprisingly simple: if you can generate many draws from the posterior, then you can approximate posterior quantities by Monte Carlo averages. For example, if θ(1), θ(2), …, θ(S) are samples, then E[g(θ)] ≈ (1/S)∑ g(θ(s)). The “MC” in MCMC is this Monte Carlo averaging. The “Markov chain” part is how we generate the samples: each new draw depends on the previous draw, but the chain is designed so that, after a warm-up period, the draws behave like they come from the posterior distribution.
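The Monte Carlo averaging idea can be sketched in a few lines of Python. This is a toy illustration, assuming we already have posterior draws; here they are simulated from a known normal distribution purely to stand in for real MCMC output.

```python
import random

random.seed(42)

# Pretend these are S posterior draws of theta (in reality, MCMC output).
S = 100_000
draws = [random.gauss(1.0, 0.5) for _ in range(S)]

# E[theta] is approximated by the sample mean of the draws;
# P(theta > 1.5) by the fraction of draws above 1.5.
post_mean = sum(draws) / S
p_exceeds = sum(1 for t in draws if t > 1.5) / S
```

The same pattern works for any g(θ): evaluate g on each draw and average.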

What a Markov Chain Is (Without the Math)

A Markov chain is a step-by-step random walk through parameter space. You start at some initial guess for θ. Then you repeatedly propose a new θ and decide whether to move there. Over time, the chain spends more time in regions of high posterior probability and less time in low-probability regions. If the algorithm is set up correctly and run long enough, the fraction of time the chain spends in any region approximates the posterior probability of that region. This is why histograms of MCMC draws can be interpreted as approximations to posterior marginals.

Metropolis(-Hastings): The Accept/Reject Random Walk

The Metropolis family is the simplest mental model for MCMC. At each step: propose a new θ* near the current θ, compare how plausible θ* is relative to θ under the posterior, and then accept the move with a probability that favors higher posterior density but still allows occasional downhill moves. Those downhill moves are crucial: they prevent the chain from getting stuck and ensure it can explore the full posterior. In practice, Metropolis can work well for low-dimensional problems but becomes inefficient as the number of parameters grows or as parameters are strongly correlated.
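The accept/reject loop can be written as a minimal sketch, assuming a one-dimensional parameter and an unnormalized log posterior (here a standard normal, chosen purely for illustration):

```python
import math
import random

random.seed(0)

def log_post(theta):
    # Unnormalized log posterior: a standard normal, for illustration only.
    return -0.5 * theta * theta

def metropolis(n_steps, step_size=1.0, theta0=3.0):
    theta = theta0
    samples = []
    for _ in range(n_steps):
        proposal = theta + random.gauss(0.0, step_size)
        # Accept with probability min(1, p(proposal) / p(theta)).
        # The log-uniform comparison lets occasional "downhill" moves through.
        if math.log(random.random()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples.append(theta)
    return samples

draws = metropolis(50_000)
kept = draws[5_000:]                       # discard warm-up draws
post_mean = sum(kept) / len(kept)          # should land near 0
post_var = sum((t - post_mean) ** 2 for t in kept) / len(kept)  # near 1
```

Note how the chain starts far from the typical set (theta0 = 3.0) and the warm-up discard removes that transient, mirroring the warm-up phase discussed later in this chapter.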


Hamiltonian Monte Carlo (HMC): Using Geometry to Move Efficiently

Modern Bayesian software often uses Hamiltonian Monte Carlo (and its adaptive variant NUTS). The plain-language intuition: instead of a random walk that takes small, jittery steps, HMC uses gradient information (how the log posterior changes with θ) to take longer, more directed moves that follow the shape of the posterior. This reduces the “random walk” behavior and can dramatically improve efficiency in higher dimensions. You do not need to derive the physics analogy to use it; the practical takeaway is that HMC tends to produce less correlated samples per unit time, which means more effective information from the same compute budget.

Practical Step-by-Step: Running MCMC for a Real Model

This workflow applies regardless of the specific Bayesian model, as long as you have a tool that can run MCMC (Stan, PyMC, NumPyro, Turing, etc.). The goal is not just to “get samples,” but to get samples you can trust for decisions.

Step 1: Choose Parameterization and Scaling That Helps Sampling

Many sampling problems are actually geometry problems. If parameters live on very different scales (one near 0.001, another near 10,000), the posterior can look like a long, thin canyon that is hard to explore. Practical fixes include standardizing predictors, using log transforms for positive parameters, and using non-centered parameterizations for hierarchical components when appropriate. Even if you do not change the model’s meaning, you can change how it is expressed to make the posterior easier to sample.
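A minimal sketch of these rescaling tricks, with illustrative numbers:

```python
import math
import statistics

# A predictor on a tiny scale: standardize it to roughly unit scale.
x = [0.001, 0.002, 0.0015, 0.003, 0.0025]
mu, sd = statistics.mean(x), statistics.stdev(x)
x_std = [(xi - mu) / sd for xi in x]

# A positive scale parameter: sample on the log scale, transform back after.
sigma = 10_000.0
log_sigma = math.log(sigma)     # unconstrained, better-behaved geometry
sigma_back = math.exp(log_sigma)
```

Neither transform changes what the model means; both change the geometry the sampler has to traverse.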

Step 2: Run Multiple Chains

Run at least 4 chains starting from different initial values. Multiple chains let you detect non-convergence: if chains disagree, the algorithm has not reliably explored the posterior. A single chain can look stable while still being stuck in one region. Multiple chains are a cheap insurance policy against false confidence.

Step 3: Allocate Warm-Up (Adaptation) and Sampling Iterations

MCMC typically has a warm-up phase where the algorithm tunes internal settings (step sizes, mass matrix, proposal scales) and moves from the initial point toward typical posterior regions. Those warm-up draws are not used for inference. After warm-up, you collect sampling draws. A practical starting point might be 1000 warm-up and 1000 sampling iterations per chain, then adjust based on diagnostics and the complexity of the model.

Step 4: Check Convergence Diagnostics You Can Act On

Key diagnostics include: R-hat (split potential scale reduction), effective sample size (ESS), and trace plots. R-hat close to 1 suggests chains are mixing and agreeing; values meaningfully above 1 indicate problems. ESS tells you how many independent draws your correlated chain is equivalent to; low ESS means high Monte Carlo error. Trace plots should look like “hairy caterpillars” that move around the same region across chains, not like drifting lines or chains stuck at different levels.
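Packages such as ArviZ compute these diagnostics for you; to show what R-hat measures, here is a simplified (non-split) version computed by hand on simulated chains. Real tools use the rank-normalized split variant, so treat this as a teaching sketch.

```python
import random
import statistics

def rhat(chains):
    """Simplified (non-split) potential scale reduction for one parameter."""
    n = len(chains[0])                         # draws per chain
    means = [statistics.mean(c) for c in chains]
    B = n * statistics.variance(means)         # between-chain variance
    W = statistics.mean([statistics.variance(c) for c in chains])  # within
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return (var_plus / W) ** 0.5

random.seed(1)
# Four chains exploring the same distribution -> R-hat near 1.
mixed = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Four chains stuck at different levels -> R-hat well above 1.
stuck = [[random.gauss(k, 1) for _ in range(1000)] for k in range(4)]

rhat_mixed = rhat(mixed)
rhat_stuck = rhat(stuck)
```

The intuition: if chains agree, between-chain variance adds almost nothing to within-chain variance, and the ratio is near 1.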

Step 5: Watch for Divergences and Pathologies (Especially in HMC)

With HMC/NUTS, divergences are a red flag that the sampler is struggling with regions of high curvature in the posterior. Divergences can invalidate inferences, especially for tail probabilities and uncertainty estimates. Practical responses include increasing the target acceptance rate, reparameterizing (often non-centered for hierarchical parts), tightening priors to rule out extreme regions that create pathological geometry, or rescaling variables.

Step 6: Quantify Monte Carlo Error for Decision Quantities

Even if the chain has converged, estimates have Monte Carlo error because you have a finite number of effective draws. For a decision probability like P(θ > c), you can compute it as the fraction of draws above c, and you can estimate its standard error roughly as sqrt(p(1-p)/ESS) where p is the estimated probability and ESS is the effective sample size for that indicator. If your decision threshold is tight (for example, act if probability exceeds 0.95), you want Monte Carlo error small enough that it cannot flip the decision.
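The standard-error check can be sketched as follows, with an illustrative threshold and simulated draws standing in for real (effective) posterior samples:

```python
import math
import random

random.seed(7)

S = 4_000                            # effective sample size (assumed)
draws = [random.gauss(0.0, 1.0) for _ in range(S)]

c = -1.6449                          # illustrative decision threshold
p_hat = sum(1 for t in draws if t > c) / S
se = math.sqrt(p_hat * (1 - p_hat) / S)

# If the rule is "act when P(theta > c) > 0.95", check whether a ~2*se
# buffer around the estimate could flip the decision.
could_flip = abs(p_hat - 0.95) <= 2 * se
```

When `could_flip` is true, the cheapest fix is usually to run longer and raise the effective sample size before acting.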

Step 7: Use Posterior Draws as a Computation Substrate

Once you have draws, you can compute almost anything by transforming them: predicted outcomes, counterfactual scenarios, cost functions, and constraint checks. A practical habit is to store posterior draws and write small functions g(θ) that map draws to decision metrics. This keeps the modeling and decision layers clean: the sampler gives you θ draws; your business logic computes actions from those draws.

What MCMC Gives You (and What It Costs)

MCMC’s main advantage is accuracy: under correct implementation and sufficient runtime, it targets the true posterior. Its main cost is computation time and the need for diagnostics. MCMC can be slow for very large datasets, models with expensive likelihoods, or situations where you need near-real-time updates. It can also be fragile when the posterior has multiple separated modes, strong funnel shapes, or heavy tails. In those cases, you may need careful modeling choices, more compute, or alternative approximations.

Variational Inference in Plain Language: Optimization Instead of Sampling

Variational inference replaces the sampling problem with an optimization problem. Instead of drawing from the true posterior p(θ|y), you pick a family of simpler distributions q(θ;λ) parameterized by λ (for example, a multivariate normal with mean and covariance). Then you choose λ to make q as close as possible to the true posterior. “Close” is typically measured by KL divergence in the direction KL(q||p). Practically, VI finds a q that is easy to work with and tries to match the posterior in regions where q puts probability mass.

The Key Intuition: VI Prefers a Compact Approximation

The KL direction used in standard VI penalizes q for putting mass where the true posterior has little mass, but it is less harsh about missing some posterior mass. This often leads to underestimation of uncertainty: q can be too narrow, especially in the tails or in multimodal posteriors. In decision terms, VI can be excellent for point estimates and some average predictions, but risky when tail probabilities, rare-event risks, or conservative uncertainty bounds matter.

Mean-Field VI: Fast but Often Overconfident

A common choice is mean-field VI, which assumes parameters are independent under q: q(θ)=∏ qj(θj). This makes optimization scalable but can miss posterior correlations. Missing correlations can distort uncertainty for derived quantities (like differences, ratios, or predictions), even if marginal means look reasonable. More expressive variational families (full-rank covariance, normalizing flows) can reduce this issue but increase compute and implementation complexity.
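A small arithmetic sketch shows how dropping a covariance distorts the uncertainty of a derived difference (the variance numbers are illustrative):

```python
# Posterior for (a, b): unit variances with a strong positive correlation.
var_a, var_b, cov_ab = 1.0, 1.0, 0.8

# True variance of the derived quantity a - b uses the covariance term:
true_var_diff = var_a + var_b - 2 * cov_ab        # 0.4

# A mean-field q treats a and b as independent, so the covariance vanishes:
mean_field_var_diff = var_a + var_b               # 2.0
```

Here mean-field overstates the variance of the difference fivefold; with negative correlation the error flips direction. Either way, the marginal means can look fine while derived-quantity uncertainty is badly wrong.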

Practical Step-by-Step: Using Variational Inference Responsibly

VI is attractive when you need speed, when datasets are large, or when you need to fit many models (for example, per segment, per product, per day). The workflow below emphasizes guardrails to avoid using VI in situations where it is likely to mislead.

Step 1: Decide Whether VI Is Appropriate for the Decision

Ask what your decision depends on. If you mainly need approximate posterior means and predictions near the center of the distribution, VI can be a good fit. If you need accurate tail probabilities (for example, “probability of loss worse than X”), strict uncertainty quantification, or you suspect multimodality, prefer MCMC or validate VI carefully against MCMC on a subset.

Step 2: Choose a Variational Family That Matches the Posterior’s Shape

Start with a baseline (mean-field or diagonal covariance) for speed, but be ready to upgrade if diagnostics suggest mismatch. If parameters are known to be correlated (common in regression-like models, latent factor models, or hierarchical structures), consider full-rank Gaussian VI or structured VI that preserves key dependencies. If the posterior is skewed or heavy-tailed, consider richer families (flows) if available.

Step 3: Optimize the ELBO and Monitor Stability

VI typically maximizes the evidence lower bound (ELBO). In practice, you run an optimizer (often stochastic gradient ascent) and watch the ELBO curve. It should increase and then stabilize. Instability, oscillation, or sensitivity to learning rates can indicate a difficult posterior geometry or a variational family that cannot represent the posterior well. Run multiple random initializations; if they converge to different ELBO values or different parameter estimates, the approximation may be unreliable.

Step 4: Perform Predictive Checks Using the Variational Posterior

Even if the ELBO looks good, you still need to check whether the fitted model generates realistic data. Use draws from q(θ) to simulate predictions and compare them to observed patterns. If predictive behavior is wrong, the issue might be the model, the approximation, or both. VI can sometimes hide model problems by producing overly confident parameter estimates that still yield poor predictive fit.

Step 5: Calibrate Uncertainty for Decision Thresholds

If your decision uses a probability threshold (for example, act if P(metric > target) > 0.9), test how sensitive that probability is to the inference method. A practical approach is to run MCMC on a smaller sample of the data or on a simplified version of the model and compare key decision quantities. If VI consistently produces higher certainty (more extreme probabilities) than MCMC, treat VI probabilities as optimistic and adjust thresholds or switch methods.

Step 6: Use VI as an Initialization or Screening Tool

A common pragmatic pattern is: use VI to get fast approximate posteriors across many candidates, then run MCMC on the finalists or on the cases where the decision is close. VI can also provide good initial values for MCMC, reducing warm-up time. This hybrid approach often delivers most of VI’s speed while keeping MCMC’s reliability where it matters.

MCMC vs VI: How to Choose for Real-World Decisions

Choosing between MCMC and VI is less about ideology and more about matching the tool to the decision’s risk profile and operational constraints. MCMC is typically the default when you need trustworthy uncertainty and can afford the compute. VI is typically the default when you need speed, scale, or frequent refits, and when approximate uncertainty is acceptable. A useful mental model: MCMC spends computation to reduce approximation bias; VI accepts approximation bias to reduce computation time.

Decision-Focused Comparison Checklist

  • Need accurate tail risk? Prefer MCMC; VI may underestimate tails.
  • Need fast iteration or many fits? Prefer VI; validate on a subset with MCMC.
  • Posterior likely multimodal? MCMC can still struggle but is generally more faithful; VI often collapses to one mode.
  • Strong parameter correlations? MCMC (especially HMC) handles this well; mean-field VI can misrepresent uncertainty unless structured.
  • Decision is near a threshold? Use the more reliable method or increase compute until Monte Carlo/approximation error cannot flip the decision.

Reading Diagnostics as Decision Risk Signals

Diagnostics are not just technicalities; they are signals about decision risk. For MCMC, poor mixing, high R-hat, low ESS, or divergences mean your computed probabilities and intervals may be wrong in ways that are hard to predict. For VI, a high ELBO does not guarantee accurate uncertainty; sensitivity to initialization, mismatch in predictive checks, or large differences versus MCMC on a subset are warnings. A practical habit is to translate diagnostic issues into business language: “Our uncertainty estimate may be too narrow,” “Tail risk may be underestimated,” or “The probability of exceeding the target may be off by several percentage points.”

Worked Example Pattern: Computing a Decision Probability from Posterior Output

Regardless of whether you used MCMC or VI, you often end up with draws: either true posterior draws (MCMC) or approximate draws from q(θ) (VI). Suppose your decision depends on whether a profit metric exceeds zero under uncertainty. You define a function profit(θ) that maps parameters to profit under your scenario. Then compute the probability of positive profit as the fraction of draws where profit(θ)>0. You can also compute expected profit as the average profit(θ) over draws, and expected loss relative to an alternative as the average of max(0, profit_alt(θ)-profit(θ)). This “define a function, evaluate on draws, average” pattern is the practical bridge from Bayesian computation to decisions.

Pseudocode Template: From Draws to Decision Metrics

# Inputs: a list `thetas` of S parameter draws theta[s], s = 1..S (from MCMC or VI).
# revenue(theta) and cost(theta) are scenario-specific functions you define.
import math

def profit(theta):
    # compute predicted revenue, cost, etc. for this draw
    return revenue(theta) - cost(theta)

# Compute decision quantities
profits = [profit(theta) for theta in thetas]
S = len(profits)
p_positive = sum(1 for p in profits if p > 0) / S
expected_profit = sum(profits) / S
downside_risk = sum(max(0.0, -p) for p in profits) / S

# Monte Carlo uncertainty (use ESS if available; otherwise S as a rough proxy)
se_p = math.sqrt(p_positive * (1 - p_positive) / S)

Common Failure Modes and Practical Fixes

MCMC Failure Mode: Chains Don’t Mix

If trace plots show chains stuck or moving slowly, you may have strong correlations, poor scaling, or a challenging posterior shape. Practical fixes: standardize inputs, reparameterize, tighten priors to remove unrealistic extremes, increase warm-up, or use HMC/NUTS if you were using a simpler sampler. If the posterior is multimodal, consider whether the model is identifiable; sometimes the right fix is to change the model rather than to push the sampler harder.

MCMC Failure Mode: Divergences in HMC

Divergences often indicate a “funnel” geometry or extreme curvature. Practical fixes: non-centered parameterizations for hierarchical components, stronger priors on scale parameters, increasing target acceptance, and checking whether the likelihood is creating near-deterministic constraints. Treat unresolved divergences as a serious warning for uncertainty estimates.

VI Failure Mode: Overconfident Uncertainty

If VI yields very tight intervals or extreme probabilities, especially compared to MCMC on a subset, assume underestimation of uncertainty. Practical fixes: use a richer variational family, add structure to capture correlations, or switch to MCMC for the final decision. Another pragmatic fix is to use VI for ranking options but require MCMC confirmation when the top options are close.

VI Failure Mode: Local Optima and Sensitivity to Initialization

If different runs produce different answers, the optimization landscape may be tricky. Practical fixes: multiple restarts, better initialization (possibly from a simpler model), smaller learning rates, or switching to a more expressive variational family. If instability persists, treat VI results as exploratory rather than decision-grade.

Operational Patterns: Making Computation Fit the Workflow

In real organizations, computation must fit time budgets and deployment constraints. MCMC can be scheduled (nightly refits, weekly refreshes) and used for high-stakes decisions. VI can power dashboards and rapid iteration. A robust pattern is tiered inference: run VI broadly for speed, monitor decision-critical metrics, and trigger MCMC automatically when a metric crosses a “needs high accuracy” zone (for example, when a probability is between 0.85 and 0.95 and the action threshold is 0.9). This treats computation as part of a decision pipeline rather than a one-off statistical exercise.
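The tiered pattern can be sketched as a small routing function; the names, thresholds, and zone width below are illustrative, not a real API:

```python
def needs_high_accuracy(p_vi, threshold=0.9, zone=0.05):
    """Escalate when the fast VI probability lands near the action threshold."""
    return abs(p_vi - threshold) <= zone

def decide(p_vi, run_mcmc, threshold=0.9):
    # run_mcmc: a callable that returns a higher-fidelity probability estimate
    # (e.g. an overnight MCMC refit). Only invoked when the decision is close.
    p = run_mcmc() if needs_high_accuracy(p_vi, threshold) else p_vi
    return p > threshold
```

For example, a VI probability of 0.92 against a 0.9 threshold falls inside the escalation zone and triggers the MCMC refit; a VI probability of 0.99 does not.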

Now answer the exercise about the content:

When a decision depends on an accurate tail probability (for example, a rare-event risk), which approach is typically the safer default and why?


MCMC is typically safer for tail-risk decisions because, with correct implementation and diagnostics, it targets the true posterior. Standard VI often prefers a compact q(θ), which can underestimate uncertainty and tail mass, making rare-event probabilities too optimistic.

Next chapter

Reading Diagnostics Without Heavy Math: Chains, Convergence, and Effective Sample Size
