What Diagnostics Are For (and What They Are Not)
When you fit a Bayesian model with MCMC, you do not get a single “answer.” You get draws (samples) from the posterior distribution. Diagnostics help you judge whether those draws are trustworthy enough to summarize and use for decisions. They do not prove the model is correct, and they do not guarantee your business decision is optimal. They answer a narrower question: did the computation likely explore the posterior distribution well enough that posterior summaries (means, intervals, probabilities, predictions) are stable?
In practice, you want diagnostics that are easy to read and that map to actionable next steps. This chapter focuses on three such diagnostics: chains (and how to read them), convergence (whether chains agree), and effective sample size (how much independent information you really have). We will keep the math light and emphasize what to look for, what can go wrong, and what to do next.
Chains: Multiple Attempts to Explore the Same Posterior
Most MCMC workflows run multiple chains. A chain is one long sequence of parameter values produced by the sampler. Each chain starts from a different initial point and then moves around the posterior. Multiple chains are like multiple independent “attempts” to explore the same landscape. If they all end up describing the same distribution, you gain confidence that the sampler found the right region and mixed well.
Why not run one very long chain? Because one chain can get stuck in a region (or drift slowly) and you might not notice. With multiple chains, disagreement becomes visible: one chain might sit in a different region, or show different variability, or have a different average. Those are red flags you can see quickly.
Trace Plots: The Single Most Useful Visual
A trace plot shows parameter value (vertical axis) versus iteration (horizontal axis) for each chain, usually overlaid in different colors. You are looking for “hairy caterpillars”: chains that wiggle rapidly and overlap each other, with no obvious trends. If the chains look like they are all drawing from the same fuzzy band, that is good.
- Good trace: chains overlap, no long-term upward/downward drift, no long flat segments, no sudden jumps that persist.
- Bad trace (non-mixing): one chain stays high while another stays low, or chains move very slowly, or you see long plateaus.
- Bad trace (stuck): a chain barely moves, suggesting the sampler cannot explore the posterior in that direction.
- Bad trace (multimodality): chains spend time in different “modes” (separate regions) and rarely move between them.
Practical tip: do not only inspect the main parameter of interest. Also inspect key hyperparameters (e.g., group-level standard deviations in hierarchical models), correlation parameters, and any parameters that control scale. These are common sources of sampling trouble.
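The chapter does not assume a particular tool, but if you work in Python, ArviZ makes these plots easy. The sketch below is a minimal, runnable example: the `mu` draws are synthetic stand-ins for whatever your sampler produced, arranged as a (chain, draw) array.

```python
# Minimal trace-plot sketch; the draws are synthetic stand-ins for real sampler output.
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Pretend we have 4 chains x 1,000 post-warmup draws of a parameter "mu".
mu_draws = rng.normal(loc=0.3, scale=0.1, size=(4, 1000))

idata = az.from_dict(posterior={"mu": mu_draws})

# One row per parameter: density on the left, the trace ("hairy caterpillar") on the right.
az.plot_trace(idata, var_names=["mu"])
plt.show()
```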
Warmup/Burn-in: Why the Beginning Looks Weird
Most samplers use an initial phase (often called warmup or adaptation) to tune step sizes and other internal settings. During warmup, the chain may move differently than later. Many tools discard warmup draws automatically. When you look at trace plots, make sure you are viewing post-warmup samples, or at least know where warmup ends. If the chain still shows drift after warmup, that is more concerning.
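If your tool hands back warmup and sampling iterations together in one array, discarding warmup is just slicing. Here is a minimal sketch, assuming draws stored as a (chain, draw) NumPy array and a known warmup length; most modern samplers do this for you automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose each chain recorded 1,000 warmup + 2,000 sampling iterations together,
# stored as shape (n_chains, n_total_draws); synthetic numbers stand in for real output.
n_warmup = 1_000
all_draws = rng.normal(size=(4, 3_000))

# Keep only post-warmup iterations before plotting or summarizing.
post_warmup = all_draws[:, n_warmup:]
print(post_warmup.shape)  # (4, 2000)
```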
Convergence Without Heavy Math: Do Chains Agree?
Convergence, in practical terms, means that the chains are sampling from the same target distribution. You do not need to memorize proofs. You need a reliable signal that the chains have “forgotten” their starting points and are now representative of the posterior.
R-hat (Split R-hat): The Agreement Score
R-hat (often written as R̂) compares how much chains vary within themselves to how much they vary from each other. If the chains are exploring the same distribution, between-chain and within-chain variability should match closely. Modern implementations use split R-hat, which splits each chain into two halves and treats the halves as separate chains; this makes the diagnostic sensitive to trends within a single chain as well as to disagreement between chains.
- Rule of thumb: R-hat very close to 1 is good. Many practitioners treat values above 1.01 as worth investigating, and above 1.05 as a clear warning.
- Interpretation: R-hat noticeably above 1 suggests chains are not mixing well or are exploring different regions.
- Common trap: R-hat can look fine even if you have too few effective samples (you can have agreement but still be noisy). That is why you also check effective sample size.
How to use it: scan a table of parameters and sort by highest R-hat. Then inspect trace plots for those parameters first. Often, a small subset of parameters reveals the core issue (funnel shapes, strong correlations, weak identifiability).
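As a sketch of that scan, assuming ArviZ: `az.summary` already reports R-hat and ESS per parameter, so sorting by the worst R-hat is one line. The built-in `centered_eight` example dataset stands in for your own fit.

```python
import arviz as az

# The built-in example dataset stands in for your own InferenceData object.
idata = az.load_arviz_data("centered_eight")

summary = az.summary(idata)  # includes r_hat, ess_bulk, ess_tail, and mcse columns
worst_first = summary.sort_values("r_hat", ascending=False)

# Inspect the top offenders, then pull up their trace plots first.
print(worst_first[["r_hat", "ess_bulk", "ess_tail"]].head(10))
```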
Rank Plots and Overlap: A Visual Check of Agreement
Some tools provide rank plots (or similar diagnostics) that show whether each chain contributes evenly across the distribution. If one chain tends to produce mostly low ranks and another mostly high ranks, they are not sampling the same distribution. You do not need to compute anything: you are looking for similar shapes across chains.
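In ArviZ this visual check is `az.plot_rank`; here is a minimal sketch using the same example dataset as above.

```python
import arviz as az
import matplotlib.pyplot as plt

idata = az.load_arviz_data("centered_eight")

# One panel per parameter; within a panel, each chain gets its own bars.
# Roughly uniform bars for every chain means the chains agree; skewed bars mean they do not.
az.plot_rank(idata, var_names=["mu", "tau"])
plt.show()
```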
When “Converged” Still Feels Wrong
Sometimes diagnostics say “fine,” but posterior summaries change noticeably when you rerun the model with a different random seed or slightly different settings. That is a practical sign you may not have enough effective samples or the posterior is difficult (highly correlated, heavy-tailed, or multimodal). Treat stability across reruns as an additional sanity check: if your key decision metric (e.g., probability a treatment beats control by at least a threshold) swings around, you need more reliable sampling.
Effective Sample Size (ESS): How Many Independent Draws You Really Have
MCMC draws are not independent. Each draw is related to the previous one because the chain moves gradually. If successive draws are highly correlated, then 10,000 draws might contain the same information as only a few hundred independent samples. Effective sample size (ESS) estimates how many independent draws your correlated samples are “worth.”
ESS matters because Monte Carlo error (the noise from finite sampling) shrinks as you get more effective samples. If ESS is low, your posterior mean, quantiles, and decision probabilities can be unstable. If ESS is high, those summaries are more precise.
Bulk ESS vs Tail ESS: Two Different Needs
Many tools report two ESS values per parameter: bulk ESS and tail ESS.
- Bulk ESS: how well the sampler explores the center of the distribution. This affects posterior means and medians.
- Tail ESS: how well the sampler explores the extremes (tails). This affects credible interval endpoints and risk calculations that depend on rare events (e.g., probability of a large loss).
If you care about a 95% credible interval, tail ESS is especially important. If you care about an expected value or a typical prediction, bulk ESS may be the main concern. In real decisions, you often care about tails because thresholds and worst-case costs live there.
What Counts as “Enough” ESS?
There is no universal magic number, but you can use practical targets:
- Minimum sanity: bulk ESS in the hundreds for key parameters (a common rule of thumb is roughly 100 per chain); tail ESS should not be far below that.
- Comfortable: bulk ESS in the thousands for key parameters, especially if decisions are sensitive.
- High-stakes thresholds: prioritize tail ESS; if tail ESS is low, your interval endpoints and tail probabilities can be noisy.
Also compare ESS to total draws. If you ran 4 chains × 2,000 post-warmup draws = 8,000 draws total, but ESS is only 200, you are wasting computation on highly correlated samples. That points to mixing problems or a difficult geometry.
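Here is a sketch of that comparison, again assuming ArviZ: compute bulk and tail ESS per parameter and divide by the total number of post-warmup draws to get a rough sampling efficiency.

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")  # stands in for your own fit

post = idata.posterior
total_draws = post.sizes["chain"] * post.sizes["draw"]

ess_bulk = az.ess(idata, method="bulk")
ess_tail = az.ess(idata, method="tail")

# A very low ratio (a few percent) means the chains are highly autocorrelated,
# even if R-hat looks fine.
for name in ["mu", "tau"]:
    b, t = float(ess_bulk[name]), float(ess_tail[name])
    print(f"{name}: bulk ESS {b:.0f} ({b / total_draws:.1%} of {total_draws} draws), "
          f"tail ESS {t:.0f}")
```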
Monte Carlo Standard Error (MCSE): The “Noise Bar” on Your Estimates
Some outputs include MCSE for posterior means or quantiles. MCSE is a direct measure of how much sampling noise remains. You can use it as a decision-focused diagnostic: if the MCSE is small compared to the effect size you care about, you are likely fine. If MCSE is large enough to change a decision, you need more effective samples or a better-behaved model.
Example: suppose you care whether a probability is above 0.90. If your estimated probability is 0.91 but the MCSE implies it could easily be 0.88 or 0.94 due to sampling noise, you should not treat 0.91 as stable.
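One hedged way to run that check yourself: treat "the quantity clears the threshold" as a 0/1 indicator per draw, estimate the indicator's ESS, and use the binomial-style Monte Carlo error sqrt(p(1-p)/ESS) as the noise bar. The sketch below assumes ArviZ, and the draws and names are illustrative.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(7)

# Illustrative posterior draws of a probability-like quantity, shape (chain, draw).
draws = rng.normal(loc=0.915, scale=0.01, size=(4, 2000))
threshold = 0.90

indicator = (draws > threshold).astype(float)  # 1 if the draw clears the threshold
p_hat = indicator.mean()

# ESS of the indicator itself, since that is the quantity whose mean we are estimating.
ess = float(az.ess(az.from_dict(posterior={"exceeds": indicator}), method="mean")["exceeds"])
mcse = np.sqrt(p_hat * (1 - p_hat) / ess)      # rough Monte Carlo standard error of p_hat

print(f"P(quantity > {threshold}) ~ {p_hat:.3f} +/- {mcse:.3f} (sampling noise)")
# If p_hat minus about two MCSEs dips below your decision bar, do not treat the estimate as stable.
```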
Step-by-Step Workflow: Reading Diagnostics in a Practical Order
Step 1: Confirm the Run Finished Cleanly
Before interpreting any numbers, check for warnings: divergent transitions, maximum tree depth warnings, numerical overflows, or messages about non-finite log probability. These are not “minor.” They often mean the sampler did not explore the posterior correctly. If you see them, treat diagnostics as compromised until addressed.
Step 2: Check Trace Plots for Key Parameters
Pick a small set of parameters tied to your decision and model stability: the main effect(s), key group-level standard deviations, and any correlation parameters. Look for overlap and stationarity (no drift). If you see a chain separated from others, do not proceed to interpret posterior summaries as final.
Step 3: Scan R-hat Across All Parameters
Sort parameters by R-hat in descending order. Investigate anything meaningfully above 1.00 (in practice, above roughly 1.01). Use trace plots to understand the pattern: slow mixing, stuck chains, or different modes.
Step 4: Check ESS (Bulk and Tail) for Decision-Critical Quantities
Focus on the parameters and derived quantities you will actually use: differences between groups, predicted outcomes, probabilities of exceeding thresholds, expected losses. Many tools let you compute ESS for transformed quantities too. If your decision uses a derived metric, diagnose that metric, not only the raw parameters.
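Here is a sketch of diagnosing a derived quantity, assuming ArviZ: build the derived draws chain by chain, wrap them like any other parameter, and summarize. The names `theta_treatment` and `theta_control` are illustrative, and the synthetic draws only make the snippet runnable.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(3)

# Illustrative posterior draws, shape (chain, draw); in practice pull these from idata.posterior.
theta_treatment = rng.normal(0.052, 0.004, size=(4, 2000))
theta_control = rng.normal(0.047, 0.004, size=(4, 2000))

# Build the derived quantity draw by draw, keeping the chain structure intact.
lift = theta_treatment - theta_control
idata_lift = az.from_dict(posterior={"lift": lift})

# r_hat, ess_bulk, ess_tail, and mcse for the quantity you actually decide on.
print(az.summary(idata_lift))
```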
Step 5: Check Stability by Rerunning (Optional but Powerful)
Rerun the model with a different seed (and ideally slightly more iterations). Compare key posterior summaries. If they move more than you can tolerate for the decision, you need more effective samples or a model/sampler adjustment.
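Mechanically, the comparison can be as simple as the sketch below. The `fake_fit` function is a stand-in for refitting your actual model with a different seed; the point is only to show how to line up two summary tables.

```python
import numpy as np
import arviz as az

def fake_fit(seed):
    """Stand-in for refitting your real model with a different seed; returns InferenceData."""
    rng = np.random.default_rng(seed)
    return az.from_dict(posterior={"lift": rng.normal(0.005, 0.002, size=(4, 2000))})

summaries = {seed: az.summary(fake_fit(seed)) for seed in (1, 2)}

# Compare the summaries that feed the decision; with a real model, a shift larger than
# you can tolerate means you need more effective samples or a better-behaved fit.
shift = (summaries[1]["mean"] - summaries[2]["mean"]).abs()
print(shift)
```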
Common Diagnostic Failure Patterns and What to Do
Pattern: Chains Overlap but Move Slowly (Low ESS)
You may see trace plots that overlap but look “sticky,” with long stretches where the chain drifts slowly. R-hat can still be near 1, but ESS will be low.
- What it means: high autocorrelation; the sampler is taking small steps in a correlated posterior.
- What to do: run longer (more post-warmup draws), consider reparameterization (often non-centered parameterizations in hierarchical models), standardize predictors, and check whether priors are creating extreme curvature.
Pattern: One Chain Looks Different (R-hat High)
One chain may sit higher, have different variance, or show a different mean. This often indicates the chain is stuck or exploring a different region.
- What it means: poor mixing, multimodality, or sensitivity to initialization.
- What to do: increase warmup, use better initial values, tighten or regularize priors to reduce extreme regions, re-express parameters to reduce correlations, and consider whether the model is weakly identified.
Pattern: Sudden Jumps and Then Long Plateaus
This can happen when the sampler struggles with geometry and occasionally finds a new region, then gets stuck again.
- What it means: difficult posterior geometry; sometimes related to funnels or near-boundary parameters.
- What to do: reparameterize, rescale parameters, and check for overly broad priors on scale parameters that allow extreme values.
Pattern: Tail ESS Is Much Worse Than Bulk ESS
Your center estimates look stable, but interval endpoints or tail probabilities are noisy.
- What it means: the sampler is not exploring extremes efficiently; tails may be heavy or constrained.
- What to do: run longer, consider stronger regularization if tails are unrealistically wide, and ensure your model is identifiable in the tail behavior (e.g., avoid redundant parameters).
Reading Diagnostics for Derived Quantities (Where Decisions Usually Live)
In real work, you rarely decide based on a raw coefficient alone. You decide based on a derived quantity: a predicted conversion rate for a segment, a difference in revenue, a probability that lift exceeds a minimum, or an expected loss under a cost model. Diagnostics should follow the same path.
Example: Probability of Exceeding a Threshold
Suppose your decision rule depends on P(lift > 0.5%). That probability is computed from posterior draws of lift. If lift has low tail ESS, the probability can be unstable because it depends on how well you sample near the threshold and in the tail beyond it.
- What to check: trace plot of lift (as a derived quantity), R-hat for lift if available, and tail ESS for lift.
- Practical step: compute the probability using each chain separately. If chain-wise probabilities differ meaningfully, you do not have stable sampling yet.
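A minimal sketch of that chain-wise check, assuming lift draws stored as a (chain, draw) array; the draws here are synthetic so the snippet runs.

```python
import numpy as np

rng = np.random.default_rng(11)

# Illustrative posterior draws of lift, shape (n_chains, n_draws).
lift = rng.normal(0.006, 0.003, size=(4, 2000))
threshold = 0.005

per_chain = (lift > threshold).mean(axis=1)   # one probability estimate per chain
pooled = (lift > threshold).mean()

print("per-chain:", np.round(per_chain, 3))
print("pooled:   ", round(pooled, 3))
# If the per-chain values straddle your decision cutoff, the pooled number is not yet trustworthy.
```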
Example: Predicted Outcome for a New Case
If you use posterior predictive draws to estimate a future KPI, check diagnostics for the predictive quantity too. Predictive distributions can be sensitive to tail behavior and to parameters that control variance.
- What to check: ESS for variance/scale parameters and tail ESS for predictions.
- Practical step: compare predictive quantiles across chains. Large differences suggest insufficient mixing in the parts of the posterior that drive uncertainty.
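The same chain-by-chain idea works for predictive quantiles; here is a sketch assuming posterior predictive draws shaped (chain, draw), synthetic here.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative posterior predictive draws of a future KPI, shape (n_chains, n_draws).
pred = rng.lognormal(mean=3.0, sigma=0.4, size=(4, 2000))

# 5th, 50th, and 95th percentiles per chain; rows = chains, columns = quantiles.
q = np.quantile(pred, [0.05, 0.50, 0.95], axis=1).T
print(np.round(q, 2))
# The outer columns (the tails) usually disagree across chains before the medians do.
```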
Practical Tactics to Improve Diagnostics (Without Turning This Into a Math Exercise)
Run More, But Only After You Fix Geometry Problems
Increasing iterations can help when the sampler is basically working but needs more effective samples. However, if you have divergences or clearly separated chains, “more draws” may just produce more biased draws. Fix the underlying issue first (reparameterize, rescale, regularize), then extend the run.
Standardize Inputs and Use Reasonable Scales
Many sampling problems come from parameters living on wildly different scales. Standardizing predictors (e.g., mean 0, standard deviation 1) and expressing outcomes in sensible units can improve mixing dramatically. This is not about mathematics; it is about giving the sampler a smoother landscape to explore.
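A minimal sketch of the standardization step with NumPy; `X` is an illustrative design matrix with two predictors on very different scales.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative predictors on very different scales (say, age in years and revenue in dollars).
X = np.column_stack([rng.normal(40, 12, 500), rng.normal(50_000, 20_000, 500)])

X_mean, X_sd = X.mean(axis=0), X.std(axis=0)
X_std = (X - X_mean) / X_sd   # each column now has mean 0 and standard deviation 1

# Fit the model on X_std; keep X_mean and X_sd so coefficients can be interpreted
# or back-transformed on the original scale afterwards.
```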
Reparameterize When You See Funnels or Sticky Hierarchies
Hierarchical models often create “funnel” shapes where some parameters become hard to sample when group-level variance is small. A common remedy is a non-centered parameterization. You do not need to derive it by hand to benefit; many Bayesian workflows document it as a modeling pattern. The diagnostic signal that you might need it is: low ESS for group-level standard deviations, divergent transitions, or chains that mix poorly for group effects.
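For concreteness, here is one common way to write the non-centered form, sketched in PyMC (assuming PyMC v4 or later is your modeling tool; the dimensions and names are illustrative). Instead of sampling group effects directly, you sample standardized offsets and then scale and shift them.

```python
import pymc as pm

# Non-centered parameterization: sample standardized group offsets, then scale and shift,
# instead of sampling the group effects directly from Normal(mu, tau).
with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)                    # population mean
    tau = pm.HalfNormal("tau", 1.0)                   # group-level standard deviation
    z = pm.Normal("z", 0.0, 1.0, shape=8)             # standardized group offsets
    theta = pm.Deterministic("theta", mu + tau * z)   # group effects, reconstructed

    # ... the likelihood for your data, written in terms of theta, goes here ...
```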
Strengthen Priors When the Posterior Has Unrealistic Extremes
Overly broad priors can allow extreme parameter values that are not plausible in your domain, creating heavy tails and difficult geometry. If diagnostics show poor tail exploration and your posterior includes absurd values, consider more informative (but still reasonable) priors that reflect domain constraints. This is not “cheating”; it is encoding real-world limits to stabilize inference.
A Minimal Diagnostic Checklist You Can Apply Every Time
Checklist: Chains, Convergence, ESS
- Chains: trace plots overlap and look stationary after warmup; no chain is stuck.
- Convergence: R-hat near 1 for all important parameters and derived quantities; investigate anything above ~1.01.
- ESS: bulk ESS and tail ESS are high enough for the decision; MCSE is small relative to the decision threshold or effect size.
- Derived quantities: diagnostics are checked for the metrics you will actually use (differences, probabilities, predictions).
- Stability: rerun with a new seed if the decision is sensitive; key summaries should not swing meaningfully.
This checklist is intentionally practical: it tells you what to look at, what “good” looks like, and how to connect diagnostics to decision stability. The goal is not perfect sampling aesthetics; the goal is posterior summaries you can rely on when making real-world choices.
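If it helps to automate the numeric parts of the checklist, here is a small sketch (assuming ArviZ) that flags parameters failing the R-hat and ESS rules of thumb; the thresholds are arguments so you can tighten them when the decision is high-stakes.

```python
import arviz as az

def flag_diagnostics(idata, rhat_max=1.01, ess_bulk_min=400, ess_tail_min=400):
    """Return the rows of the summary table that fail simple R-hat / ESS rules of thumb."""
    s = az.summary(idata)
    bad = s[(s["r_hat"] > rhat_max)
            | (s["ess_bulk"] < ess_bulk_min)
            | (s["ess_tail"] < ess_tail_min)]
    return bad[["r_hat", "ess_bulk", "ess_tail", "mcse_mean"]]

# Example with a built-in ArviZ dataset standing in for your own fit:
print(flag_diagnostics(az.load_arviz_data("centered_eight")))
```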