
Practical Bayesian Statistics for Real-World Decisions: From Intuition to Implementation


Mini Case Study: Diagnosing a Miscalibrated Model with Posterior Predictive Simulations

Chapter 21


What “miscalibration” looks like in a Bayesian predictive model

Miscalibration means your model’s predictive uncertainty does not match reality. In practice, you see it when events to which the model assigns 80–90% probability happen much less often than that, or when outcomes that should be “rare” under the model occur frequently. This is not the same as “the mean is a bit off.” A model can have an accurate average prediction and still be miscalibrated because it is too confident (predictive intervals too narrow) or not confident enough (intervals too wide). Posterior predictive simulations are a direct way to diagnose this: you simulate new datasets from the fitted model and compare those simulated datasets to what you actually observed. If the observed data looks systematically “impossible” or “unusual” compared to the simulations, the model is telling you it cannot reproduce key features of the world you care about.

In this chapter’s mini case study, we will use posterior predictive simulations to diagnose a model that appears to forecast customer support ticket volume well on average but fails operationally because it underestimates spikes. The goal is not to “prove the model wrong” in an abstract sense; the goal is to find the specific ways it is wrong so you can fix it in a targeted way (e.g., add overdispersion, allow day-of-week effects, or model bursty behavior).

Mini case study setup: Daily ticket counts that break the model

Suppose you run support operations for a SaaS product. You forecast daily inbound tickets to staff the team. You build a Bayesian count model using a Poisson likelihood with a log link and a few predictors: marketing email sent (yes/no), day-of-week, and active user count. You fit the model and it seems “fine” when you look at point predictions: the posterior mean of the expected count tracks the general level of tickets. But the operations team complains: “We keep getting slammed on certain days, and the model never warns us.”

Here is a simplified snapshot of 30 days of observed ticket counts (y). Notice the spikes: 62, 71, 58, 66. The typical day is around 20–35 tickets, but there are occasional surges.

Day:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Tickets y:
      22 25 21 28 24 26 23 27 29 31 26 24 62 28 25 27 30 29 71 26 24 23 58 28 27 29 31 66 25 24

The fitted Poisson regression produces posterior predictive intervals that are often too narrow. It might predict something like “tomorrow is 28 tickets, 90% interval [22, 35]” even on days where 60+ tickets occur. The question is: is this just bad luck, or is the model structurally incapable of generating those spikes at the frequency we observe?
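To make the checks in the rest of this chapter concrete, the code examples assume Python with NumPy, and the same 30-day series encoded as an array:

import numpy as np

# Observed daily ticket counts for days 1-30 (from the table above)
y = np.array([22, 25, 21, 28, 24, 26, 23, 27, 29, 31,
              26, 24, 62, 28, 25, 27, 30, 29, 71, 26,
              24, 23, 58, 28, 27, 29, 31, 66, 25, 24])
N_days = len(y)   # 30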


Posterior predictive simulations: the core idea in one sentence

Posterior predictive simulation means: sample parameters from the posterior, then simulate new outcomes from the likelihood using those parameters, repeat many times, and compare the simulated outcomes to the observed outcomes using diagnostics that reflect what you care about operationally.

Concretely, if your model is: y_i ~ Poisson(λ_i), log(λ_i) = X_i β, and you have posterior draws β^(s), then for each draw s you compute λ_i^(s) and simulate y_i_rep^(s) ~ Poisson(λ_i^(s)). The collection of replicated datasets {y_rep^(s)} is what the model believes is plausible after seeing the data. If the real data has patterns that are consistently in the tails of this distribution, you have a miscalibration problem.

Step-by-step: A practical workflow to diagnose miscalibration with simulations

Step 1: Decide what “good predictions” mean for the decision

Before plotting anything, define the failure mode. For staffing, the painful errors are usually underprediction of high-volume days. So you care about tail behavior: the maximum daily tickets, the number of days above a threshold (e.g., > 50), and the distribution of day-to-day variability. If you only check the mean, you can miss the operational risk.

  • Tail risk metric: max(y) over the period.
  • Surge frequency metric: count of days with y > 50.
  • Variability metric: standard deviation of y, or the 90th–10th percentile spread.
  • Calibration metric: how often observed y falls inside the model’s 50% and 90% predictive intervals.
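As a minimal sketch, the first three bullets can be bundled into one helper that works on the observed series and on each replicated series alike (the threshold of 50 comes from this case study; interval coverage needs the replications and is handled in Check D):

def decision_stats(counts, threshold=50):
    """Tail, surge, and spread summaries for one series of daily counts."""
    return {
        "max": int(counts.max()),                           # tail risk
        "surge_days": int((counts > threshold).sum()),      # days above threshold
        "sd": counts.std(ddof=1),                           # day-to-day variability
        "p90_p10_spread": np.quantile(counts, 0.9) - np.quantile(counts, 0.1),
    }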

Step 2: Fit the model and extract posterior draws

Fit your Bayesian Poisson regression (or whatever model you currently use). Then extract S posterior draws of the parameters. In practice S might be 1000–4000 draws after warmup. The key is that you will propagate parameter uncertainty into predictions; posterior predictive simulation is not “plug in the posterior mean and simulate once.”

At this stage, you also confirm that the model fitting itself is stable (e.g., chains mix, no divergent transitions if using HMC). But the focus of this chapter is not sampler diagnostics; it is predictive miscalibration.
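The workflow is tool-agnostic, but as one illustrative option, a sketch of the fit in PyMC might look like the following. Here X is an assumed N-by-P design matrix built from the predictors above (marketing email, day-of-week, active users), and y is the count array from the setup section; what matters is that you end up with an (S, P) array of posterior draws for β.

import pymc as pm

with pm.Model() as poisson_model:
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])   # weakly informative prior
    lam = pm.math.exp(pm.math.dot(X, beta))                         # log link: log(lambda_i) = X_i beta
    pm.Poisson("tickets", mu=lam, observed=y)
    idata = pm.sample(1000, tune=1000, chains=4)                    # 4 x 1000 = 4000 draws after warmup

post = idata.posterior["beta"].values         # shape (chains, draws, P)
beta_draws = post.reshape(-1, post.shape[-1]) # flatten to (S, P)
S = beta_draws.shape[0]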

Step 3: Generate replicated datasets from the posterior predictive distribution

For each posterior draw s, simulate a full replicated dataset y_rep^(s) of the same shape as your observed data (same days, same predictors). This produces a distribution over plausible 30-day sequences of ticket counts.

# Runnable NumPy version (beta_draws is the (S, P) array of posterior draws from Step 2)
rng = np.random.default_rng(seed=1)
y_rep = np.empty((S, N_days), dtype=int)
for s in range(S):
    lam_s = np.exp(X @ beta_draws[s])   # expected rate for each day under draw s
    y_rep[s] = rng.poisson(lam_s)       # one replicated 30-day dataset

Store y_rep as an S-by-N matrix. This is your simulation engine for diagnostics: any statistic you compute on the observed data can be computed on each replicated dataset to see if the observed statistic is typical under the model.

Step 4: Compare the observed data to simulated data using targeted checks

Instead of only plotting y versus predicted mean, compute decision-relevant summaries and compare them to their posterior predictive distributions.

Check A: Do we reproduce the maximum daily volume?

Compute T_max(y) = max_i y_i for the observed data. Then compute T_max(y_rep^(s)) for each simulation. If the observed max is far larger than almost all simulated maxima, the model cannot generate spikes of the observed magnitude.

In our data, max(y) = 71. Under a Poisson model with typical λ around 25–30, the probability of seeing 71 is astronomically small unless λ is also huge on that day. If your predictors do not create such a huge λ, the posterior predictive maxima will cluster far below 71.

# Example summary
T_max_obs = y.max()
T_max_rep = y_rep.max(axis=1)                # max of each replicated series
ppp_max = np.mean(T_max_rep >= T_max_obs)    # posterior predictive p-value style

If ppp_max is near 0 (e.g., < 0.01), you have a clear tail misfit: the model underestimates extremes.
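To quantify “astronomically small”: under a plain Poisson with λ = 30, the chance of a 71-ticket day can be computed directly (a quick check assuming SciPy is available):

from scipy import stats

# P(Y >= 71) when Y ~ Poisson(30): on the order of 1e-10,
# so a single such day is essentially impossible under this likelihood
tail_prob = stats.poisson.sf(70, mu=30)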

Check B: Do we reproduce how often surges happen?

Compute T_surge(y) = count(y_i > 50). In our 30 days, we have four surge days (62, 71, 58, 66), so T_surge_obs = 4. Under a Poisson model with moderate λ, you might see T_surge_rep = 0 almost always. Again, compare observed to simulated.

T_surge_obs = int(np.sum(y > 50))
T_surge_rep = np.sum(y_rep > 50, axis=1)     # surge days in each replicated series
ppp_surge = np.mean(T_surge_rep >= T_surge_obs)

This check is especially operational: staffing pain is driven by how many surge days you face, not just whether one extreme day is possible.

Check C: Do we reproduce the overall variability (overdispersion)?

Poisson implies Var(y_i | λ_i) = E(y_i | λ_i) for each i. Real ticket data often has variance larger than the mean because of unmodeled heterogeneity, correlated arrivals, incidents, or product outages. Compute T_sd(y) = sd(y) and compare to sd(y_rep^(s)). If observed sd is consistently larger, you have overdispersion relative to Poisson.

T_sd_obs = np.std(y, ddof=1)
T_sd_rep = np.std(y_rep, axis=1, ddof=1)
ppp_sd = np.mean(T_sd_rep >= T_sd_obs)

Overdispersion is a common root cause of miscalibration: the model is too confident because it assumes too little randomness around the mean.

Check D: Interval coverage calibration (are predictive intervals honest?)

For each day i, compute the model’s posterior predictive interval for y_i (e.g., 90% interval from the simulated y_rep draws at that i). Then check what fraction of observed y_i fall inside those intervals. If you target 90% intervals but only cover 70% of observations, the model is overconfident.

# 90% posterior predictive interval for each day, then empirical coverage
lo = np.quantile(y_rep, 0.05, axis=0)
hi = np.quantile(y_rep, 0.95, axis=0)
covered = (y >= lo) & (y <= hi)              # per-day indicator
coverage = covered.mean()

Coverage can be computed overall and also conditional on predicted risk. For example, check coverage separately for days where the predicted mean is high versus low. Miscalibration often concentrates in certain regimes (e.g., high-load days).
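A minimal way to split coverage by regime, reusing covered from the block above and using the posterior predictive mean as a stand-in for predicted load (a simplification; you could equally split on a predictor such as the marketing-email flag):

pred_mean = y_rep.mean(axis=0)                 # model's expected load per day
high_load = pred_mean >= np.median(pred_mean)  # top half of predicted days
coverage_high = covered[high_load].mean()
coverage_low = covered[~high_load].mean()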

Interpreting the diagnostics: what the model is telling you

Suppose your posterior predictive results look like this:

  • Observed max = 71, but simulated maxima are typically 40–48, with ppp_max ≈ 0.00.
  • Observed surge days > 50: 4, but simulated surge days are almost always 0, with ppp_surge ≈ 0.00.
  • Observed sd(y) = 14.5, but simulated sd is around 6–8, with ppp_sd ≈ 0.00.
  • Nominal 90% predictive intervals cover only 75% of days, and almost never cover the spike days.

This pattern is a classic signature: the model’s mean structure might be reasonable, but the likelihood is too restrictive. A Poisson model cannot create enough dispersion unless λ itself varies dramatically across days. If your predictors do not explain the spikes, the model will systematically underestimate tail risk and produce intervals that are too narrow. Operationally, this is dangerous because it creates false reassurance.

Diagnosing the cause: missing structure vs wrong noise model

Posterior predictive simulations tell you “the model cannot reproduce X.” The next step is to hypothesize why. In count forecasting, two broad causes are common:

  • Missing mean structure: there are predictors or latent states that shift the expected ticket volume (incidents, releases, outages, billing cycles). Without them, λ_i is too smooth.
  • Wrong dispersion assumptions: even after accounting for predictors, the data is more variable than Poisson allows. You need an overdispersed likelihood (e.g., Negative Binomial) or a mixture model.

You can use additional posterior predictive checks to distinguish these. For example, if spikes cluster around specific known events (release days), then adding an indicator may fix the issue. If spikes occur unpredictably and variability is high even within similar predictor settings, overdispersion is likely.
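One hedged way to probe this is to compare variance to mean within groups of similar days. The sketch below groups by a hypothetical day-of-week array dow (not shown in the data table); under a Poisson model the dispersion index should be near 1, so ratios well above 1 inside homogeneous groups point toward overdispersion rather than missing predictors.

# dow: array of day-of-week labels (0-6) for the 30 days, assumed available
for d in np.unique(dow):
    group = y[dow == d]
    if len(group) > 1:
        ratio = group.var(ddof=1) / group.mean()   # variance-to-mean (dispersion index)
        print(f"day-of-week {d}: dispersion index {ratio:.1f}")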

Model repair option 1: Replace Poisson with Negative Binomial (overdispersion)

A practical first fix for under-dispersed count models is to use a Negative Binomial likelihood, which adds a dispersion parameter and allows Var(y_i) > E(y_i). The mean can remain λ_i = exp(X_i β), but the variance becomes λ_i + λ_i^2 / ϕ (depending on parameterization). Intuitively, it lets the model admit “extra randomness” around the mean, widening predictive intervals and increasing the probability of spikes.

After refitting with Negative Binomial, repeat the same posterior predictive simulation workflow. You want to see improvements specifically in the tail metrics and coverage. A good sign is that the observed max and surge count now fall within the bulk of the simulated distributions (ppp values not near 0 or 1), and 90% intervals cover close to 90% of observations.

# Conceptual model (one common parameterization)
y_i ~ NegBinomial(mean = lambda_i, dispersion = phi)
log(lambda_i) = X_i beta
# Posterior predictive simulation is the same pattern:
# draw (beta_s, phi_s), compute lambda_is, simulate y_rep
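A sketch of that replication step in NumPy, assuming posterior draw arrays beta_draws (S, P) and phi_draws (S,) from the refitted model. NumPy's negative binomial sampler takes (n, p), so the mean/dispersion form is converted via n = φ and p = φ / (φ + λ):

y_rep_nb = np.empty((S, N_days), dtype=int)
for s in range(S):
    lam_s = np.exp(X @ beta_draws[s])
    phi_s = phi_draws[s]
    p_s = phi_s / (phi_s + lam_s)                      # convert to NumPy's (n, p) parameterization
    y_rep_nb[s] = rng.negative_binomial(phi_s, p_s)    # mean lam_s, variance lam_s + lam_s**2 / phi_s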

Important nuance: “wider intervals” is not automatically “better.” You want intervals that are wide for the right reason: because the data is genuinely variable, not because the model is vague. Posterior predictive checks keep you honest by tying uncertainty to observed variability patterns.

Model repair option 2: Add a latent “incident day” component (mixture / burst model)

Sometimes overdispersion alone is not enough. If spikes are rare but huge, a single dispersion parameter may still struggle: it will either not generate big enough spikes, or it will inflate variance everywhere and over-widen intervals on normal days. A targeted alternative is a mixture model: most days are “normal,” but some days are “incident” days with a much higher rate.

One simple structure is:

  • z_i ~ Bernoulli(π) indicates whether day i is an incident day.
  • If z_i = 0: y_i ~ Poisson(λ_i).
  • If z_i = 1: y_i ~ Poisson(κ · λ_i) with κ > 1 (a multiplicative spike factor), or use a separate incident rate.

Now posterior predictive simulations can generate occasional spikes without forcing all days to be noisy. Diagnostics to add here include: does the model reproduce the distribution of run lengths between spikes, and does it place incident probability on the right days (if you have incident labels)? Even without labels, posterior predictive simulations will show whether the mixture generates realistic spike frequency and magnitude simultaneously.

# Posterior predictive simulation with the mixture (NumPy; assumes posterior
# draw arrays beta_draws (S, P), pi_draws (S,), kappa_draws (S,))
y_rep = np.empty((S, N_days), dtype=int)
for s in range(S):
    lam_s = np.exp(X @ beta_draws[s])
    z_rep = rng.binomial(1, pi_draws[s], size=N_days)           # incident-day indicators
    rate = np.where(z_rep == 1, kappa_draws[s] * lam_s, lam_s)  # spike factor on incident days
    y_rep[s] = rng.poisson(rate)

This kind of repair is especially useful when the business question is “How many surge days should we staff for?” because the model can explicitly represent surge regimes.

Model repair option 3: Add time dependence (correlated errors and clustering)

Ticket spikes often cluster: a bad release causes elevated volume for several days. If your model assumes conditional independence day-to-day, it may understate the probability of multi-day surges. A time-dependent component can help, such as a latent daily effect that evolves smoothly (random walk on log-rate) or an autoregressive term on residuals.

Posterior predictive simulations are again the litmus test: simulate full time series and compare clustering metrics, such as the number of consecutive days above 40, or the autocorrelation of y. If the observed data has stronger autocorrelation than the simulations, independence is a miscalibration source.

  • Clustering metric: max run length of days with y > 40.
  • Autocorrelation metric: corr(y_t, y_{t-1}).
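A small sketch of these two clustering statistics, applied to the observed series and to each replicated series (threshold of 40 taken from the bullets above):

def max_run_above(counts, threshold=40):
    """Longest streak of consecutive days with counts above the threshold."""
    best = run = 0
    for c in counts:
        run = run + 1 if c > threshold else 0
        best = max(best, run)
    return best

def lag1_autocorr(counts):
    return np.corrcoef(counts[:-1], counts[1:])[0, 1]

run_obs = max_run_above(y)
ac_obs = lag1_autocorr(y)
run_rep = np.array([max_run_above(y_rep[s]) for s in range(S)])
ac_rep = np.array([lag1_autocorr(y_rep[s]) for s in range(S)])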

How to run the diagnostic loop without getting lost

Posterior predictive simulation is most effective as a tight loop: pick a small set of checks tied to decisions, run them, change one modeling assumption, rerun, and compare. A practical way to keep this disciplined is to maintain a “checklist” of diagnostics that you compute for every candidate model so you can see what improves and what degrades.

  • Tail: max(y), count(y > threshold), 95th percentile.
  • Spread: sd(y), interquantile range.
  • Coverage: 50% and 90% interval coverage overall and on high-risk days.
  • Structure: day-of-week pattern, autocorrelation, clustering of surges.

When you compare models, avoid the temptation to optimize a single number. For staffing, you might accept slightly worse average error if it dramatically improves tail calibration, because the cost of being understaffed on surge days dominates. Posterior predictive simulations let you see those tradeoffs directly: you can simulate staffing outcomes under each model by translating predicted ticket counts into required agents and computing the frequency of understaffing events.
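To make that tradeoff concrete, here is a hedged sketch of translating replications into staffing risk, under an assumed capacity of 8 tickets per agent per day and a candidate plan of 4 agents (both numbers are illustrative, not from the case study):

TICKETS_PER_AGENT = 8          # assumed daily capacity per agent
n_agents = 4                   # candidate staffing plan to evaluate

capacity = n_agents * TICKETS_PER_AGENT
# fraction of simulated days that exceed capacity under a given model's y_rep
p_understaffed_day = np.mean(y_rep > capacity)
# distribution of understaffed days per 30-day period
understaffed_days = (y_rep > capacity).sum(axis=1)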

A concrete “before vs after” diagnostic narrative

To make the workflow tangible, imagine you run the checks on the original Poisson model and then on a Negative Binomial model:

  • Poisson: simulated maxima rarely exceed 45; observed max 71 is essentially impossible. 90% intervals miss all four surge days. Coverage is 75% instead of 90%.
  • Negative Binomial: simulated maxima sometimes reach 65–80; observed max 71 is no longer extreme. Simulated surge-day counts often fall between 1 and 5; observed 4 is typical. 90% interval coverage rises to ~88–92% overall, and at least some surge days are inside the interval.

This is what “diagnosing miscalibration” looks like in practice: you identify the mismatch (tails and variability), propose a mechanism (overdispersion), refit, and verify via posterior predictive simulations that the mismatch is reduced in the aspects that matter for decisions.

Implementation notes: making posterior predictive checks routine

Two habits make this process scalable in real projects. First, always simulate replicated datasets as a standard output of model fitting, not as an afterthought. Second, predefine a small set of domain-specific test statistics that you compute automatically and track over time (especially if the data-generating process changes).

In production forecasting, you can run posterior predictive simulations on a rolling window (e.g., last 8 weeks) and monitor calibration drift: if coverage drops or tail metrics become extreme again, it signals that the world changed (new product behavior, new marketing channel, new incident pattern) and the model needs updating. The key is that posterior predictive simulation is not just a one-time “model check”; it is a practical diagnostic instrument for keeping predictive uncertainty aligned with reality.
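A minimal monitoring sketch, assuming you log each day's observed count together with the 90% predictive interval the model issued for that day (the names obs, lo90, hi90 and the alert threshold are illustrative):

WINDOW = 56   # last 8 weeks

# obs, lo90, hi90: aligned 1-D arrays of logged daily values, assumed available
recent = slice(-WINDOW, None)
rolling_coverage = np.mean((obs[recent] >= lo90[recent]) & (obs[recent] <= hi90[recent]))
if rolling_coverage < 0.80:   # alert threshold is a judgment call
    print(f"Calibration drift: 90% intervals covered only {rolling_coverage:.0%} of the last {WINDOW} days")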

Now answer the exercise about the content:

A Bayesian Poisson ticket-volume model matches average daily counts but repeatedly fails to warn about 60+ ticket spikes. Which posterior predictive check most directly tests whether the model can generate spikes of the observed magnitude?


Answer: Comparing max(y) to max values from posterior predictive replications targets tail behavior. If the observed maximum is far outside the simulated maxima, the model is miscalibrated for extremes and underestimates spike risk.

Next chapter

Computation in Plain Language: MCMC and Variational Inference Concepts
