Practical Bayesian Statistics for Real-World Decisions: From Intuition to Implementation

Model Checking and Calibration You Can Actually Use

Chapter 19

What “model checking and calibration” mean in practice

Bayesian modeling gives you a posterior distribution, but that does not guarantee your model is a good description of how the world generates data. Model checking asks: “If this model were true, would it produce data that look like what we observe?” Calibration asks: “When the model says an event has probability p, does it happen about p of the time in the long run?” These are different: a model can fit historical data (look plausible) yet be poorly calibrated for decisions, or be roughly calibrated on average but miss important patterns (like seasonality or heavy tails) that matter for risk.

In real-world work, model checking is not about proving a model is true; it is about finding mismatches that are decision-relevant. Calibration is not about perfect probability forecasts; it is about making probabilities trustworthy enough that expected-loss decisions behave as intended. The goal is to create a repeatable workflow: generate predictions, compare to reality, diagnose failures, and update the model or its use.

Posterior predictive checks (PPC): the workhorse you’ll actually use

Posterior predictive checks compare observed data to data simulated from the model. The key idea is simple: draw parameters from the posterior, simulate replicated datasets, compute summary statistics (or visual patterns) on both, and see whether the observed statistics look typical under the model. If the observed data look extreme relative to the replicated data, the model is missing something.

Step-by-step: a practical PPC workflow

Step 1: Choose what “looks like the data” means. Pick test quantities that reflect the decisions you care about. For a conversion model, you might check overall conversion rate, per-day variability, and the number of zero-conversion days. For revenue, you might check tail behavior (e.g., 95th percentile) and the frequency of very large orders.

Step 2: Simulate replicated data. For each posterior draw, simulate a dataset of the same shape as the observed one (same number of days, users, groups, etc.). This creates a distribution of what you would expect to see if the model were correct.

Step 3: Compare observed vs replicated. Use plots (histograms, time-series overlays, residual plots) and simple discrepancy measures. You are looking for systematic gaps, not tiny differences.

Step 4: Diagnose the mismatch. Map the failure to a modeling assumption: wrong noise model, missing covariate, unmodeled heterogeneity, dependence over time, censoring, or measurement error.

Step 5: Decide what to do. Options include: refine the likelihood (e.g., heavier tails), add structure (trend/seasonality), add hierarchical components, include covariates, or keep the model but adjust decision rules (e.g., add safety margins) if the mismatch is small but risk-relevant.

Example: PPC for daily conversions with overdispersion

Suppose you model daily conversions as Binomial with a constant conversion probability. A PPC might reveal that the observed day-to-day variability is much larger than what the Binomial model generates. You might see too many “weird days” (very high or very low conversion) compared to simulations. This is a classic sign of overdispersion: the conversion probability is not constant; it varies by day due to traffic mix, campaigns, outages, or unobserved factors.

A practical fix is to introduce a day-level random effect (a latent daily conversion probability) or use a Beta-Binomial style observation model for counts aggregated by day. The PPC target quantity could be the variance of daily conversion rates; after the fix, the observed variance should fall within the replicated distribution more often.
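
As a minimal sketch of this check, assuming you already have posterior draws from a Beta-Binomial style (day-level variation) model, the PPC loop for the daily-variance test quantity could look like the code below. The names alpha_draws and beta_draws and the simulated "observed" data are illustrative stand-ins, not a specific library's API.

import numpy as np

rng = np.random.default_rng(0)
n_days, n_per_day = 90, 500

# Stand-in for real data: daily conversions with day-to-day variation
obs_conversions = rng.binomial(n_per_day, rng.beta(40, 360, size=n_days))
obs_rates = obs_conversions / n_per_day
T_obs = obs_rates.var()  # test quantity: variance of daily conversion rates

S = 1000
alpha_draws = rng.normal(40, 2, size=S)    # placeholders for real posterior draws
beta_draws = rng.normal(360, 10, size=S)
T_rep = np.empty(S)
for s in range(S):
    p_days = rng.beta(alpha_draws[s], beta_draws[s], size=n_days)  # latent daily probabilities
    y_rep = rng.binomial(n_per_day, p_days)                        # replicated conversions
    T_rep[s] = (y_rep / n_per_day).var()

# Posterior predictive p-value: how often replicated variance is at least as large as observed
print(f"T_obs = {T_obs:.5f}, P(T_rep >= T_obs) = {np.mean(T_rep >= T_obs):.2f}")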

Residual checks that work for Bayesian models

Residuals are still useful, but you should define them in a way that respects the predictive distribution. Two practical residual types are: (1) raw residuals using posterior predictive means, and (2) probability integral transform (PIT) residuals, which test whether observations look like draws from the predictive distribution.

Raw residuals with uncertainty bands

Compute the posterior predictive mean for each observation (or each time point), then subtract it from the observed value. Plot residuals versus fitted values, time, or key covariates. Add uncertainty by plotting predictive intervals for each point and marking whether the observation falls inside. If you see patterns (e.g., residuals drift upward over time), you likely need a trend component or a missing covariate.
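
A minimal sketch of this residual check, assuming a matrix of posterior predictive draws y_rep with shape (draws, observations) and an observed vector y_obs; the Poisson stand-ins below exist only to make the snippet self-contained.

import numpy as np

rng = np.random.default_rng(1)
S, N = 2000, 120
y_rep = rng.poisson(lam=50, size=(S, N))   # stand-in for real posterior predictive draws
y_obs = rng.poisson(lam=55, size=N)        # stand-in for the observed series

pred_mean = y_rep.mean(axis=0)             # posterior predictive mean per observation
resid = y_obs - pred_mean                  # raw residuals
lo, hi = np.percentile(y_rep, [5, 95], axis=0)   # pointwise 90% predictive intervals
inside = (y_obs >= lo) & (y_obs <= hi)     # does each observation fall inside its interval?

print(f"mean residual {resid.mean():.2f}, 90% interval coverage {inside.mean():.2f}")
# Plot resid against time or key covariates and color points by `inside` to spot drift.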

PIT and rank histograms (calibration for distributions)

For each observation y, compute the predictive CDF value u = P(Y ≤ y | data). If the predictive distribution is well calibrated, these u values should look Uniform(0,1). In practice, you can plot a histogram of u values. A U-shaped histogram suggests predictive intervals are too narrow (under-dispersed). A hump in the middle suggests intervals are too wide (over-dispersed). Skew indicates bias (systematic over- or under-prediction).

For discrete data (counts, binary outcomes), use randomized PIT or rank histograms based on replicated draws. The goal is the same: check whether the predictive distribution has the right spread and center.
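
For counts, a randomized PIT can be computed directly from replicated draws. The sketch below assumes predictive draws y_rep (draws, observations) and observations y_obs; the Poisson stand-ins are illustrative only.

import numpy as np

rng = np.random.default_rng(2)
S, N = 2000, 200
y_rep = rng.poisson(lam=20, size=(S, N))   # stand-in predictive draws
y_obs = rng.poisson(lam=20, size=N)        # stand-in observations

F_below = (y_rep < y_obs).mean(axis=0)         # empirical P(Y < y)
F_at_or_below = (y_rep <= y_obs).mean(axis=0)  # empirical P(Y <= y)
u = rng.uniform(F_below, F_at_or_below)        # randomized PIT values

# Flat histogram ~ calibrated; U-shape ~ too narrow; central hump ~ too wide
counts, _ = np.histogram(u, bins=10, range=(0.0, 1.0))
print(counts)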

Calibration you can use for decision-making probabilities

Many Bayesian outputs are probabilities used directly in decisions: probability demand exceeds capacity, probability a variant beats baseline by at least a threshold, probability churn exceeds a risk limit, and so on. Calibration here means: when you say “80% chance,” the event should occur about 80% of the time across comparable situations. This is especially important when you automate decisions or set policies based on probability thresholds.

Step-by-step: empirical calibration for a binary event

Step 1: Define the event and the forecast probability. Example event: “Next week’s stockout occurs.” Forecast probability: p_i from your posterior predictive distribution for each week i.

Step 2: Collect out-of-sample outcomes. Use a rolling-origin evaluation: fit on past data, forecast the next period, record p_i and the realized outcome y_i (0/1), then move forward.

Step 3: Bin predictions and compare. Group predictions into bins (e.g., 0–0.1, 0.1–0.2, …). For each bin, compute the average predicted probability and the observed frequency. Plot observed frequency vs predicted probability (a reliability diagram).

Step 4: Quantify miscalibration. Use metrics like Brier score (mean squared error for probabilities) and calibration error (e.g., weighted absolute difference between predicted and observed frequencies across bins). Also track sharpness: a model that always predicts 0.5 can be calibrated but not useful.

Step 5: Fix or post-process. If miscalibrated, you can (a) improve the model (missing covariates, wrong noise, nonstationarity), or (b) apply calibration mapping such as isotonic regression or logistic calibration on the predicted probabilities. Post-processing is often faster, but it should be validated and monitored because calibration can drift.
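
Steps 3 and 4 can be sketched in a few lines, assuming arrays p (forecast probabilities) and y (0/1 outcomes) logged from the rolling backtest; the synthetic data below stand in for real logs.

import numpy as np

rng = np.random.default_rng(3)
p = rng.uniform(0, 1, size=500)                 # stand-in forecast probabilities
y = rng.binomial(1, np.clip(p + 0.05, 0, 1))    # stand-in outcomes, slightly miscalibrated

brier = np.mean((p - y) ** 2)                   # Brier score

bins = np.linspace(0, 1, 11)                    # ten equal-width probability bins
bin_idx = np.clip(np.digitize(p, bins) - 1, 0, 9)
calib_error = 0.0
for b in range(10):
    mask = bin_idx == b
    if mask.any():
        gap = abs(p[mask].mean() - y[mask].mean())   # predicted vs observed frequency
        calib_error += mask.mean() * gap             # weight each bin by its frequency

print(f"Brier {brier:.3f}, weighted calibration error {calib_error:.3f}")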

Reliability diagrams for continuous forecasts: coverage checks

For continuous outcomes (revenue, demand, time-to-resolution), a practical calibration check is interval coverage. If you produce 80% predictive intervals, then about 80% of future observations should fall inside them. Compute coverage for multiple levels (50%, 80%, 95%). Under-coverage means intervals are too narrow (risk underestimated). Over-coverage means intervals are too wide (forecasts too cautious), which can also be costly if it leads to over-buffering inventory or staffing.
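
A minimal coverage check at several levels, assuming out-of-sample predictive draws y_rep (draws, periods) and realized values y_obs from a backtest; the normal stand-ins are deliberately too narrow so the under-coverage is visible.

import numpy as np

rng = np.random.default_rng(4)
y_rep = rng.normal(100, 15, size=(4000, 250))   # stand-in predictive draws (too narrow)
y_obs = rng.normal(100, 20, size=250)           # stand-in realized outcomes

for level in (0.5, 0.8, 0.95):
    tail = (1 - level) / 2
    lo, hi = np.quantile(y_rep, [tail, 1 - tail], axis=0)
    coverage = np.mean((y_obs >= lo) & (y_obs <= hi))
    print(f"{int(level * 100)}% interval: empirical coverage {coverage:.2f}")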

Out-of-sample predictive validation: avoid “checking on the training data”

Posterior predictive checks on the observed dataset are valuable for diagnosing model shape, but they can miss overfitting or nonstationarity. For decisions, you usually care about future performance. A practical Bayesian workflow includes out-of-sample predictive validation: evaluate predictive distributions on data not used for fitting.

Step-by-step: time-series friendly validation

Step 1: Choose a backtesting scheme. For time-ordered data, use rolling or expanding windows. Example: train on weeks 1–20, predict week 21; train on 1–21, predict 22; and so on.

Step 2: Score the full predictive distribution. Use proper scoring rules that reward both calibration and sharpness. Common choices: log score (log predictive density at the observed value) and CRPS (continuous ranked probability score) for continuous outcomes. For binary events, use log loss or Brier score.

Step 3: Compare models and variants. Evaluate whether improvements are consistent across time and segments. A small average gain can hide large failures in specific regimes (holiday season, outages, new product launches).

Step 4: Tie scores to decisions. If the decision is threshold-based (e.g., act when P(event) > 0.7), also evaluate decision outcomes: false positives, false negatives, and realized cost under your cost model.
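
To make the scoring in Step 2 concrete, here is a sketch for a single backtest period, assuming a vector of predictive draws. The log score uses a normal approximation to the draws (an assumption, not a requirement), and the CRPS uses the standard sample-based estimator.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
draws = rng.normal(100, 15, size=4000)   # stand-in predictive draws for one period
y = 120.0                                # realized value for that period

# Log score via a normal approximation to the predictive draws (higher is better)
mu, sigma = draws.mean(), draws.std(ddof=1)
log_score = stats.norm(mu, sigma).logpdf(y)

# Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)
x1 = rng.choice(draws, size=2000)
x2 = rng.choice(draws, size=2000)
crps = np.mean(np.abs(draws - y)) - 0.5 * np.mean(np.abs(x1 - x2))

print(f"log score {log_score:.2f}, CRPS {crps:.2f}")

Accumulate these per-period scores across the rolling windows from Step 1, then compare models on averages and on their worst segments (Step 3).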

Common failure modes and what to check first

When a Bayesian model misbehaves, the issue is often not “Bayesian vs frequentist,” but a mismatch between assumptions and data-generating reality. The fastest way to improve is to recognize common patterns and run targeted checks.

  • Underestimated tails: Extreme outcomes happen more often than predicted. Check tail quantiles in PPC, PIT U-shape, and interval under-coverage at 95%. Fix with heavier-tailed likelihoods, mixture models, or explicit outlier components.
  • Unmodeled heterogeneity: Subgroups behave differently. Check PPC by segment (device, region, store, cohort). Fix with hierarchical structure or interaction terms.
  • Nonstationarity: Relationships change over time. Check residuals vs time and rolling calibration. Fix with time-varying parameters, trend/seasonality, or regime indicators.
  • Dependence: Observations are correlated (time, users, sessions). Check autocorrelation of residuals and PPC on run-lengths (streaks). Fix with time-series components or clustered likelihoods.
  • Measurement issues: Censoring, truncation, missingness, delayed reporting. Check data pipeline artifacts and PPC on missingness patterns. Fix by modeling the measurement process or adjusting the dataset.

Calibration for hierarchical and multi-group settings

In multi-group problems (stores, regions, classes, cohorts), you often care about group-level predictions and rankings. A model can be calibrated overall but miscalibrated for small groups. Practical calibration here means checking coverage and reliability by group size and by group type.

Step-by-step: group-wise coverage and “small group” stress tests

Step 1: Stratify groups by sample size. Create bins like small, medium, large groups based on exposure (users, days, transactions).

Step 2: Compute predictive interval coverage within each bin. For each group and period, check whether the realized outcome falls inside the group’s predictive interval. Aggregate coverage by size bin.

Step 3: Check shrinkage side effects. If small groups are systematically under-covered, your model may be over-shrinking (too little group variability). If they are over-covered, you may be too uncertain (too much variability or too weak information sharing).

Step 4: Adjust the group-level variance model. Consider richer group variance structures, covariates that explain group differences, or alternative distributions for group effects (heavier tails if some groups are true outliers).
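
A sketch of the group-wise coverage check from Steps 1 and 2, assuming per-group predictive interval endpoints lo and hi, realized outcomes y, and exposure counts n_exposure; all arrays and the size thresholds below are synthetic stand-ins.

import numpy as np

rng = np.random.default_rng(6)
G = 300
n_exposure = rng.integers(10, 5000, size=G)     # stand-in exposure (users, days, transactions)
lo = rng.normal(90, 5, size=G)                  # stand-in lower bounds of 80% intervals
hi = lo + rng.uniform(15, 40, size=G)           # stand-in upper bounds
y = rng.normal(100, 12, size=G)                 # stand-in realized group outcomes

size_bin = np.digitize(n_exposure, [500, 2000])  # 0 = small, 1 = medium, 2 = large (illustrative cutoffs)
for b, label in enumerate(["small", "medium", "large"]):
    mask = size_bin == b
    coverage = np.mean((y[mask] >= lo[mask]) & (y[mask] <= hi[mask]))
    print(f"{label} groups (n={mask.sum()}): 80% coverage {coverage:.2f}")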

Simulation-based calibration (SBC): verifying the inference engine, not just the model

Sometimes the issue is not the model but the inference procedure (bugs, poor MCMC convergence, variational approximations that underestimate uncertainty). Simulation-based calibration checks whether your Bayesian computation can recover known parameters when data are generated from the model itself.

Step-by-step: SBC in a lightweight, usable form

Step 1: Sample “true” parameters from the prior. Draw θ^(s) from your prior distribution.

Step 2: Simulate a dataset from the likelihood. Generate y^(s) ~ p(y | θ^(s)).

Step 3: Fit the model to the simulated dataset. Obtain posterior draws for θ given y^(s).

Step 4: Compute rank statistics. For each parameter component, compute the rank of the true value among posterior draws. Over many simulations, ranks should be uniform if inference is correct.

Step 5: Interpret failures. Non-uniform ranks can indicate biased inference, too-narrow posteriors, or convergence problems. This is especially useful when you deploy a model and want confidence that the computation is not silently wrong.
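
A lightweight SBC sketch using a conjugate Beta-Binomial toy model, where exact posterior draws stand in for your sampler's output; with a real model you would replace the conjugate update with an MCMC fit and thin the draws to reduce autocorrelation.

import numpy as np

rng = np.random.default_rng(7)
n_sims, n_obs, n_draws = 500, 50, 99
a0, b0 = 2.0, 2.0                      # Beta(a0, b0) prior on a success probability
ranks = np.empty(n_sims, dtype=int)

for s in range(n_sims):
    theta_true = rng.beta(a0, b0)                          # Step 1: draw "true" parameter from the prior
    y = rng.binomial(1, theta_true, size=n_obs).sum()      # Step 2: simulate a dataset
    post = rng.beta(a0 + y, b0 + n_obs - y, size=n_draws)  # Step 3: posterior draws (exact here)
    ranks[s] = np.sum(post < theta_true)                   # Step 4: rank of the true value

# Step 5: ranks should be roughly uniform on {0, ..., n_draws} if inference is correct
counts, _ = np.histogram(ranks, bins=10, range=(0, n_draws + 1))
print(counts)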

From checks to fixes: a practical debugging playbook

Model checking becomes useful when it leads to concrete changes. A good playbook starts with the simplest checks and escalates only when needed.

Playbook: diagnose and iterate

  • Start with predictive plots: overlay observed vs predictive intervals over time; compare histograms of observed vs replicated; check tail quantiles.
  • Check calibration at decision thresholds: if you act at P>0.8, evaluate reliability specifically around 0.7–0.9, not just overall.
  • Segment the checks: run PPC and calibration by key segments where decisions differ (new vs returning users, regions, channels).
  • Stress the model with “what breaks us” scenarios: holidays, outages, launches, sparse periods. Evaluate predictive performance separately.
  • Fix the biggest mismatch first: if tails are wrong, fix tails before adding minor covariates; if nonstationary, add time structure before fine-tuning priors.
  • Re-check after every change: keep a small dashboard of PPC plots, coverage, and scoring metrics so improvements are measurable.

Concrete mini-walkthrough: forecasting support tickets with usable calibration

Imagine you forecast daily support tickets to staff a team. You produce a predictive distribution for tomorrow’s ticket count and choose staffing so that capacity exceeds demand with 90% probability (to keep SLA risk low). If your 90% predictive interval under-covers (only 75% of days fall below the 90% quantile), you will be understaffed more often than intended.

Step-by-step: turning a calibration failure into a fix

Step 1: Backtest with rolling windows. For each day in the last 60 days, fit the model on prior data and forecast the next day’s distribution. Record the predicted 90% quantile q90_i and the realized count y_i.

Step 2: Check quantile coverage. Compute the fraction of days where y_i ≤ q90_i. If it is far below 0.9, your predictive distribution is too narrow or biased low.

Step 3: Run PPC on dispersion and streaks. Compare observed variance of daily counts to replicated variance. Also compare the distribution of run lengths (e.g., how often you get 3 high days in a row). Too many long streaks suggest temporal dependence (a code sketch of this check follows Step 5).

Step 4: Apply targeted model changes. If variance is too low, use a count model with extra dispersion (e.g., Negative Binomial). If streaks are too common, add a day-of-week effect and a latent time component (random walk) so the mean can drift.

Step 5: Re-evaluate decision calibration. After refitting, re-run the rolling backtest and recompute 90% coverage. Also evaluate the staffing policy: how often were you understaffed, and what was the realized cost (overtime, SLA penalties) compared to before?
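
The dispersion and streak checks from Step 3 can be sketched as follows, assuming replicated daily-count series y_rep (draws, days) and the observed series y_obs; the streak statistic here is the longest run of days above the observed median, one reasonable choice among many.

import numpy as np

def longest_run_above(series, threshold):
    # Length of the longest streak of values strictly above the threshold
    best = run = 0
    for v in series:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best

rng = np.random.default_rng(8)
S, T = 1000, 60
y_rep = rng.poisson(lam=80, size=(S, T))                     # stand-in replicated series
lam_obs = np.clip(80 + np.repeat(rng.normal(0, 15, size=T // 5), 5), 1, None)
y_obs = rng.poisson(lam=lam_obs)                             # stand-in "reality" with a drifting rate

thresh = np.median(y_obs)
T_obs_var, T_obs_run = y_obs.var(), longest_run_above(y_obs, thresh)
T_rep_var = y_rep.var(axis=1)
T_rep_run = np.array([longest_run_above(row, thresh) for row in y_rep])

print(f"P(rep variance >= observed) = {np.mean(T_rep_var >= T_obs_var):.2f}")
print(f"P(rep longest run >= observed) = {np.mean(T_rep_run >= T_obs_run):.2f}")

Probabilities close to zero for either test quantity point to under-dispersion or unmodeled temporal dependence, matching the diagnoses in Step 4.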

Implementation sketch: PPC and calibration in code-like steps

The exact code depends on your tooling, but the logic is stable. The following pseudocode outlines the core loops you can implement in Python/R with your Bayesian library of choice.

# Posterior predictive check (generic)
fit model to data -> posterior_draws
for s in 1..S:
    theta_s = posterior_draws[s]
    y_rep_s = simulate_data(theta_s, shape=observed_shape)
    T_rep[s] = test_quantity(y_rep_s)
T_obs = test_quantity(y_obs)
compare_distribution(T_rep, T_obs)  # e.g., plot histogram and mark T_obs

# Rolling calibration for a binary event
for t in evaluation_times:
    fit model on data up to t-1
    p_t = predictive_probability(event at t)
    y_t = observed_event(t)
    store(p_t, y_t)
reliability_diagram(p_list, y_list, bins=10)
brier = mean((p_list - y_list)^2)

# Interval coverage for continuous outcomes
for t in evaluation_times:
    fit model on data up to t-1
    pred_dist = predictive_distribution(t)
    lo, hi = quantile(pred_dist, [0.1, 0.9])  # 80% interval
    y_t = observed_value(t)
    covered[t] = (lo <= y_t <= hi)
coverage = mean(covered)

Operationalizing checks: what to monitor in production

If a model informs recurring decisions, you want lightweight monitoring that catches drift early. Practical monitoring focuses on a few indicators that map to business risk: calibration near decision thresholds, interval coverage at key levels, and tail-event frequency.

  • Threshold calibration: track how often events occur when predicted probability is in the “action zone” (e.g., 0.7–0.9).
  • Coverage dashboard: weekly coverage for 50/80/95% predictive intervals, overall and by segment.
  • Tail monitoring: frequency of outcomes beyond predicted 95th percentile; spikes indicate regime change or underestimated variance.
  • Data health checks: missingness, delayed reporting, sudden shifts in covariate distributions (data drift) that can break calibration.

When monitoring flags a problem, the response should be pre-planned: retrain schedule, recalibration mapping update, or a temporary policy change (e.g., increase safety buffer) until the model is updated.
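
As one lightweight example of threshold calibration monitoring, assuming you log forecast probabilities p and realized outcomes y each period, a recurring job could run a check like the one below; the 0.7–0.9 band and the 0.1 alert margin are policy choices, not fixed rules.

import numpy as np

rng = np.random.default_rng(9)
p = rng.uniform(0, 1, size=400)      # stand-in logged forecast probabilities
y = rng.binomial(1, p)               # stand-in logged outcomes

zone = (p >= 0.7) & (p <= 0.9)       # the band where the policy takes action
if zone.any():
    predicted = p[zone].mean()
    observed = y[zone].mean()
    print(f"action zone: mean predicted {predicted:.2f}, observed frequency {observed:.2f}")
    if abs(predicted - observed) > 0.1:   # alert margin: tune to your risk tolerance
        print("ALERT: calibration drift in the action zone")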

Now answer the exercise about the content:

A posterior predictive check shows that observed day-to-day conversion variability is much larger than what a Binomial model with constant conversion probability simulates. What is the most appropriate next step?

Excess variability versus Binomial simulations indicates overdispersion, meaning the conversion probability likely varies by day. Adding a day-level random effect or a Beta-Binomial observation model increases dispersion so replicated data better match observed patterns.

Next chapter

Posterior Predictive Checks, Outliers, and Detecting Overfitting
