Practical Bayesian Statistics for Real-World Decisions: From Intuition to Implementation

From Beliefs to Evidence: Priors, Likelihood, and Posteriors

Chapter 2

Why Bayesian Updating Matters in Practice

Bayesian statistics is a disciplined way to move from what you currently believe to what the data suggests, without pretending you started from zero. In real-world decisions, you almost always have some prior information: past performance, expert judgment, earlier experiments, domain constraints, or even a rough baseline expectation. Bayesian updating turns that information into a prior distribution, combines it with a likelihood model for how data would be generated, and produces a posterior distribution that represents what you should believe after seeing the evidence.

In this chapter, you will learn how priors, likelihoods, and posteriors fit together mathematically and operationally. You will also learn how to build them step by step for common business and scientific situations, how to interpret the resulting posterior, and how to check whether your modeling choices are reasonable.

The Three Building Blocks: Prior, Likelihood, Posterior

Prior: encoding beliefs before seeing current data

A prior distribution describes uncertainty about an unknown quantity before observing the current dataset. The unknown quantity is often called a parameter (for example, a conversion rate, a defect probability, an average response time, or a treatment effect). A prior is not a single guess; it is a probability distribution that expresses a range of plausible values and how plausible each value is.

Two practical ways to think about priors are: (1) as a summary of previous evidence (historical data, earlier trials), and (2) as a regularizer that prevents extreme conclusions from small or noisy samples. Priors can be weakly informative (broad but realistic), informative (tight due to strong prior knowledge), or intentionally skeptical (shrinking effects toward no change unless data is compelling).

Likelihood: a model of how data arises

The likelihood connects parameters to observed data. It answers: if the parameter had a particular value, how probable is it that we would observe data like this? The likelihood is determined by your data-generating assumptions: distributional form (Bernoulli, Binomial, Normal, Poisson, etc.), independence assumptions, and any covariates or structure you include.

In practice, choosing a likelihood is often the most consequential modeling decision. A prior can be adjusted, but a mismatched likelihood can systematically misrepresent evidence. For example, modeling heavy-tailed data with a Normal likelihood can understate uncertainty and overreact to outliers.

Posterior: updated beliefs after seeing evidence

The posterior distribution is the result of combining prior and likelihood via Bayes’ rule. It represents your updated uncertainty about the parameter after observing the data. Importantly, the posterior is still a distribution: it quantifies both what values are plausible and how uncertain you remain.

From the posterior, you can compute summaries that support decisions: the posterior mean or median (point estimates), credible intervals (uncertainty ranges), probabilities of exceeding thresholds (for example, the probability that the conversion rate is above 3%), and predictive distributions for future outcomes.

Bayes’ Rule in One Line (and What It Means)

Bayes’ rule for a parameter \(\theta\) and data \(y\) is: posterior \(p(\theta\mid y)\) is proportional to likelihood \(p(y\mid \theta)\) times prior \(p(\theta)\). In symbols: \(p(\theta\mid y) \propto p(y\mid \theta)\,p(\theta)\). The missing proportionality constant is the evidence (also called the marginal likelihood), which ensures the posterior integrates to 1.

Operationally, this means you can think in two steps: start with a prior curve over \(\theta\), then reweight each \(\theta\) by how well it explains the observed data. Values of \(\theta\) that make the data more likely get more posterior weight; values that make the data unlikely get downweighted.
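
Here is a minimal sketch of that reweighting in Python, using a grid approximation with a Binomial likelihood; the grid size and the counts are illustrative, not taken from the text:

import numpy as np

# Grid of candidate parameter values (a rate between 0 and 1)
theta = np.linspace(0.001, 0.999, 999)

# Prior weights over the grid (uniform here; any positive weights work)
prior = np.ones_like(theta)

# Likelihood of observing k successes in n trials at each candidate theta
n, k = 20, 3
likelihood = theta**k * (1 - theta)**(n - k)

# Reweight and renormalize: posterior is proportional to likelihood times prior
posterior = likelihood * prior
posterior /= posterior.sum()

Candidate values that explain the data well end up with most of the normalized weight; the division by the sum plays the role of the evidence term.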

A Concrete Example: Conversion Rate with a Beta Prior

Problem setup

Suppose you run an online signup flow and want to estimate the true conversion rate \(p\). You observe \(n\) visitors and \(k\) conversions. A common likelihood is Binomial: \(k \sim \text{Binomial}(n, p)\). The conjugate prior for \(p\) under a Binomial likelihood is a Beta distribution: \(p \sim \text{Beta}(\alpha, \beta)\).

The Beta distribution is convenient because it lives on \([0,1]\) and can represent many shapes: uniform, skewed toward 0, skewed toward 1, or peaked around a middle value. The parameters \(\alpha\) and \(\beta\) can be interpreted as prior pseudo-counts: \(\alpha-1\) prior successes and \(\beta-1\) prior failures (this is an intuition, not a literal dataset).

Posterior update (closed form)

With a Beta prior and Binomial likelihood, the posterior is also Beta: \(p\mid k,n \sim \text{Beta}(\alpha + k, \beta + n - k)\). This is Bayesian updating in its simplest form: you add observed successes to \(\alpha\) and observed failures to \(\beta\).

Step-by-step: choosing a prior that matches your knowledge

Step 1: Decide what “typical” conversion rates look like in your context. For example, you might believe most similar pages convert around 3% to 6%.

Step 2: Translate that belief into a prior mean and strength. The mean of \(\text{Beta}(\alpha,\beta)\) is \(\alpha/(\alpha+\beta)\). The total \(\alpha+\beta\) controls how strongly the prior resists being moved by data.

Step 3: Pick \(\alpha\) and \(\beta\). Suppose you want a prior mean of 4% and you want the prior to carry about 100 visitors' worth of information. Then set \(\alpha+\beta=100\) with \(\alpha=4\), \(\beta=96\). That gives \(\text{Beta}(4,96)\), a weakly-to-moderately informative prior centered at 4%.

Step 4: Update with data. If you observe \(n=200\) visitors and \(k=12\) conversions (6%), the posterior is \(\text{Beta}(4+12, 96+188)=\text{Beta}(16,284)\). The posterior mean becomes \(16/(16+284)\approx 5.33\%\), which is between the prior mean (4%) and the sample rate (6%), weighted by their effective sample sizes.
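
As a sketch of that arithmetic in plain Python, using the numbers from this example:

# Prior: mean 4%, strength of roughly 100 visitors' worth of information
alpha_prior, beta_prior = 4, 96

# Data: 12 conversions out of 200 visitors
n, k = 200, 12

# Conjugate update: add successes to alpha, failures to beta
alpha_post = alpha_prior + k               # 16
beta_post = beta_prior + (n - k)           # 284

posterior_mean = alpha_post / (alpha_post + beta_post)   # about 0.0533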

Step-by-step: extracting decision-relevant quantities

Step 1: Compute a credible interval for \(p\). A 95% credible interval is the range \([p_{2.5\%}, p_{97.5\%}]\) under the posterior. This answers: given the model and prior, there is a 95% probability that \(p\) lies in this interval.

Step 2: Compute a probability of beating a target. If your business target is 5%, compute \(P(p > 0.05 \mid \text{data})\) from the posterior. This directly supports go/no-go decisions.

Step 3: Predict future outcomes. Use the posterior predictive distribution to estimate how many conversions you might see in the next \(m\) visitors. This is often more actionable than a parameter estimate because it maps uncertainty into future counts and risk.
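
The three steps above can be read directly off the \(\text{Beta}(16,284)\) posterior from the previous example. A sketch, assuming NumPy and SciPy are available (the 500-visitor horizon is illustrative):

import numpy as np
from scipy.stats import beta

alpha_post, beta_post = 16, 284

# Step 1: central 95% credible interval for p
ci_low, ci_high = beta.ppf([0.025, 0.975], alpha_post, beta_post)

# Step 2: probability of beating a 5% business target
p_above_target = 1 - beta.cdf(0.05, alpha_post, beta_post)

# Step 3: posterior predictive for conversions among the next m visitors,
# approximated by simulation: draw p from the posterior, then draw a count
rng = np.random.default_rng(0)
m = 500
p_draws = rng.beta(alpha_post, beta_post, size=10_000)
future_conversions = rng.binomial(m, p_draws)   # predictive sample of counts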

Interpreting Priors as “Equivalent Data” (Carefully)

For conjugate models like Beta-Binomial, the prior often has an “equivalent sample size” interpretation: \(\alpha+\beta\) behaves like the amount of prior information. This is useful for calibration: if you set \(\alpha+\beta\) too large, your posterior will barely move even with substantial new data; if you set it too small, the prior barely matters and you may get unstable estimates with small samples.

However, treat equivalent sample size as a tuning metaphor, not a guarantee. It depends on the likelihood being appropriate and on the prior actually reflecting your knowledge. A strongly informative prior that is wrong can dominate the posterior and lead to confident but incorrect conclusions.
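
To make the metaphor concrete, here is a small comparison in plain Python (illustrative numbers): two priors with the same 4% mean but different total \(\alpha+\beta\), updated with the same 12-out-of-200 data.

n, k = 200, 12
for alpha_prior, beta_prior in [(4, 96), (40, 960)]:
    alpha_post = alpha_prior + k
    beta_post = beta_prior + (n - k)
    print(alpha_prior + beta_prior, alpha_post / (alpha_post + beta_post))
# strength 100  -> posterior mean ~0.053 (pulled toward the 6% sample rate)
# strength 1000 -> posterior mean ~0.043 (barely moved from the 4% prior mean)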

Likelihood Choices: Matching the Data You Actually Have

Binary outcomes: Bernoulli/Binomial

Use Bernoulli for individual 0/1 outcomes and Binomial for aggregated counts. This fits conversions, pass/fail tests, defect occurrence, and click/no-click events. If data are overdispersed (more variability than Binomial predicts), consider a hierarchical model or a Beta-Binomial likelihood to capture heterogeneity across users, days, or segments.

Counts: Poisson (and when it breaks)

For event counts per unit time or space (support tickets per day, failures per week), Poisson likelihood is common. It assumes the mean equals the variance. If variance is larger than the mean (overdispersion), a Negative Binomial likelihood is often a better fit.
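
A quick, informal diagnostic before committing to a Poisson likelihood is to compare the sample mean and variance. A sketch with hypothetical daily counts (NumPy assumed):

import numpy as np

counts = np.array([12, 15, 9, 30, 11, 14, 41, 10, 13, 16])   # hypothetical tickets per day

mean, var = counts.mean(), counts.var(ddof=1)
dispersion_ratio = var / mean   # near 1 is consistent with Poisson; much larger suggests overdispersion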

Continuous measurements: Normal and robust alternatives

For measurements like response times, revenue per user, or sensor readings, a Normal likelihood is a starting point when data are roughly symmetric and not too heavy-tailed. When you see outliers or skew, consider a Student-t likelihood (robust to outliers) or model the log of the measurement (log-normal behavior) if multiplicative effects dominate.

Another Worked Example: Mean of a Measurement with Normal Likelihood

Problem setup

Suppose you measure the average time (in seconds) to complete a task. Let \(y_1,\dots,y_n\) be observed times. A common likelihood is \(y_i \sim \mathcal{N}(\mu, \sigma^2)\). If \(\sigma\) is known (or treated as known from stable instrumentation), you can place a Normal prior on \(\mu\): \(\mu \sim \mathcal{N}(\mu_0, \tau_0^2)\).

Posterior update (intuition)

The posterior for \(\mu\) is also Normal. The posterior mean becomes a precision-weighted average of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\). “Precision” is the inverse variance: higher precision means more confidence. If the data are noisy (large \(\sigma^2\) or small \(n\)), the posterior stays closer to the prior; if the data are precise (small \(\sigma^2\) and large \(n\)), the posterior moves toward \(\bar{y}\).
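
For reference, the closed-form result in this known-variance case is \(\mu \mid y \sim \mathcal{N}(\mu_n, \tau_n^2)\) with \(\tau_n^2 = \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1}\) and \(\mu_n = \tau_n^2\left(\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar{y}}{\sigma^2}\right)\): precisions add, and the posterior mean weights \(\mu_0\) and \(\bar{y}\) by their respective precisions.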

Step-by-step: building a weakly informative prior

Step 1: Identify plausible bounds. If task completion times are almost always between 10 and 120 seconds, your prior should put most mass in that range.

Step 2: Choose a center. If typical time is around 45 seconds, set \(\mu_0=45\).

Step 3: Choose a spread. Pick \(\tau_0\) so that, say, 95% of prior mass lies roughly within plausible bounds. A rough rule is that 95% of a Normal lies within about \(\pm 2\tau_0\). If you want 45 \(\pm\) 40 seconds to cover most plausible values, set \(\tau_0\approx 20\).

Step 4: Update with data and interpret the posterior mean and credible interval. The credible interval tells you not only an estimate but also how much uncertainty remains, which is essential for deciding whether a change is practically meaningful.
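
A minimal sketch of steps 2 through 4 in Python, treating \(\sigma\) as known (the 25-second value and the observations below are hypothetical):

import numpy as np

# Weakly informative prior: centered at 45 s, tau0 = 20 s covers the plausible range
mu0, tau0 = 45.0, 20.0

# Hypothetical task times, with sigma treated as known
y = np.array([38.0, 52.0, 61.0, 47.0, 55.0, 49.0, 44.0, 58.0])
sigma = 25.0
n, ybar = len(y), y.mean()

# Precision-weighted update (known-variance Normal-Normal model)
post_precision = 1 / tau0**2 + n / sigma**2
post_var = 1 / post_precision
post_mean = post_var * (mu0 / tau0**2 + n * ybar / sigma**2)

# 95% credible interval under the Normal posterior
ci = (post_mean - 1.96 * post_var**0.5, post_mean + 1.96 * post_var**0.5)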

Posterior Predictive: Turning Parameter Uncertainty into Forecasts

The posterior describes uncertainty about parameters; the posterior predictive describes uncertainty about future observations. For a future data point \(\tilde{y}\), the posterior predictive distribution is \(p(\tilde{y}\mid y)=\int p(\tilde{y}\mid \theta)\,p(\theta\mid y)\,d\theta\). This integral averages predictions over plausible parameter values, automatically accounting for parameter uncertainty.

Practically, posterior predictive answers questions like: how many signups should we expect tomorrow, and how variable might that be? What is the probability next week’s defect count exceeds a safety threshold? What range of response times should we plan capacity for? These are decision-shaped questions, and the posterior predictive is the Bayesian tool that maps beliefs into operational forecasts.

How to Check Whether Your Prior and Likelihood Make Sense

Prior predictive checks (before seeing data)

A prior predictive check simulates fake datasets from the model using only the prior and likelihood: draw \(\theta\sim p(\theta)\), then draw \(y\sim p(y\mid \theta)\). If the simulated data look wildly unrealistic (conversion rates near 50% when you know they are around 5%, or negative times, or implausibly huge counts), your prior and/or likelihood is miscalibrated.

Step-by-step: (1) sample many \(\theta\) values from the prior, (2) simulate datasets from the likelihood, (3) compute summary statistics you care about (rates, means, maxima), (4) compare those summaries to what you consider plausible in your domain. Adjust prior scale or likelihood accordingly.
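
A minimal sketch of this loop for the conversion example, using the \(\text{Beta}(4,96)\) prior and a Binomial likelihood with 200 visitors (NumPy assumed):

import numpy as np

rng = np.random.default_rng(0)
n_sims, n_visitors = 5_000, 200

# (1) sample rates from the prior, (2) simulate datasets from the likelihood
p_prior = rng.beta(4, 96, size=n_sims)
k_sim = rng.binomial(n_visitors, p_prior)

# (3) a summary statistic you care about: the implied conversion rate
sim_rates = k_sim / n_visitors

# (4) compare to domain plausibility, e.g. how often the prior implies an absurd rate
share_above_20pct = (sim_rates > 0.20).mean()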

Posterior predictive checks (after seeing data)

A posterior predictive check simulates data from the posterior: draw \(\theta\sim p(\theta\mid y)\), then draw replicated data \(y^{\text{rep}}\sim p(y\mid \theta)\). Compare replicated data to observed data using plots or summary statistics. If the model cannot reproduce key features of the observed data (skew, variance, outlier frequency, seasonality), the likelihood is likely misspecified or missing structure.
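
The posterior-side analogue, sketched for the same conversion example (observed 12 conversions out of 200, posterior \(\text{Beta}(16,284)\)), using the raw count as the test statistic:

import numpy as np

rng = np.random.default_rng(1)
n_visitors, k_obs = 200, 12

# Draw parameters from the posterior, then replicate datasets
p_post = rng.beta(16, 284, size=5_000)
k_rep = rng.binomial(n_visitors, p_post)

# Compare observed data to the replicated distribution
tail_prob = (k_rep >= k_obs).mean()   # an extreme value signals misfit on this statistic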

When Conjugacy Ends: Computing Posteriors with Simulation

Some models do not yield a closed-form posterior. This happens when you add realistic complexity: multiple parameters, hierarchical structure, non-linear relationships, censoring, missing data mechanisms, or robust likelihoods. In these cases, you often compute the posterior using simulation methods such as Markov chain Monte Carlo (MCMC) or approximate methods like variational inference.

Even when you use simulation, the conceptual workflow stays the same: specify a prior, specify a likelihood, compute a posterior, then summarize and validate using predictive checks. The difference is computational: instead of a neat formula like Beta-Binomial, you obtain samples from the posterior and compute quantities of interest from those samples.

Practical Workflow: From Beliefs to Evidence in a Repeatable Recipe

Step 1: Define the parameter(s) you care about

Be explicit about what is unknown. Examples: a conversion rate \(p\), a mean \(\mu\), a rate \(\lambda\), a difference in conversion rates \(p_B-p_A\), or a ratio of rates. Clear parameter definitions prevent confusion later when interpreting the posterior.

Step 2: Choose a likelihood that matches measurement and noise

Write down how data would look if you knew the parameter. Decide whether outcomes are binary, counts, or continuous, and whether variance patterns suggest overdispersion or heavy tails. If you have grouping (by region, day, device), consider whether a hierarchical likelihood is needed to avoid pooling everything together or splitting too aggressively.

Step 3: Build a prior that is defensible and testable

Use domain knowledge to set plausible ranges and typical values. Prefer weakly informative priors that rule out nonsense but do not overwhelm the data. When you do have strong prior evidence (previous experiments), encode it transparently and document its source. Then run a prior predictive check to ensure the implied data are realistic.

Step 4: Compute the posterior and summarize it for decisions

Use analytic updates when available (like Beta-Binomial) or simulation when needed. Summarize with credible intervals, exceedance probabilities, and posterior predictive forecasts. Avoid relying on a single point estimate when uncertainty is decision-relevant.

Step 5: Validate with posterior predictive checks

Confirm the model can reproduce key patterns in the observed data. If it cannot, revise the likelihood (for example, add overdispersion or robustness) or reconsider assumptions (independence, constant variance, missingness).

Mini Cheat Sheet: Common Prior-Likelihood Pairings

  • Binary rate \(p\): Binomial likelihood with Beta prior (fast updates, intuitive pseudo-counts).
  • Count rate \(\lambda\): Poisson likelihood with Gamma prior (useful for event rates; consider Negative Binomial for overdispersion; see the sketch after this list).
  • Normal mean \(\mu\) with known variance: Normal likelihood with Normal prior (posterior mean is a precision-weighted average).
  • Regression coefficients: Normal priors (often centered at 0) with likelihood depending on outcome type (Normal, Bernoulli with logistic link, Poisson, etc.).
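
A minimal sketch of the Gamma-Poisson pairing mentioned above, under the rate parameterization and with illustrative prior values and counts:

import numpy as np

# Gamma(a, b) prior on a Poisson rate lambda, with b a rate parameter
a, b = 2.0, 0.5                       # prior mean a / b = 4 events per day
y = np.array([3, 5, 4, 6, 2, 7, 5])   # observed daily counts

# Conjugate update: posterior is Gamma(a + sum(y), b + n)
a_post, b_post = a + y.sum(), b + len(y)
post_mean_rate = a_post / b_post

# Posterior predictive for tomorrow's count, by simulation
rng = np.random.default_rng(0)
lam_draws = rng.gamma(a_post, 1 / b_post, size=10_000)   # NumPy uses scale = 1 / rate
tomorrow = rng.poisson(lam_draws)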

Worked Micro-Example: Probability a Rate Exceeds a Threshold

Suppose you have a Beta posterior for conversion rate \(p\): \(p\mid \text{data} \sim \text{Beta}(16,284)\). A common decision question is: what is the probability that \(p\) exceeds 5%? This is \(P(p > 0.05\mid \text{data})=1-F_{\text{Beta}(16,284)}(0.05)\), where \(F\) is the Beta cumulative distribution function.

Step-by-step: (1) compute the Beta CDF at the threshold, using software, (2) subtract from 1, (3) compare to a decision rule (for example, ship if probability exceeds 0.9). This approach is often more aligned with operational decisions than asking whether a p-value crosses 0.05, because it directly quantifies the probability of meeting a business-relevant target under uncertainty.

# Exceedance probability for a Beta posterior (alpha, beta) and threshold t, shown in Python with SciPy
from scipy.stats import beta as beta_dist
alpha, beta_param, t = 16, 284, 0.05
prob = 1 - beta_dist.cdf(t, alpha, beta_param)   # P(p > t | data)

Now answer the exercise about the content:

In Bayesian updating, what does a posterior predictive distribution primarily help you answer?

Answer: The posterior predictive describes uncertainty about future observations by integrating the likelihood over the posterior, so forecasts reflect both outcome noise and parameter uncertainty.

Next chapter

Interpreting Uncertainty: Credible Intervals and Decision-Relevant Probabilities
