Why Bayesian A/B Testing Is Not “Bayesian p-Values”
In many teams, “Bayesian A/B testing” gets reduced to a single number that looks like a frequentist p-value: a probability that variant B is better than A. That number can be useful, but it is not the goal. The goal is to make a decision under uncertainty using quantities that map directly to business outcomes: expected lift, probability of exceeding a minimum effect, probability of loss, and expected regret. Bayesian A/B testing is most valuable when you stop asking “Is it significant?” and start asking “What should we ship, given the trade-offs and our risk tolerance?”
In this chapter, you will learn how to structure Bayesian A/B tests around decision metrics, how to handle sequential monitoring without “peeking penalties,” how to incorporate practical constraints like minimum detectable effect (MDE) and guardrails, and how to extend beyond simple conversion rates to revenue per user and multi-metric decisions.
Decision Quantities That Replace p-Values
Probability of superiority (but used correctly)
A common Bayesian output is P(B > A | data). This is the probability that the true performance of B exceeds A. Unlike a p-value, it is directly interpretable. However, it can still mislead if the effect size is tiny. For example, you might have P(B > A) = 0.95 but the expected lift is only 0.1%, which may not justify engineering risk or rollout complexity.
Probability of exceeding a practical threshold
Instead of asking whether B is better at all, ask whether it is better by enough: P(B − A > δ), where δ is a minimum practical improvement (e.g., +0.5% relative conversion lift, +$0.02 revenue per user, −1% churn). This aligns the analysis with product and business constraints.
Expected lift and credible range of lift
Compute the posterior distribution of the lift L = B − A (absolute) or L = (B − A)/A (relative). Summarize it with the posterior mean (or median) and a credible interval for L. This gives both magnitude and uncertainty, which are essential for prioritization.
Expected loss (risk) and expected regret
Two variants can be compared by the expected loss of choosing one when the other is actually better. A practical approach is expected regret: if you choose B, regret is max(0, A − B) in the metric of interest. The posterior expected regret quantifies how costly a wrong decision might be. Teams often set rules like “Ship B if expected regret is below a threshold” or “Ship if probability of loss is below 5% and expected lift is above δ.”
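To make the contrast concrete, here is a minimal sketch (NumPy, with placeholder posterior draws of the lift itself) in which superiority is high but the effect is practically negligible: P(B > A) is about 0.95, yet P(lift > δ) is essentially zero and the expected regret of shipping is tiny. The draws are simulated stand-ins, not output of a real model.

import numpy as np

rng = np.random.default_rng(6)
M = 200_000
# Placeholder posterior draws of the absolute lift (B - A): a tiny but precisely
# estimated effect, as might come out of a very large experiment.
lift = rng.normal(0.0005, 0.0003, size=M)
delta = 0.003  # minimum practical improvement

print(np.mean(lift > 0))                 # P(B > A): high, about 0.95
print(np.mean(lift))                     # expected lift: about 0.0005, tiny
print(np.mean(lift > delta))             # P(lift > delta): essentially 0
print(np.mean(np.maximum(0.0, -lift)))   # expected regret of shipping B: very small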
A Practical Step-by-Step Bayesian A/B Workflow (Conversion Example)
Step 1: Define the decision and the metric
Write the decision in operational terms: “Roll out variant B to 100% of traffic” or “Keep A and iterate.” Choose a primary metric (e.g., conversion rate) and define the unit (user, session) and attribution window. Decide whether the metric is a proportion (converted or not), a count rate (events per user), or a continuous value (revenue per user). This chapter focuses on the decision workflow; the modeling details depend on metric type.
Step 2: Define practical thresholds and risk constraints
Pick δ, the minimum practical improvement, and define acceptable risk. Examples: (1) Ship if P(lift > 0) > 0.95 and P(lift > δ) > 0.80. (2) Ship if expected regret < 0.0002 conversions per user. (3) Do not ship if P(guardrail degradation > g) > 0.10 for any guardrail metric (e.g., latency, unsubscribe rate).
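One way to keep this step honest is to record the thresholds as a small configuration object before launch. The names and values below are hypothetical placeholders, not recommendations:

# Hypothetical pre-registered decision policy; write it down before launch.
DECISION_POLICY = {
    "delta": 0.003,                     # minimum practical absolute lift (0.3 pp)
    "p_lift_positive_min": 0.95,        # required P(lift > 0)
    "p_lift_exceeds_delta_min": 0.80,   # required P(lift > delta)
    "max_expected_regret": 0.0002,      # conversions per user
    "guardrails": {
        "unsubscribe_rate": {"max_degradation": 0.001, "max_harm_prob": 0.10},
        "latency_ms": {"max_degradation": 50, "max_harm_prob": 0.10},
    },
}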
Step 3: Plan sequential monitoring rules
Bayesian monitoring allows you to look at the data as it arrives and update your decision metrics without needing a p-value correction for peeking. The key is to predefine a stopping rule based on decision quantities, not on “significance.” Common stopping rules include: stop and ship when P(lift > δ) exceeds a threshold and guardrails are safe; stop and reject when P(lift < 0) exceeds a threshold; or stop when the expected value of collecting more data is low (a value-of-information rule).
Step 4: Compute posterior draws for each variant
In practice, you often compute posterior samples (Monte Carlo draws) for the metric of each variant, then transform those draws into lift, probability statements, and regret. This approach generalizes across many models and metrics. For a conversion metric, you would generate many draws of conversion rate for A and B from their posteriors, then compute lift for each draw.
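For a conversion metric with a Beta prior, generating those draws is one line per variant. A minimal sketch, assuming a Beta(1, 1) prior and illustrative counts:

import numpy as np

rng = np.random.default_rng(42)
M = 200_000  # number of Monte Carlo draws

# Illustrative counts: users and conversions per variant.
users_A, conv_A = 5_000, 400
users_B, conv_B = 5_000, 430

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each conversion rate.
pA = rng.beta(1 + conv_A, 1 + users_A - conv_A, size=M)
pB = rng.beta(1 + conv_B, 1 + users_B - conv_B, size=M)
lift = pB - pA  # one lift value per paired draw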
Step 5: Derive decision metrics from the draws
From paired draws (A_i, B_i), compute: (1) superiority indicator I(B_i > A_i) and average it to estimate P(B > A); (2) threshold indicator I(B_i − A_i > δ) for P(lift > δ); (3) lift distribution summary (mean, median, credible interval); (4) regret if choosing B: r_i = max(0, A_i − B_i), then average r_i for expected regret; (5) probability of loss P(B < A) and expected loss E[A − B | A > B] if you want a conditional loss measure.
Step 6: Make the decision and document it
Decide using the pre-agreed thresholds. Document the decision metrics, the stopping time, and any deviations (e.g., traffic imbalance, instrumentation issues). The documentation is not bureaucracy: it prevents “metric shopping” and makes future tests comparable.
Worked Example Structure (Using Posterior Sampling)
Suppose you ran an experiment where A had 10,000 users with 1,000 conversions and B had 10,000 users with 1,050 conversions. You want to decide whether to ship B. Instead of asking for a p-value, you compute posterior draws for conversion rates pA and pB, then compute the lift distribution. With weakly informative priors (e.g., Beta(1, 1)), you would find approximately: P(pB > pA) ≈ 0.88, expected absolute lift ≈ 0.005, and P(pB − pA > 0.003) ≈ 0.68. If your δ is 0.003 (0.3 percentage points) and your ship rule requires a 0.80 probability of exceeding δ, you would not ship yet, even though the probability of superiority is fairly high. You might continue collecting data or decide that the likely lift is too small to matter.
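A minimal sketch that reproduces these numbers by Monte Carlo, assuming Beta(1, 1) priors:

import numpy as np

rng = np.random.default_rng(0)
M = 400_000
pA = rng.beta(1 + 1_000, 1 + 9_000, size=M)   # A: 1,000 conversions / 10,000 users
pB = rng.beta(1 + 1_050, 1 + 8_950, size=M)   # B: 1,050 conversions / 10,000 users
lift = pB - pA

print(np.mean(pB > pA))       # P(pB > pA), roughly 0.88
print(np.mean(lift))          # expected absolute lift, roughly 0.005
print(np.mean(lift > 0.003))  # P(lift > 0.003), roughly 0.68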
The key is that the decision is driven by a practical threshold and risk tolerance, not by a binary “significant/not significant” label.
Sequential Testing Without the p-Value Mindset
What changes when you monitor daily?
With Bayesian updating, the posterior after day 7 is a coherent update of the posterior after day 6. You can compute P(lift > δ) each day and stop when it crosses your decision boundary. The caution is not mathematical incoherence; the caution is behavioral: teams may change thresholds midstream, add variants opportunistically, or stop early only when results look good. To keep the process disciplined, pre-register your stopping criteria and guardrails internally, and treat changes as a new decision problem.
Practical stopping rules
- Ship rule: Stop and ship B if P(lift > δ) > 0.90 and P(guardrail harm) < 0.05.
- Reject rule: Stop and reject B if P(lift < 0) > 0.90.
- Inconclusive rule: Stop if the credible interval for lift is entirely within [−ε, +ε], meaning the effect is practically negligible either way.
- Time cap: Stop after N days regardless, then decide with the posterior you have.
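A minimal sketch of a daily check that applies these rules for a conversion metric, assuming Beta(1, 1) priors and hypothetical cumulative counts; guardrail checks are omitted for brevity:

import numpy as np

rng = np.random.default_rng(7)
M = 100_000
delta, eps, max_days = 0.003, 0.001, 28   # practical threshold, negligible band, time cap

def check_day(day, users_a, conv_a, users_b, conv_b):
    # Cumulative counts so far; Beta(1, 1) prior for each conversion rate.
    pA = rng.beta(1 + conv_a, 1 + users_a - conv_a, size=M)
    pB = rng.beta(1 + conv_b, 1 + users_b - conv_b, size=M)
    lift = pB - pA
    lo, hi = np.quantile(lift, [0.05, 0.95])
    if np.mean(lift > delta) > 0.90:   # ship rule (guardrail checks omitted here)
        return "stop: ship B"
    if np.mean(lift < 0) > 0.90:       # reject rule
        return "stop: reject B"
    if lo > -eps and hi < eps:         # inconclusive rule: practically negligible
        return "stop: negligible either way"
    if day >= max_days:                # time cap: decide with the posterior you have
        return "stop: time cap reached"
    return "continue"

# Example check with hypothetical cumulative counts on day 7:
print(check_day(7, 4_000, 400, 4_200, 441))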
Guardrails and Multi-Metric Decisions
Why a single “winner” metric is not enough
Real experiments rarely optimize only one metric. A checkout change might increase conversion but also increase refund rate. A recommendation tweak might raise clicks but reduce long-term retention. Bayesian A/B testing shines when you treat the decision as multi-objective: maximize expected value while controlling risk on guardrails.
Two practical patterns
Pattern 1: Primary metric with probabilistic guardrails. Decide based on the primary metric, but require that each guardrail has low probability of unacceptable harm. For example: ship if P(revenue lift > $0.01) > 0.85 AND P(refund rate increase > 0.2%) < 0.05 AND P(latency increase > 50ms) < 0.10.
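A minimal sketch of Pattern 1, assuming you already have posterior draws of the primary lift and of each guardrail's change; the arrays below are simulated placeholders standing in for real posteriors:

import numpy as np

rng = np.random.default_rng(1)
M = 100_000
# Simulated placeholders; in practice these are posterior draws from your models.
revenue_lift = rng.normal(0.015, 0.010, size=M)          # $ per user
refund_rate_change = rng.normal(0.0005, 0.0010, size=M)
latency_change_ms = rng.normal(10, 25, size=M)

ship = (
    np.mean(revenue_lift > 0.01) > 0.85                 # primary metric rule
    and np.mean(refund_rate_change > 0.002) < 0.05      # refund-rate guardrail
    and np.mean(latency_change_ms > 50) < 0.10          # latency guardrail
)
print("ship" if ship else "hold")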
Pattern 2: Convert everything to expected utility. Translate metrics into a single utility function, such as expected profit per user: U = revenue − cost_of_support − expected_refund_cost − latency_penalty. Then compute the posterior distribution of U for A and B and decide based on P(UB > UA) and expected regret in utility units. This requires more modeling assumptions but produces decisions that align tightly with business value.
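A minimal sketch of Pattern 2, assuming posterior draws of each utility component per variant; the component names, penalty term, and simulated draws are illustrative:

import numpy as np

def utility(revenue, support_cost, refund_cost, latency_ms, latency_penalty_per_ms=0.0001):
    # Expected profit per user in dollars; the latency penalty is an illustrative proxy cost.
    return revenue - support_cost - refund_cost - latency_penalty_per_ms * latency_ms

rng = np.random.default_rng(2)
M = 100_000
# Simulated placeholder draws for each component and variant.
U_A = utility(rng.normal(1.00, 0.05, M), rng.normal(0.05, 0.01, M),
              rng.normal(0.03, 0.01, M), rng.normal(200, 20, M))
U_B = utility(rng.normal(1.03, 0.05, M), rng.normal(0.05, 0.01, M),
              rng.normal(0.04, 0.01, M), rng.normal(220, 20, M))

print(np.mean(U_B > U_A))                   # P(U_B > U_A)
print(np.mean(np.maximum(0.0, U_A - U_B)))  # expected regret of choosing B, in $ per user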
Beyond Binary Conversion: Revenue per User and Heavy Tails
Why revenue is harder than conversion
Revenue per user is often skewed: many users spend $0, a few spend a lot. Simple normal approximations can be fragile. A Bayesian workflow based on posterior sampling still works, but you need a model that can handle skew and outliers or a robust approach.
Practical modeling options
- Two-part model: Model purchase incidence (did the user buy?) and purchase amount (how much, given they bought). Then combine them to get revenue per user. This separates “more buyers” from “bigger baskets.” A minimal sketch follows this list.
- Robust likelihood: Use a distribution with heavier tails (e.g., Student-t) for log-revenue or for spend among purchasers.
- Nonparametric bootstrap with Bayesian flavor: In some settings, teams approximate the posterior of the mean revenue using resampling and a weak prior; while not fully Bayesian, it can be a pragmatic bridge when parametric assumptions are contentious.
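A minimal sketch of the two-part approach for one variant, assuming a Beta model for purchase incidence and a lognormal model for spend among purchasers with a known log-scale standard deviation (a deliberate simplification); the counts and spend data are hypothetical:

import numpy as np

rng = np.random.default_rng(3)
M = 100_000

# Hypothetical data for one variant.
users = 10_000
buyers = 800
log_spend = rng.normal(3.0, 1.0, size=buyers)   # stand-in for log of observed spend among buyers

# Part 1 -- purchase incidence: Beta(1, 1) prior + binomial likelihood.
p_buy = rng.beta(1 + buyers, 1 + users - buyers, size=M)

# Part 2 -- spend given purchase: treat log-spend as Normal with known sigma (a simplification);
# with a flat prior, the posterior of the log-mean is Normal(mean(log_spend), sigma^2 / n).
sigma = 1.0                                     # assumed known log-scale sd
mu = rng.normal(log_spend.mean(), sigma / np.sqrt(buyers), size=M)

# Combine: revenue per user = P(buy) * E[spend | buy] under the lognormal assumption.
rev_per_user = p_buy * np.exp(mu + 0.5 * sigma**2)
# Repeat for the other variant, then feed both draw arrays into the usual decision layer.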
Regardless of the model, the decision layer stays the same: compute posterior draws of the metric (or utility), then compute P(exceeds threshold), expected lift, and regret.
Handling Multiple Variants and “Winner’s Curse”
Multi-armed experiments
When you test A vs B vs C vs D, the probability that the best-looking variant is truly best is lower than it appears, especially early. Bayesian analysis can quantify this directly: compute P(variant k is best) and expected regret for choosing each variant. This naturally accounts for uncertainty without relying on multiple-comparison corrections framed around p-values.
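A minimal sketch, assuming one column of posterior draws per variant (the counts are hypothetical):

import numpy as np

rng = np.random.default_rng(4)
M = 100_000
# Placeholder posterior draws for variants A, B, C, D (one column each), hypothetical counts.
draws = np.column_stack([
    rng.beta(1 + c, 1 + n - c, size=M)
    for n, c in [(5_000, 500), (5_000, 520), (5_000, 515), (5_000, 490)]
])

best = draws.argmax(axis=1)                                 # index of the best variant per draw
p_best = np.bincount(best, minlength=draws.shape[1]) / M    # P(variant k is best)

# Expected regret of committing to variant k: average shortfall versus the per-draw best.
regret = np.mean(draws.max(axis=1, keepdims=True) - draws, axis=0)

print(p_best, regret)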
Practical selection rules
- Pick the variant with highest expected utility if its expected regret is below a threshold.
- Pick the simplest variant among those whose expected utility is within a small margin of the best (a “good enough” rule that reduces overfitting to noise).
- Require a probability-of-being-best threshold (e.g., > 0.70) only when the cost of a wrong choice is high; otherwise use regret-based rules.
Implementation Template: Monte Carlo Decision Metrics
The following code sketch (Python with NumPy) shows the analysis layer that sits on top of your posterior sampler. It assumes you can generate draws for the metric of each variant (conversion rate, revenue per user, utility, etc.).
import numpy as np

# Inputs: NumPy arrays of posterior draws for the metric under A and B (same length M):
#   pA, pB
# Practical threshold delta (absolute units) and choose-B regret threshold r_max.

lift = pB - pA
prob_superiority = np.mean(pB > pA)
prob_exceeds_delta = np.mean(lift > delta)
expected_lift = np.mean(lift)
ci_lift = np.quantile(lift, [0.05, 0.95])            # 90% credible interval for the lift
regret_choose_B = np.mean(np.maximum(0.0, pA - pB))  # expected regret if you ship B
prob_loss = np.mean(pB < pA)

# Decision rule example
if prob_exceeds_delta > 0.90 and regret_choose_B < r_max:
    decision = "ship B"
elif prob_loss > 0.90:
    decision = "reject B"
else:
    decision = "continue / inconclusive"

This template is intentionally metric-agnostic. The modeling work is in generating the pA and pB draws; the decision work is in choosing δ, r_max, and any guardrail constraints.
Common Failure Modes (and How to Avoid Them)
Failure mode: treating P(B > A) as a binary gate
If you ship whenever P(B > A) > 0.95, you will sometimes ship tiny improvements that are not worth it and occasionally ship changes with meaningful downside risk. Fix this by using δ and regret, not just superiority.
Failure mode: changing thresholds mid-experiment
Bayesian updating does not protect you from moving goalposts. If you lower δ or relax guardrails after seeing the data, you are no longer following the original decision policy. Fix this by writing the decision thresholds before launch and treating changes as a new decision problem.
Failure mode: ignoring heterogeneous effects
An overall positive lift can hide losses in key segments (e.g., new users, mobile users, high-value customers). A practical approach is to compute segment-level posteriors and then apply guardrail-like constraints: ship only if the probability of harm in critical segments is low. When segments are many, use partial pooling so you do not overreact to noisy segment estimates.
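A simplified sketch of a segment-level guardrail check for a conversion metric, assuming independent Beta(1, 1) posteriors per segment and hypothetical counts. It deliberately omits the partial pooling mentioned above, so small segments will look noisier than a pooled model would suggest:

import numpy as np

rng = np.random.default_rng(5)
M = 100_000
harm = -0.005        # harm threshold: an absolute drop of 0.5 pp or worse
max_harm_prob = 0.10

# Hypothetical per-segment counts: (users_A, conv_A, users_B, conv_B).
segments = {
    "new_users": (3_000, 240, 3_000, 246),
    "mobile": (5_000, 450, 5_000, 470),
    "high_value": (800, 120, 800, 112),
}

for name, (nA, cA, nB, cB) in segments.items():
    pA = rng.beta(1 + cA, 1 + nA - cA, size=M)
    pB = rng.beta(1 + cB, 1 + nB - cB, size=M)
    p_harm = np.mean(pB - pA < harm)   # probability of a practically meaningful drop
    print(name, round(p_harm, 3), "OK" if p_harm < max_harm_prob else "blocked")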
Failure mode: optimizing a proxy metric without a utility link
Click-through rate might increase while revenue decreases. Bayesian A/B testing will faithfully quantify uncertainty in the proxy, but it cannot fix a misaligned objective. Fix this by either choosing a better primary metric or building a utility function that connects proxies to value.
Choosing What to Report to Stakeholders
For decision-makers, a compact Bayesian A/B report typically includes: (1) expected lift with a credible interval; (2) P(lift > 0) and P(lift > δ); (3) probability of loss and expected regret; (4) guardrail risk probabilities; (5) the decision rule and whether it was met. This shifts the conversation from “Is it significant?” to “Is it worth it, and how risky is it?”