Why decision metrics matter after you have a posterior
In an experiment (A/B test, pricing test, onboarding change), a posterior distribution gives you uncertainty about each variant’s performance. But a posterior by itself does not tell you what to do. Decision metrics translate posterior uncertainty into action-oriented quantities: “How likely is it that B beats A?”, “What is the expected cost if we choose B and it turns out to be worse?”, and “What minimum improvement do we require before shipping?”. This chapter focuses on three practical metrics that answer those questions: probability of superiority, expected loss (also called expected regret), and decision thresholds.
These metrics are especially useful when you must decide under constraints: limited traffic, limited engineering capacity, risk tolerance, or asymmetric costs (e.g., a small conversion gain is not worth a large risk of revenue loss). They also help you communicate decisions to stakeholders without relying on p-values or binary significance language.
Setup and notation used throughout
Assume you ran an experiment with two variants, A (control) and B (treatment). Let θA and θB be the true (unknown) performance parameters you care about. For concreteness, think of θ as a conversion rate, but the same ideas apply to average order value, retention, churn, time-to-complete, or any metric with a posterior distribution. After observing data, you have a posterior distribution for θA and θB. In practice you often work with posterior samples: draw many pairs (θA(s), θB(s)) from the joint posterior, for s = 1…S. From these samples you can compute decision metrics by simple counting and averaging.
Define the lift (difference) as Δ = θB − θA. Sometimes you care about relative lift, ρ = (θB − θA)/θA, but difference is easier for expected loss calculations. If your metric is “lower is better” (e.g., churn), you can flip signs or define Δ = θA − θB so that “positive is better” remains consistent.
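For a concrete footing, here is a minimal sketch of how such paired posterior samples might be produced, assuming a simple beta-binomial conversion model with Beta(1, 1) priors and made-up counts; the later snippets in this chapter reuse the theta_a, theta_b, and delta arrays defined here.

import numpy as np

rng = np.random.default_rng(42)
S = 10_000  # number of posterior draws

# Hypothetical observed data: conversions and visitors per variant
conv_a, n_a = 1_210, 50_000
conv_b, n_b = 1_302, 50_000

# With Beta(1, 1) priors, the conversion-rate posteriors are Beta distributions
theta_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=S)
theta_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=S)

# Lift on the "positive is better" scale
delta = theta_b - theta_a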
Probability of superiority (PoS)
What it is
Probability of superiority is the posterior probability that one variant is better than the other. For B vs A, it is PoS = P(θB > θA | data). This is a direct answer to a common stakeholder question: “What’s the chance B beats A?” Unlike a p-value, it is a probability about the parameter given the data and model.
How to compute it from posterior samples
If you have posterior samples, compute PoS by counting how often B exceeds A:
PoS = (1/S) * sum_{s=1..S} I(θB(s) > θA(s))

where I(·) is an indicator function that equals 1 when the condition is true and 0 otherwise. If 9,200 out of 10,000 samples have θB(s) > θA(s), then PoS ≈ 0.92.
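As a minimal sketch, continuing from the setup code above (the paired theta_a and theta_b arrays):

pos = np.mean(theta_b > theta_a)  # fraction of paired draws where B beats A
print(f"P(theta_B > theta_A | data) = {pos:.3f}")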
Step-by-step workflow
- Step 1: Fit your Bayesian model for the metric of interest and obtain posterior samples for θA and θB (or directly for Δ).
- Step 2: For each posterior draw s, compare θB(s) and θA(s).
- Step 3: Average the indicator results to get PoS.
- Step 4: Report PoS alongside a practical effect size summary (e.g., median Δ and a credible interval for Δ) so “likely better” is not confused with “meaningfully better”.
Practical interpretation and common pitfalls
PoS is easy to understand, but it can be misleading if used alone. A variant can have PoS = 0.95 while the expected improvement is tiny, especially with large sample sizes or low-variance metrics. Conversely, PoS might be 0.70 with a potentially large upside, which could still be worth taking if the downside is limited. This is why PoS is best treated as a “directional confidence” metric, not a complete decision rule.
Another pitfall is ignoring business impact. If θ is conversion rate, a 0.1% absolute lift may be negligible for a small product but huge for a high-traffic business. PoS does not encode value; expected loss and thresholds help incorporate it.
Expected loss (expected regret) as a decision metric
What it is
Expected loss quantifies the average penalty you expect to incur by choosing a variant, given posterior uncertainty. It answers: “If we ship B, how much do we expect to lose compared to the best choice, because we might be wrong?” This is closer to how real decisions are made: you choose an action, and if reality differs from your belief, you pay a cost.
To define expected loss, you need a loss function. A simple and widely useful one is regret relative to the better variant in each posterior draw. If you choose B, then in a given draw s your regret is max(0, θA(s) − θB(s)) because you lose only when B is worse than A, and the loss magnitude is the performance gap. The expected loss of choosing B is the posterior average of that regret.
Computing expected loss from posterior samples
Let L(B) be expected loss if you choose B:
L(B) = (1/S) * sum_{s=1..S} max(0, θA(s) - θB(s))

Similarly, expected loss for choosing A is:

L(A) = (1/S) * sum_{s=1..S} max(0, θB(s) - θA(s))

Notice a useful symmetry: L(A) is the expected upside you miss if B is actually better, while L(B) is the expected downside you suffer if B is actually worse.
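Continuing the same sketch, both expected losses are one-line averages over the paired draws:

# Regret of shipping B: you lose only in draws where B is actually worse
loss_b = np.mean(np.maximum(0.0, theta_a - theta_b))
# Regret of keeping A: the upside you forgo in draws where B is actually better
loss_a = np.mean(np.maximum(0.0, theta_b - theta_a))
print(f"L(A) = {loss_a:.5f}, L(B) = {loss_b:.5f}")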
Step-by-step workflow
- Step 1: Obtain posterior samples (θA(s), θB(s)).
- Step 2: For each draw, compute regret if choosing A and regret if choosing B.
- Step 3: Average regrets across draws to get L(A) and L(B).
- Step 4: Choose the variant with smaller expected loss, unless other constraints apply (engineering cost, rollout risk, fairness constraints, etc.).
Turning metric loss into business loss
Expected loss becomes more actionable when expressed in business units. Suppose θ is conversion rate and each conversion yields an average contribution margin of $m. If traffic volume over the decision horizon is N users, then an absolute conversion-rate gap of g corresponds to expected margin difference of approximately N * m * g. You can scale regret draws by N*m to express expected loss in dollars.
Example: You plan to ship the winner for the next 30 days, expecting N = 2,000,000 eligible users and margin per conversion m = $12. If L(B) in conversion-rate units is 0.0004 (0.04% absolute), then expected dollar loss of choosing B is about 2,000,000 * 12 * 0.0004 = $9,600 over that horizon. This framing makes trade-offs explicit: maybe $9,600 is acceptable given speed-to-market, or maybe it is too high relative to the expected upside.
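Using the example's horizon and unit economics (hypothetical numbers), the conversion-rate regret scales to dollars as follows, continuing the sketch above:

N = 2_000_000  # eligible users over the 30-day decision horizon
m = 12.0       # contribution margin per conversion, in dollars

dollar_loss_b = N * m * loss_b  # expected dollar loss of shipping B
print(f"Expected dollar loss of choosing B: ${dollar_loss_b:,.0f}")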
Asymmetric losses and risk sensitivity
Many decisions have asymmetric costs. A small drop in conversion might be more painful than an equally sized gain is beneficial (e.g., due to brand risk, customer support load, or contractual obligations). You can encode this by weighting negative outcomes more heavily. For instance, define regret for choosing B as w * max(0, θA(s) − θB(s)), where w > 1 increases the penalty for being worse than control. This pushes decisions toward safer options when downside risk matters.
You can also use nonlinear loss functions. For example, if drops beyond a certain point are catastrophic, you can apply a convex penalty that grows faster than linearly. The key is that expected loss is not a single formula; it is a framework: pick a loss function that matches your business reality, then average it under the posterior.
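As a sketch of how such losses might look in code (the weight and the quadratic penalty are illustrative choices, not prescriptions):

drop = np.maximum(0.0, theta_a - theta_b)  # per-draw shortfall of B versus A

w = 3.0                                    # downside weight; w > 1 penalizes regressions more
weighted_loss_b = np.mean(w * drop)

quadratic_loss_b = np.mean(drop ** 2)      # convex penalty: large drops dominate the average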
Thresholds: defining “worth it” before you decide
Why thresholds are needed
Even if B is likely better, you might not want to ship it unless the improvement is large enough to justify implementation cost, opportunity cost, or risk. Thresholds formalize “minimum practical effect” and “maximum acceptable risk.” They prevent teams from shipping changes that are statistically persuasive but operationally irrelevant.
Thresholds can be applied to effect size (Δ), to probability statements (PoS), to expected loss, or to combinations of these. The best threshold depends on your decision context: one-way door vs two-way door decisions, rollout reversibility, and the cost of being wrong.
Effect-size thresholds (minimum practical effect)
An effect-size threshold sets a minimum improvement δ that must be exceeded to consider B a win. Instead of asking P(Δ > 0), you ask P(Δ > δ). This is often called probability of exceeding a practical threshold.
P_practical = P(Δ > δ | data) ≈ (1/S) * sum_{s=1..S} I(Δ(s) > δ)

Example: If δ = 0.002 (0.2% absolute conversion lift) and P(Δ > δ) = 0.80, you might decide B is promising but not yet strong enough to ship if your policy requires 0.90. This is more aligned with business value than PoS alone.
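Continuing the sketch, with δ chosen as a policy input:

delta_min = 0.002                        # minimum practical lift (0.2% absolute), a policy choice
p_practical = np.mean(delta > delta_min)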
Risk thresholds (guardrails on harm)
Sometimes you care less about “is it better?” and more about “how likely is it to be meaningfully worse?” A risk threshold sets a maximum acceptable probability that B is worse than A by more than some harm tolerance h. That is, require P(Δ < −h) to be small.
Risk_harm = P(Δ < -h | data) ≈ (1/S) * sum_{s=1..S} I(Δ(s) < -h)

Example: You might require Risk_harm ≤ 0.05 for a checkout change, where h = 0.001 (0.1% absolute conversion drop). This creates a safety-first rule: even if upside exists, you do not ship if there is too much probability of a meaningful drop.
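And the corresponding guardrail, again with h as a policy input:

h = 0.001                                # harm tolerance (0.1% absolute drop), a policy choice
risk_harm = np.mean(delta < -h)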
Expected-loss thresholds (value of information thinking)
Expected loss can also be used as a stopping and decision threshold. A common policy is: “Choose the variant with lower expected loss once the minimum expected loss is below a tolerance.” The tolerance can be expressed in metric units or dollars.
Example policy: Continue the experiment until min(L(A), L(B)) < $5,000 expected loss over the rollout horizon, then ship the lower-loss variant. This ties the decision to the cost of uncertainty. If you can reduce expected loss quickly by collecting more data, you keep running; if additional data barely changes expected loss, you stop.
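A sketch of such a stopping rule, reusing the dollar-scaled losses from the snippets above (the $5,000 tolerance is the example policy, not a universal default):

tolerance_dollars = 5_000.0
min_dollar_loss = N * m * min(loss_a, loss_b)

if min_dollar_loss < tolerance_dollars:
    decision = "ship A" if loss_a < loss_b else "ship B"
else:
    decision = "keep collecting data"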
Putting the metrics together: practical decision recipes
Recipe 1: Fast-moving product changes (two-way door)
For reversible changes (UI copy, minor layout), you can prioritize speed while controlling downside. A practical rule set might be: ship B if PoS ≥ 0.80, Risk_harm (with a small h) ≤ 0.10, and expected loss of choosing B is below a small dollar threshold over a short horizon. This combination avoids shipping something likely harmful while not demanding extreme certainty.
- Step 1: Choose harm tolerance h (e.g., 0.05% absolute conversion) and compute Risk_harm.
- Step 2: Compute PoS to ensure directionality.
- Step 3: Compute L(B) in dollars over the next week or two.
- Step 4: Ship if all thresholds pass; otherwise keep testing or abandon.
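Putting Recipe 1 together, a sketch of the combined check (all thresholds are illustrative policy values, not defaults to copy):

def ship_fast_moving(pos, risk_harm, dollar_loss_b,
                     pos_min=0.80, risk_max=0.10, loss_max=1_000.0):
    # Ship only if direction, downside probability, and expected dollar loss all pass
    return pos >= pos_min and risk_harm <= risk_max and dollar_loss_b <= loss_max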
Recipe 2: High-stakes changes (one-way door)
For changes that are hard to roll back (pricing, core funnel redesign, policy changes), require stronger evidence and lower expected loss. Here you might set δ to a meaningful improvement and demand high probability of exceeding it, while also demanding very low probability of meaningful harm.
- Step 1: Set δ based on implementation cost and strategic value (e.g., at least +0.5% absolute conversion or +$0.20 revenue per user).
- Step 2: Require P(Δ > δ) ≥ 0.90 or 0.95.
- Step 3: Set a strict harm guardrail: P(Δ < −h) ≤ 0.01.
- Step 4: Evaluate expected loss in dollars over the full expected lifetime until the next major change; require it below a conservative tolerance.
Recipe 3: When variants have different costs
Sometimes B is more expensive to run (e.g., higher incentive cost, slower infrastructure, increased fraud risk). Then the decision should be based on net value, not raw metric lift. You can incorporate this by redefining θ as a net metric (e.g., profit per user) or by subtracting a known per-user cost c from B in each draw: θB_net(s) = θB(s) − c. Then compute PoS, expected loss, and thresholds on θ_net. This keeps the decision logic identical while ensuring you are optimizing what you actually care about.
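A sketch of the cost adjustment; it assumes θA and θB are posterior draws of a per-user value metric (e.g., profit per user), so that a hypothetical per-user cost c lives on the same scale:

c = 0.05  # hypothetical extra cost of B per user, in the same units as theta
theta_b_net = theta_b - c
delta_net = theta_b_net - theta_a

pos_net = np.mean(delta_net > 0)                    # PoS on the net metric
loss_b_net = np.mean(np.maximum(0.0, -delta_net))   # expected loss of choosing B, net of cost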
Worked example with posterior samples (end-to-end)
Scenario
You tested a new onboarding flow (B) against control (A). Your model produced S = 10,000 posterior draws of conversion rates θA(s) and θB(s). You must decide whether to ship B for the next 14 days. Expected eligible traffic is N = 600,000 users over 14 days, and margin per conversion is m = $15. Implementation cost is already sunk, but you want to avoid shipping if there is more than a 5% chance of losing at least 0.15% absolute conversion.
Step 1: Compute PoS
Compute PoS = mean(I(θB(s) > θA(s))). Suppose you get PoS = 0.88. Directionally, B is likely better.
Step 2: Compute practical threshold probability
You decide that a lift smaller than δ = 0.001 (0.10% absolute) is not worth operational attention. Compute P(Δ > δ). Suppose P(Δ > 0.001) = 0.62. This suggests that while B is probably better, the improvement may often be small.
Step 3: Compute harm risk guardrail
Set h = 0.0015 (0.15% absolute drop). Compute Risk_harm = P(Δ < −0.0015). Suppose Risk_harm = 0.03. This passes your 5% risk tolerance.
Step 4: Compute expected loss in dollars
Compute L(B) = mean(max(0, θA(s) − θB(s))). Suppose L(B) = 0.00035 in conversion units. Convert to dollars over 14 days: $Loss = N * m * L(B) = 600,000 * 15 * 0.00035 = $3,150. If your acceptable expected loss tolerance is, say, $5,000 for a two-week decision, then shipping B is acceptable from a downside-risk perspective.
Step 5: Decide using a clear policy
A reasonable policy for this scenario could be: ship B if Risk_harm ≤ 0.05 and expected dollar loss ≤ $5,000, even if P(Δ > δ) is moderate, because the change is reversible and the downside is controlled. Under the computed values, B ships. If the change were hard to roll back, you might instead require P(Δ > δ) ≥ 0.90, in which case you would continue the experiment.
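A compact end-to-end sketch of this worked example, assuming theta_a and theta_b hold the 10,000 paired posterior draws from your onboarding model (the thresholds are the ones stated in the scenario):

N, m = 600_000, 15.0                          # 14-day traffic and margin per conversion
delta = theta_b - theta_a

pos = np.mean(delta > 0)                      # Step 1: directional confidence
p_practical = np.mean(delta > 0.001)          # Step 2: P(lift > 0.10% absolute)
risk_harm = np.mean(delta < -0.0015)          # Step 3: P(drop worse than 0.15% absolute)
loss_b = np.mean(np.maximum(0.0, -delta))     # Step 4: expected loss in conversion units
dollar_loss_b = N * m * loss_b                #         ...and in dollars over 14 days

ship_b = (risk_harm <= 0.05) and (dollar_loss_b <= 5_000)  # Step 5: reversible-change policy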
Implementation notes: computing metrics reliably
Use paired samples when possible
If your model produces a joint posterior where θA and θB are correlated (common in hierarchical or shared-parameter models), compute Δ(s) using paired draws from the same iteration. Do not independently shuffle samples between A and B; you would destroy correlation structure and distort PoS and expected loss.
Choose S large enough for stable tail probabilities
Risk guardrails often depend on tail events like P(Δ < −h). Estimating small probabilities (e.g., 1%) requires enough posterior draws. With 10,000 draws, the Monte Carlo standard error for a 1% probability is roughly sqrt(p(1-p)/S) ≈ 0.001, which is usually acceptable. With only 1,000 draws, tail estimates can be noisy.
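For reference, the back-of-the-envelope check used above:

import math

p, S = 0.01, 10_000
mc_se = math.sqrt(p * (1 - p) / S)  # ≈ 0.000995, i.e. about 0.001 for a 1% tail probability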
Report multiple metrics to prevent misinterpretation
A compact decision dashboard for an experiment often includes: PoS, median Δ, P(Δ > δ), P(Δ < −h), and expected loss in business units. Each metric answers a different question: direction, magnitude, practical value, downside risk, and cost of uncertainty. Together they support decisions that are both data-driven and aligned with business constraints.