1) What a confidence interval represents (operationally) and what it does not
A confidence interval (CI) is best understood as a procedure that produces an interval from your sample data. If you were to repeat the same experiment many times under the same conditions (same randomization, same sample sizes, same measurement rules), and each time compute a 95% CI using the same method, then about 95% of those intervals would contain the true underlying effect (e.g., the true difference in conversion rates between variant and control).
This is the procedure-based interpretation: the randomness is in the data and the interval, not in the fixed (but unknown) true effect.
What a CI does not mean
- Not: “There is a 95% probability the true effect lies in this specific interval.” (That statement is Bayesian; a frequentist CI does not assign probability to the fixed parameter.)
- Not: “If the CI includes 0, there is no effect.” It means the data are consistent with effects on both sides of 0 at the chosen confidence level; the interval may still include practically meaningful effects.
- Not: “A narrower CI means a larger effect.” Width is about uncertainty, not magnitude.
- Not: “95% CI guarantees 95% chance of making the right decision.” Decision quality also depends on costs, thresholds, and how you act on the interval.
Operationally in A/B testing, a CI is a compact way to answer: Which effect sizes are reasonably compatible with what we observed? That makes it a natural tool for decision-making when paired with practical thresholds (e.g., minimum worthwhile lift, maximum acceptable harm).
2) Building intuition for interval width: sample size, variance, and allocation
Most CIs for A/B effects follow the same skeleton:
estimate ± (critical value) × (standard error)

For a 95% CI under common approximations, the critical value is about 1.96 (from the normal distribution). The standard error (SE) is the key driver of width.
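In code, this skeleton is only a couple of lines; a minimal Python sketch (the function name is ours, not from any particular library):

```python
def normal_ci(estimate, se, z=1.96):
    # Generic "estimate ± critical value × standard error" interval.
    # z ≈ 1.96 corresponds to a 95% confidence level under a normal approximation.
    return estimate - z * se, estimate + z * se
```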
Sample size: more data → smaller SE → narrower CI
For many estimators, SE shrinks roughly like 1/√n. Doubling the sample size does not halve the CI width; it reduces it by about 1/√2 ≈ 0.707. To cut width in half, you need about 4× the sample size.
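A quick numeric check of the 1/√n behaviour, using the standard error of a single proportion (the 12% baseline rate is arbitrary):

```python
import math

def prop_se(p, n):
    # Standard error of a sample proportion under the normal approximation.
    return math.sqrt(p * (1 - p) / n)

base = prop_se(0.12, 10_000)
print(prop_se(0.12, 20_000) / base)  # ≈ 0.707: doubling n shrinks SE by 1/√2
print(prop_se(0.12, 40_000) / base)  # ≈ 0.5: quadrupling n halves the SE (and CI width)
```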
Variance: noisier metrics → wider CI
Binary outcomes (converted vs not) have variance tied to the conversion rate. Continuous outcomes (like revenue per user) can have heavy tails and high variance, often making intervals wider for the same n.
- For proportions, variance is approximately p(1−p), largest near p = 0.5.
- For means, variance is the population variance of the metric; outliers and skew increase it.
Allocation: balanced groups usually minimize SE
For a fixed total sample size N, splitting traffic close to 50/50 typically minimizes the SE of a difference (in proportions or means). Unequal allocation can be justified (e.g., risk control, ramping), but it generally widens the CI for the same total N.
Rule of thumb: if you move from 50/50 to 90/10 while keeping N fixed, the smaller group dominates uncertainty and the CI widens substantially.
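To see this numerically, here is a small sketch comparing a 50/50 and a 90/10 split at the same total N; the 12% baseline rate is illustrative:

```python
import math

def diff_se(p, n_a, n_b):
    # SE of a difference in proportions, assuming both arms sit near rate p.
    return math.sqrt(p * (1 - p) / n_a + p * (1 - p) / n_b)

p = 0.12
print(diff_se(p, 10_000, 10_000))  # 50/50 split of N = 20,000: ≈ 0.0046
print(diff_se(p, 18_000, 2_000))   # 90/10 split of the same N: ≈ 0.0077, roughly 1.7× wider CI
```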
Confidence level: higher confidence → wider interval
90% CIs are narrower than 95% CIs; 99% CIs are wider. Choose the confidence level to match how conservative you want to be about uncertainty, and keep it consistent in reporting.
3) Reading intervals for difference and for lift (and handling sign changes)
Difference scale (absolute effect)
For conversion, the absolute difference is:
Δ = p_variant − p_control

A CI for Δ directly answers: “How many percentage points could we plausibly be up or down?” This is often the clearest scale for product decisions because it maps to absolute impact.
Lift scale (relative effect)
Relative lift is:
Lift = (p_variant − p_control) / p_control

Intervals for lift can be useful for communicating proportional change, but they can be tricky when:
- The baseline is small (a tiny p_control can make lift look huge and unstable).
- The difference crosses 0. If the CI for Δ includes 0, then the lift CI typically includes 0 as well, but the relative scale can exaggerate asymmetry.
Practical tip: compute and decide on the difference scale, then translate to lift for communication.
Sign changes and “directional ambiguity”
If a CI spans negative to positive values (e.g., [−0.4pp, +1.2pp]), the data are consistent with both harm and benefit. This is not “no effect”; it is “uncertain direction.” In such cases, decisions should be driven by whether the interval includes effects you consider unacceptable (harm) and whether it includes effects you consider worthwhile (benefit).
Practical equivalence (ROPE-style thinking)
Many teams implicitly have a “close enough to zero” zone where differences are not worth acting on. Define a practical equivalence band around 0, such as [−0.1pp, +0.1pp] for conversion, or [−$0.02, +$0.02] for revenue per user. Then interpret:
- CI entirely inside the band → effect is practically negligible (even if statistically non-zero in some settings).
- CI entirely outside the band on the positive side → practically meaningful improvement.
- CI overlaps the band → cannot rule out negligible effect.
This complements (and often improves on) binary “significant/not significant” thinking by directly encoding what matters to the business.
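One way to make this concrete is a small helper that classifies a CI against the band; this is a sketch of the idea, not a standard library function:

```python
def rope_verdict(ci_low, ci_high, band_low, band_high):
    # Compare a confidence interval with a practical-equivalence band around 0.
    if band_low <= ci_low and ci_high <= band_high:
        return "practically negligible"
    if ci_low > band_high:
        return "practically meaningful improvement"
    if ci_high < band_low:
        return "practically meaningful harm"
    return "overlaps the band: cannot rule out a negligible effect"

# Illustrative call with a ±0.1pp band on a conversion-rate difference:
print(rope_verdict(0.0008, 0.0192, -0.001, 0.001))  # overlaps the band
```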
4) Decision framing: ship when the interval excludes harmful effects and meets a minimum practical effect
A robust decision rule uses two thresholds:
- Harm threshold (guardrail): the worst acceptable effect (often 0 for primary metrics, or a small negative tolerance if trade-offs exist).
- Minimum practical effect (MPE): the smallest improvement worth shipping (similar in spirit to an MDE, but used for decision-making rather than planning).
Example decision rule on the difference scale
Let Δ be the estimated effect (variant − control) and [L, U] its 95% CI.
- Ship if L ≥ MPE (you can be confident the improvement is at least the minimum worthwhile).
- Do not ship if U ≤ Harm (you can be confident it is harmful beyond tolerance).
- Inconclusive / iterate otherwise (collect more data, refine the variant, or decide based on costs and priors).
Commonly, Harm = 0 for the primary metric, and MPE is a positive threshold like +0.2pp conversion or +$0.05 revenue per user.
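As a sketch, the whole rule fits in a few lines; the threshold names mirror the text above and are not a standard API:

```python
def ship_decision(ci_low, ci_high, mpe, harm=0.0):
    # Ship only when the entire CI clears the minimum practical effect (MPE);
    # stop when the entire CI sits at or below the harm threshold.
    if ci_low >= mpe:
        return "ship"
    if ci_high <= harm:
        return "do not ship"
    return "inconclusive: collect more data, refine the variant, or weigh costs"
```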
Why this is better than “CI excludes 0”
“Excludes 0” answers whether you can rule out exactly zero effect, which is rarely the real question. The real question is usually: “Can we rule out unacceptable harm, and can we guarantee enough upside to justify rollout?” The CI supports both.
5) Common reporting formats and how to avoid misleading visuals
Recommended reporting table
| Metric | Control | Variant | Δ (abs) | 95% CI for Δ | Lift | Decision thresholds |
|---|---|---|---|---|---|---|
| Conversion | p_c | p_v | p_v − p_c | [L, U] | (p_v − p_c)/p_c | MPE, Harm |
| Revenue/user | μ_c | μ_v | μ_v − μ_c | [L, U] | (μ_v − μ_c)/μ_c | MPE, Harm |
Include both absolute and relative effects, but make the CI prominent on the absolute difference.
Forest plot (dot-and-whisker) done right
A clean visual is a forest plot: point estimate with horizontal CI bars, centered on 0. To avoid misleading impressions:
- Use a common x-axis scale across metrics when comparing variants; otherwise widths are not comparable.
- Always include a vertical line at 0 and optionally lines for Harm and MPE.
- Do not truncate the axis tightly around the point estimate; it exaggerates certainty.
- Label units clearly (percentage points vs percent lift vs dollars).
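A minimal matplotlib sketch of such a plot; the metrics, estimates, and threshold lines below are invented purely for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical point estimates and 95% CI bounds, in percentage points.
metrics = ["Conversion", "Checkout start", "Day-7 retention"]
est = [1.0, 0.3, -0.2]
low = [0.1, -0.5, -0.9]
high = [1.9, 1.1, 0.5]

fig, ax = plt.subplots()
y = list(range(len(metrics)))
# errorbar expects distances from the estimate, not absolute bounds.
xerr = [[e - l for e, l in zip(est, low)], [h - e for h, e in zip(high, est)]]
ax.errorbar(est, y, xerr=xerr, fmt="o", capsize=4)
ax.axvline(0, linestyle="--")   # zero-effect reference line
ax.axvline(0.5, linestyle=":")  # illustrative MPE line
ax.set_yticks(y)
ax.set_yticklabels(metrics)
ax.set_xlabel("Absolute difference (percentage points)")
plt.show()
```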
Avoid these common pitfalls
- Bar charts of conversion rates without uncertainty: they hide overlap and encourage over-interpretation of small differences.
- Dual axes: can make small changes look large.
- Only reporting lift: can inflate perceived impact when baseline is small.
- Rounding away the story: report enough precision for the CI bounds (e.g., 0.01pp or $0.01 depending on scale).
6) Worked examples (conversion rates and revenue per user)
Example A: Conversion rate difference (two-proportion CI)
Scenario: An onboarding change is tested. Outcome is conversion (yes/no). You want a 95% CI for the absolute difference in conversion rates.
- Control: n_c = 10,000, conversions x_c = 1,200 → p_c = 0.12
- Variant: n_v = 10,000, conversions x_v = 1,300 → p_v = 0.13
Step 1: Compute the point estimate (difference)
Δ = p_v − p_c = 0.13 − 0.12 = 0.01 (i.e., +1.0 percentage point)

Step 2: Compute the standard error (unpooled for CI)

A common CI for the difference in proportions uses the unpooled SE:

SE(Δ) = sqrt( p_c(1−p_c)/n_c + p_v(1−p_v)/n_v )

p_c(1−p_c)/n_c = 0.12 × 0.88 / 10,000 = 0.00001056 (≈ 1.056e−5)
p_v(1−p_v)/n_v = 0.13 × 0.87 / 10,000 = 0.00001131 (≈ 1.131e−5)
SE = sqrt(0.00002187) ≈ 0.00468

Step 3: Apply the 95% critical value

95% CI = Δ ± 1.96 × SE = 0.01 ± 1.96 × 0.00468
Margin ≈ 0.00917
CI ≈ [0.01 − 0.00917, 0.01 + 0.00917] = [0.00083, 0.01917]

Interpretation
- On the absolute scale, the data are consistent with an improvement as small as +0.083pp and as large as +1.917pp (at 95% confidence, in the procedure sense).
- The interval is entirely above 0, so you can rule out harm to conversion at this confidence level.
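Before moving to decision framing, here is a short Python sketch that reproduces the numbers above (the lift conversion at the end is discussed a bit further below):

```python
import math

n_c, x_c = 10_000, 1_200
n_v, x_v = 10_000, 1_300
p_c, p_v = x_c / n_c, x_v / n_v

delta = p_v - p_c
se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
low, high = delta - 1.96 * se, delta + 1.96 * se

print(f"delta = {delta:.5f}, 95% CI = [{low:.5f}, {high:.5f}]")
# Approximate lift bounds, dividing the difference bounds by the control rate:
print(f"lift CI ≈ [{low / p_c:.2%}, {high / p_c:.2%}]")
```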
Decision framing with thresholds
Suppose your minimum practical effect is MPE = +0.5pp and harm threshold is Harm = 0pp.
- Harm check: L = +0.083pp ≥ 0 → passes “no harm” for conversion.
- Worthwhile check: L = +0.083pp is below +0.5pp → you cannot guarantee the improvement is big enough.
Result: statistically positive but not yet decision-secure under the “ship only if CI clears MPE” rule. Options include running longer, improving the variant, or shipping if costs are low and you accept the risk that true lift may be small.
Reading the same result as lift
Point lift: Lift = 0.01 / 0.12 ≈ 8.33%

But for decision-making, keep the CI on the difference scale. If you must communicate lift uncertainty, translate the CI bounds approximately:

Lift bounds (approx) = [0.00083/0.12, 0.01917/0.12] ≈ [0.69%, 15.98%]

Note how the relative numbers look large; this is why absolute pp is often more grounded.
Example B: Revenue per user difference (two-mean CI)
Scenario: A pricing page tweak is tested. Outcome is revenue per user (RPU) over a fixed window. You want a 95% CI for the difference in means.
- Control: n_c = 5,000, sample mean ȳ_c = $1.20, sample SD s_c = $4.00
- Variant: n_v = 5,000, sample mean ȳ_v = $1.35, sample SD s_v = $4.20
Step 1: Point estimate
Δ = ȳ_v − ȳ_c = 1.35 − 1.20 = $0.15

Step 2: Standard error for difference in means

SE(Δ) = sqrt( s_c^2/n_c + s_v^2/n_v )

s_c^2/n_c = 16/5,000 = 0.0032
s_v^2/n_v = 17.64/5,000 = 0.003528
SE = sqrt(0.006728) ≈ 0.0820

Step 3: 95% CI (large-sample normal approximation)

CI = 0.15 ± 1.96 × 0.0820 = 0.15 ± 0.1607
CI ≈ [−$0.01, +$0.31]

Interpretation
- The data are consistent with a small loss (about one cent per user) up to a gain of about 31 cents per user.
- The interval crosses 0, so the direction is uncertain at 95% confidence.
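As with the conversion example, a short sketch reproduces these numbers:

```python
import math

n_c, mean_c, sd_c = 5_000, 1.20, 4.00
n_v, mean_v, sd_v = 5_000, 1.35, 4.20

delta = mean_v - mean_c
se = math.sqrt(sd_c**2 / n_c + sd_v**2 / n_v)
low, high = delta - 1.96 * se, delta + 1.96 * se

print(f"delta = {delta:+.2f}, 95% CI = [{low:+.2f}, {high:+.2f}] dollars per user")
```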
Decision framing with thresholds
Suppose:
- Harm threshold: Harm = −$0.05 (you can tolerate up to a 5-cent loss per user for strategic reasons)
- Minimum practical effect: MPE = +$0.10
- Harm check: the lower bound −$0.01 is above −$0.05 → you can rule out “too harmful” outcomes.
- Worthwhile check: the lower bound −$0.01 is below +$0.10 → you cannot guarantee the gain is at least 10 cents.
Result: safe enough but not proven worthwhile under a conservative shipping rule. If rollout is cheap and reversible, you might run a follow-up test or extend duration to tighten the interval.
How to tighten the RPU interval (what to try next)
- Increase sample size (most direct lever).
- Reduce variance via metric design (e.g., winsorization, log transforms, or using a more stable proxy), if consistent with your measurement policy; see the sketch after this list.
- Balance allocation if it was skewed.
- Segment carefully: if variance differs drastically by segment, pre-planned stratification can improve precision; avoid post-hoc slicing that creates misleading intervals.
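As mentioned in the variance bullet above, here is a minimal sketch of winsorization on a synthetic heavy-tailed revenue sample; the data and the 99th-percentile cap are illustrative assumptions, and in practice the cap should be fixed in your measurement policy before looking at results:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic revenue-per-user sample: mostly zeros, a few whales.
revenue = rng.choice([0.0, 5.0, 200.0], size=10_000, p=[0.90, 0.09, 0.01])

cap = np.percentile(revenue, 99)       # winsorization threshold
winsorized = np.minimum(revenue, cap)  # cap extreme values, keep the rest

# The SD (and hence the SE and CI width) shrinks after capping.
print(revenue.std(ddof=1), winsorized.std(ddof=1))
```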
Checklist for interpreting any CI in an A/B readout
- What is the estimand and scale (absolute difference vs lift)?
- What are the CI bounds in business units (pp, $, minutes)?
- Does the interval include harmful effects beyond your tolerance?
- Does the interval clear your minimum practical effect?
- If inconclusive, is the width driven more by sample size, variance, or allocation?