1) Observed metric as an estimate of a true underlying rate/mean
In an A/B test you observe a metric (conversion rate, average revenue per user, time on task, etc.). What you care about is not the observed number itself, but the true underlying value for the population you are sampling from during the experiment period.
Think of each user outcome as a random draw from a distribution. The observed metric is a summary of those draws:
- Proportion (rate): conversion rate is the sample proportion p̂ = x/n, where x is the number of conversions and n is the number of users.
- Mean: average order value is the sample mean ȳ = (1/n)∑yᵢ.
These are estimators: rules that convert data into a single number intended to approximate an unknown parameter (the true conversion probability p or true mean μ).
Practical step-by-step: mapping “what you see” to “what you want”
- Step 1: Define the unit of analysis (often the user) and the per-unit outcome yᵢ (e.g., yᵢ = 1 if the user converts, else 0).
- Step 2: Collect n outcomes in each variant.
- Step 3: Compute the estimator: p̂ for binary metrics or ȳ for continuous metrics.
- Step 4: Treat that number as an estimate with uncertainty, not as the truth.
A key mental model: the observed metric is one noisy snapshot of what could have happened under the same conditions.
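If you want to see this mapping concretely, here is a minimal Python sketch of Steps 1–4 for one variant; the outcome arrays are made-up example data, not a prescribed implementation:

```python
import numpy as np

# Hypothetical per-user outcomes for one variant (Steps 1-2):
# converted[i] = 1 if user i converted, else 0; revenue[i] = that user's order value.
converted = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
revenue = np.array([25.0, 0.0, 0.0, 40.0, 0.0, 0.0, 0.0, 15.0, 0.0, 0.0])

# Step 3: compute the estimators.
n = len(converted)
p_hat = converted.sum() / n   # sample proportion, estimate of the true conversion probability p
y_bar = revenue.mean()        # sample mean, estimate of the true mean revenue per user

# Step 4: report the estimates, remembering they carry uncertainty.
print(f"n = {n}, p_hat = {p_hat:.3f}, y_bar = {y_bar:.2f}")
```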
2) Variance and standard error: intuitive meaning and why noisy metrics need more data
Variability is why A/B testing needs statistics. Even if Variant B has no real effect, the observed metrics for A and B will almost never match exactly because user outcomes vary.
Variance: how spread out individual outcomes are
Variance measures the typical squared deviation of outcomes from their mean. Intuitively:
- Low variance: users behave similarly; the metric is stable.
- High variance: users differ a lot; the metric is noisy.
Examples:
- Binary conversion (0/1): variability is limited; the per-user variance is p(1-p).
- Revenue: often heavy-tailed (many small purchases, few huge ones), leading to high variance.
Standard error: how uncertain your estimate is
The standard error (SE) is the typical size of random error in the estimator (sample mean or sample proportion). It shrinks as sample size grows.
- For a sample mean: SE(ȳ) ≈ s/√n, where s is the sample standard deviation.
- For a sample proportion: SE(p̂) ≈ √(p̂(1-p̂)/n).
Two practical implications:
- More data reduces uncertainty roughly like 1/√n. To cut SE in half, you need about 4× the sample size (see the sketch after this list).
- Noisier metrics need more data. If outcomes vary widely (large s), SE is larger at the same n.
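A quick sketch of that 1/√n behavior, using an illustrative 10% baseline conversion rate:

```python
import math

p_hat = 0.10  # illustrative baseline conversion rate

# SE(p_hat) shrinks like 1/sqrt(n): each 4x increase in n roughly halves it.
for n in [1_000, 4_000, 16_000]:
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    print(f"n = {n:6d}  SE(p_hat) = {se:.4f}")
```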
Numeric example: same sample size, different noise
Suppose each variant has n=10,000 users.
- Conversion rate: if p̂ = 0.10, then SE ≈ √(0.10×0.90/10,000) ≈ 0.003 (0.3 percentage points).
- Revenue per user: if the standard deviation is s = $20, then SE ≈ 20/√10,000 = $0.20.
Even before comparing A vs B, you can see how “tight” your estimate is likely to be.
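To reproduce the two standard errors above, here is a small Python sketch with the same inputs:

```python
import math

n = 10_000

# Conversion rate with p_hat = 0.10
p_hat = 0.10
se_p = math.sqrt(p_hat * (1 - p_hat) / n)   # ≈ 0.003 (0.3 percentage points)

# Revenue per user with sample standard deviation s = $20
s = 20.0
se_rev = s / math.sqrt(n)                   # = $0.20

print(f"SE(conversion rate) ≈ {se_p:.4f}")
print(f"SE(revenue per user) ≈ ${se_rev:.2f}")
```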
3) Sampling distributions and why repeated samples would vary
If you could rerun the same experiment many times under identical conditions, you would get different observed metrics each time. The distribution of those possible observed metrics is the sampling distribution.
This idea is the backbone of inference:
- Your observed p̂ or ȳ is one draw from its sampling distribution.
- The sampling distribution has a center (around the true parameter) and a spread (the standard error).
Why this matters in A/B testing
- It explains why “A=10.2%, B=10.6%” might happen even if the true rates are equal.
- It provides a way to quantify how surprising an observed difference is under “no effect.”
Conceptual simulation (no code required)
Imagine Variant A and B truly have the same conversion probability p=0.10, and each gets n=1,000 users.
- Repeat 1: A gets 92 conversions (9.2%), B gets 108 (10.8%) → difference = +1.6 pp.
- Repeat 2: A gets 103 (10.3%), B gets 97 (9.7%) → difference = −0.6 pp.
- Repeat 3: A gets 101 (10.1%), B gets 100 (10.0%) → difference = −0.1 pp.
Across repeats, the differences bounce around zero. That bouncing is not “bad experimentation”; it is expected sampling variability.
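No code is needed to follow the idea, but if you want to see the bouncing for yourself, here is a minimal simulation sketch; the seed and draws are illustrative, so your exact numbers will differ from the repeats listed above:

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the sketch is reproducible
p_true, n = 0.10, 1_000               # same true conversion rate for both variants

for repeat in range(1, 4):
    x_a = rng.binomial(n, p_true)     # conversions in A for this repeat
    x_b = rng.binomial(n, p_true)     # conversions in B for this repeat
    diff_pp = (x_b - x_a) / n * 100   # observed difference in percentage points
    print(f"Repeat {repeat}: A = {x_a / n:.1%}, B = {x_b / n:.1%}, diff = {diff_pp:+.1f} pp")
```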
4) Difference in means/proportions as the core estimator for treatment effect
The core quantity in most A/B tests is the treatment effect estimator: the difference between the treatment and control metrics.
- Difference in proportions (binary metric): Δ̂ = p̂_B − p̂_A
- Difference in means (continuous metric): Δ̂ = ȳ_B − ȳ_A
This difference is your estimate of the true effect Δ (the underlying change in conversion probability or mean outcome caused by the treatment).
Standard error of a difference: uncertainty adds up
Because both A and B are noisy estimates, the difference is also noisy. A common approximation (when samples are independent) is:
- For means: SE(ȳ_B − ȳ_A) ≈ √(s_B²/n_B + s_A²/n_A)
- For proportions: SE(p̂_B − p̂_A) ≈ √(p̂_B(1-p̂_B)/n_B + p̂_A(1-p̂_A)/n_A)
Practical reading: if either group is small or highly variable, the difference will be uncertain.
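These approximations translate directly into small helper functions; the function names and example inputs below are illustrative choices, not part of any standard API:

```python
import math

def se_diff_proportions(p_a, n_a, p_b, n_b):
    """Approximate SE of p_hat_B - p_hat_A for independent samples."""
    return math.sqrt(p_b * (1 - p_b) / n_b + p_a * (1 - p_a) / n_a)

def se_diff_means(s_a, n_a, s_b, n_b):
    """Approximate SE of y_bar_B - y_bar_A for independent samples."""
    return math.sqrt(s_b ** 2 / n_b + s_a ** 2 / n_a)

# Illustrative inputs: 10% vs 12% conversion, 1,000 users per group.
print(se_diff_proportions(0.10, 1_000, 0.12, 1_000))  # ≈ 0.014
# Illustrative inputs: revenue SDs of $20 and $25, 5,000 users per group.
print(se_diff_means(20.0, 5_000, 25.0, 5_000))        # ≈ $0.45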
Lift is just a rescaling of the same estimator
Teams often report relative lift:
Lift = (p̂_B − p̂_A) / p̂_A
Lift can be useful for communication, but inference is usually driven by the absolute difference and its standard error. Relative lift can look dramatic when the baseline is small.
5) Practical interpretation of spread vs signal (numeric examples + small simulations)
Decision-making in A/B testing is largely about comparing:
- Signal: the estimated effect size (difference between variants).
- Spread: the uncertainty around that estimate (standard error; width of plausible values).
Example A: a visible difference that is still noisy
Variant A: n_A=1,000, x_A=100 → p̂_A=10.0%
Variant B: n_B=1,000, x_B=120 → p̂_B=12.0%
Estimated effect: Δ̂=+2.0 percentage points.
Compute uncertainty (approx.):
SE ≈ √(0.12×0.88/1000 + 0.10×0.90/1000) ≈ √(0.0001056 + 0.00009) ≈ 0.014 (1.4 percentage points)
Interpretation: the signal (2.0 pp) is only about 1.4× the spread. You should expect substantial run-to-run variation; the observed lift could plausibly shrink or even flip with another sample of 1,000 users per group.
Example B: tiny difference, huge sample
Variant A: n_A=200,000, p̂_A=10.00%
Variant B: n_B=200,000, p̂_B=10.20%
Estimated effect: Δ̂=+0.20 pp.
Approximate SE using p≈0.10:
SE ≈ √(0.10×0.90/200000 + 0.10×0.90/200000) = √(2×0.09/200000) ≈ 0.00095 (0.095 pp)
Interpretation: the signal (0.20 pp) is about 2.1× the spread. Even a small effect can be estimated precisely with enough data. Whether it is worth acting on depends on business impact and metric validity, not only statistical precision.
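Both examples reduce to the same calculation. Here is a small sketch that reports the estimated effect, its approximate SE, and the rough signal-to-noise ratio (the function name is just for illustration):

```python
import math

def signal_to_noise(p_a, n_a, p_b, n_b):
    """Return the estimated effect, its approximate SE, and |effect| / SE."""
    diff = p_b - p_a
    se = math.sqrt(p_b * (1 - p_b) / n_b + p_a * (1 - p_a) / n_a)
    return diff, se, abs(diff) / se

# Example A: visible but noisy.
print(signal_to_noise(0.10, 1_000, 0.12, 1_000))       # ≈ (0.020, 0.014, 1.4)
# Example B: tiny difference, huge sample.
print(signal_to_noise(0.10, 200_000, 0.102, 200_000))  # ≈ (0.002, 0.00095, 2.1)
```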
Conceptual simulation: “what differences are typical under no effect?”
You can sanity-check intuition by imagining a quick Monte Carlo exercise:
- Setup: Assume no true effect: p_A = p_B = 0.10, and choose a sample size (e.g., n = 2,000 per group).
- Repeat many times: For each repeat, draw conversions for A and B (like flipping biased coins), compute p̂_A, p̂_B, and Δ̂.
- Look at the distribution of Δ̂: You will see a bell-shaped cloud centered near 0, with most differences small and occasional larger swings.
Practical takeaway: if your observed Δ̂ sits in the “common” part of that cloud, it may be mostly noise; if it sits in the tails, it is less consistent with no effect.
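If you would rather run the exercise than imagine it, here is a minimal Monte Carlo sketch under the assumptions above (the seed and number of repeats are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, n, n_repeats = 0.10, 2_000, 10_000   # no true effect, 2,000 users per group

# Simulate many "no effect" experiments and collect the observed differences.
x_a = rng.binomial(n, p_true, size=n_repeats)
x_b = rng.binomial(n, p_true, size=n_repeats)
diffs = (x_b - x_a) / n

print(f"mean difference  ≈ {diffs.mean():+.4f}")                 # close to 0
print(f"SD of difference ≈ {diffs.std():.4f}")                   # ≈ sqrt(2×0.09/2000) ≈ 0.0095
print(f"95% of |diffs| below ≈ {np.quantile(np.abs(diffs), 0.95):.4f}")
```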
Spread vs signal checklist
| Question | What to compute/inspect | What it tells you |
|---|---|---|
| Is the estimate stable? | Standard error (or interval width) | How much the estimate would vary across repeats |
| Is the effect large relative to noise? | \|Δ̂\| / SE(Δ̂) | Rough “signal-to-noise” ratio |
| Is the metric inherently noisy? | Outcome variance / heavy tails | Whether you should expect slow learning |
| Will more data help meaningfully? | How SE scales with n | Whether additional sample size will materially tighten uncertainty |
6) Clarifying common misconceptions
Misconception: “More data always means better decisions”
More data reduces random error (variance) but does not fix systematic error (bias) or a poorly chosen metric.
- Bias example: If your tracking misses conversions for a subset of users (e.g., ad blockers or attribution gaps), increasing sample size makes the biased estimate more precise—precisely wrong.
- Metric validity example: If the metric is a proxy that can be gamed (e.g., clicks that don’t translate to value), more data can confidently push you toward optimizing the wrong behavior.
Practical step-by-step: before scaling sample size, verify (a) the metric measures what you intend, (b) instrumentation is consistent across variants, and (c) the unit/outcome definition matches the decision you want to make.
Misconception: “If the observed lift is positive, B is better”
A positive Δ̂ can occur by chance. The question is whether the lift is large relative to its uncertainty and whether it is practically meaningful.
Practical habit: always pair an effect estimate with an uncertainty summary (SE or interval) and sanity-check whether the implied business impact is material.
Misconception: “Variance is a nuisance; ignore it and focus on averages”
Variance determines how quickly you can learn. Two changes can have the same average lift but very different detectability depending on outcome spread.
- Low-variance metrics (e.g., conversion) often yield faster learning per user.
- High-variance metrics (e.g., revenue) may require much larger samples or alternative estimators/metric designs to stabilize measurement.
Misconception: “A single run tells the full story”
Because your result is one draw from a sampling distribution, it is normal for repeated runs (or different slices of traffic) to show variation. This does not automatically mean the experiment is flawed; it means uncertainty is real.
Practical step-by-step: when results feel fragile, check whether the estimated effect is small relative to SE, whether the metric is noisy, and whether sample sizes are balanced enough to keep SE(Δ̂) under control.
Misconception: “Precision equals correctness”
A narrow uncertainty band means low random error, not necessarily a correct causal interpretation or a useful business decision. Precision can coexist with:
- Bias (systematic measurement error)
- Misaligned metrics (optimizing a proxy)
- Non-representative samples (traffic not matching the rollout population)
The statistical building blocks in this chapter help you quantify uncertainty from randomness; they are one component of making reliable product and marketing decisions.