Power as “reliably detecting a meaningful effect”
In A/B testing, power is the probability that your experiment will detect a real effect of a chosen size (or larger) under your decision rule. Put plainly: if the true lift is actually worth caring about, power tells you how often your test will notice it instead of missing it due to noise.
Power is easiest to internalize with a repeated-world thought experiment:
- Assume the true effect is exactly your minimum detectable effect (MDE), the smallest lift you’d still act on.
- Run the same A/B test many times with the same sample size.
- Power is the fraction of those runs where you would correctly call a difference.
Common planning targets are 80% power (you detect the MDE 8 times out of 10) or 90% power (9 times out of 10). Higher power reduces missed wins, but costs more sample size and time.
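If you want to see power emerge from the repeated-world picture, the sketch below simulates many A/B tests in which the true lift equals the MDE and counts how often a simple two-proportion z-test detects it. The baseline, MDE, and sample size are illustrative numbers, not recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(p_base, mde_abs, n_per_variant, n_sims=2000, alpha=0.05):
    """Fraction of simulated A/B tests that detect a true lift equal to the MDE."""
    detections = 0
    for _ in range(n_sims):
        # Draw total conversions for control (p_base) and treatment (p_base + MDE).
        conv_a = rng.binomial(n_per_variant, p_base)
        conv_b = rng.binomial(n_per_variant, p_base + mde_abs)
        rate_a, rate_b = conv_a / n_per_variant, conv_b / n_per_variant
        # Two-sided two-proportion z-test with a pooled standard error.
        p_pool = (conv_a + conv_b) / (2 * n_per_variant)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_variant)
        p_value = 2 * stats.norm.sf(abs(rate_b - rate_a) / se)
        if p_value < alpha:
            detections += 1
    return detections / n_sims

# Illustrative: 5% baseline, +0.5 pp true lift, ~30,400 users per variant.
print(simulated_power(p_base=0.05, mde_abs=0.005, n_per_variant=30_400))
# Prints a value near 0.8, i.e., roughly 80% power at this sample size.
```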
Power vs. “confidence” in planning language
Teams often mix terms. For planning, you typically choose:
- Confidence level (how strict you are about false alarms; e.g., 95%).
- Power (how often you want to catch a real MDE; e.g., 80%).
- MDE (the smallest effect you care to detect).
Those three choices largely determine required sample size.
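As a rough illustration of how those choices combine, here is a minimal sketch of the standard normal-approximation sample size formula for comparing two conversion rates; the function name and the 4% baseline example are just placeholders.

```python
from scipy.stats import norm

def sample_size_per_variant(p_base, mde_abs, confidence=0.95, power=0.80):
    """Approximate per-variant sample size for a two-sided test on conversion rates."""
    alpha = 1 - confidence
    z_alpha = norm.ppf(1 - alpha / 2)   # strictness about false alarms
    z_power = norm.ppf(power)           # how often we want to catch a real MDE
    p_treat = p_base + mde_abs
    # Per-user Bernoulli variance in each group.
    var_sum = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * var_sum / mde_abs ** 2

# Example: 4% baseline, +0.4 pp absolute MDE, 95% confidence, 80% power.
print(round(sample_size_per_variant(0.04, 0.004)))  # ≈ 39,500 per variant
```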
What drives required sample size (and why)
Sample size is basically “how much averaging you need” to make signal stand out from noise. The main drivers are:
1) Baseline rate (for conversion metrics)
If your metric is a conversion rate, the baseline conversion probability p affects variability. A rough rule: conversion rates near 50% are the noisiest per user; very low or very high rates are less variable per user, but low baselines make realistic absolute changes tiny and therefore hard to detect.
For a binary conversion, the per-user variance is p(1-p). This term appears in most sample size formulas, which is why the baseline matters.
2) Variance (for continuous metrics)
For revenue, time-on-page, or other continuous outcomes, higher variance means you need more users to average out noise. If you can reduce variance (e.g., by better metric definition, trimming extreme outliers, or using variance reduction methods), you reduce required sample size.
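To make the variance point concrete, the sketch below compares the variance of a synthetic heavy-tailed revenue metric before and after capping extreme values at the 99th percentile; the lognormal distribution and the cap are arbitrary choices for illustration, and capping does change the metric definition slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-user revenue with a heavy right tail (illustrative only).
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=100_000)

raw_var = revenue.var()
capped_var = np.clip(revenue, None, np.quantile(revenue, 0.99)).var()

print(f"raw variance:    {raw_var:.1f}")
print(f"capped variance: {capped_var:.1f}")
# Required sample size scales roughly linearly with variance, so this ratio
# is also the approximate reduction in users needed for the same MDE.
print(f"approximate sample size reduction: {raw_var / capped_var:.1f}x")
```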
3) Minimum effect size (MDE): absolute vs relative
MDE is the effect you plan to be able to detect. Smaller MDEs require much larger samples. The most important intuition:
- Required sample size grows roughly with 1/(effect size)².
- Halving the MDE (e.g., from 2% relative lift to 1%) requires about 4× the sample size.
Be explicit whether your MDE is:
- Absolute: +0.2 percentage points (e.g., 3.0% → 3.2%).
- Relative: +6.7% lift (e.g., 3.0% → 3.2%).
For conversion rates, planning is easiest in absolute terms; translate to relative lift for business discussion.
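A small sketch of that bookkeeping: converting between absolute and relative MDEs for a 3% baseline (example numbers), plus the 1/Δ² scaling from the list above.

```python
def abs_to_rel(p_base, delta_abs):
    """Convert an absolute MDE (probability units) to a relative lift."""
    return delta_abs / p_base

def rel_to_abs(p_base, lift_rel):
    """Convert a relative lift to an absolute MDE."""
    return p_base * lift_rel

p = 0.03                                     # 3.0% baseline conversion
print(f"{abs_to_rel(p, 0.002):.1%}")         # +0.2 pp absolute ≈ 6.7% relative
print(f"{rel_to_abs(p, 0.067):.4f}")         # 6.7% relative ≈ 0.0020 absolute

# The 1/Δ² intuition: halving the absolute MDE quadruples required sample size.
print((0.002 / 0.001) ** 2)                  # 4.0
```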
4) Allocation (A/B split)
For a fixed total sample size, the most statistically efficient split is usually close to 50/50 because it minimizes the variance of the difference between groups. If you skew allocation (e.g., 90/10), you typically need more total users to achieve the same power.
Skewed splits can still be justified (risk management, ramping), but expect longer runtime for the same detectability.
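To see the allocation effect numerically, the sketch below compares the standard error of the estimated difference for a few splits while holding total traffic fixed; the 100,000-user total and 5% rate are arbitrary.

```python
import numpy as np

def se_of_difference(p, n_total, share_a):
    """Standard error of (rate_B - rate_A) for a given allocation share."""
    n_a = n_total * share_a
    n_b = n_total * (1 - share_a)
    # Bernoulli variance p(1-p) in each group (assuming a small true effect).
    return np.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))

p, n_total = 0.05, 100_000
for share in (0.5, 0.7, 0.9):
    se = se_of_difference(p, n_total, share)
    print(f"{int(share * 100)}/{int((1 - share) * 100)} split: SE ≈ {se:.5f}")
# 50/50 gives the smallest SE; 90/10 inflates it by about 67%, which means
# roughly 2.8x the total sample for the same power.
```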
5) Desired confidence and power targets
Stricter false-alarm control (higher confidence) and higher power both increase required sample size. Intuition: you’re demanding stronger evidence and fewer misses, so you need more data to satisfy both.
Rule-of-thumb summary table
| Change you make | Effect on required sample size | Why |
|---|---|---|
| Lower MDE | Increases sharply (~1/MDE²) | Harder to separate small signal from noise |
| Higher variance | Increases | More noise to average out |
| Baseline conversion near 50% | Increases (for conversion metrics) | Higher Bernoulli variance |
| More imbalanced split | Increases | Less information from the smaller group |
| Higher confidence / higher power | Increases | Stricter decision requirements |
Back-of-the-envelope runtime estimates (without a calculator)
You often need a fast “is this feasible?” estimate before doing a formal power calculation. The goal is not precision; it’s to avoid launching tests that can’t possibly answer the question in time.
A practical approximation for conversion rate tests
For a two-group A/B test on a conversion rate, a commonly used approximation for per-variant sample size is:
n_per_variant ≈ 16 * p(1-p) / Δ^2
Where:
- p = baseline conversion rate (as a probability, e.g., 0.05)
- Δ = absolute MDE (e.g., 0.005 for +0.5 percentage points)
This “16” constant roughly corresponds to planning around 95% confidence and ~80% power. It’s not exact, but it’s good for intuition and quick feasibility checks.
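Here is that approximation as a tiny helper; treat it as the quick rule it is, not a substitute for a proper power calculation.

```python
def quick_sample_size(p_base, delta_abs):
    """Per-variant sample size via the 16 * p(1-p) / Δ^2 rule of thumb
    (roughly 95% confidence and ~80% power)."""
    return 16 * p_base * (1 - p_base) / delta_abs ** 2

# Example: 5% baseline, +0.5 percentage points absolute MDE.
print(round(quick_sample_size(0.05, 0.005)))  # 30,400 per variant
```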
Step-by-step: estimate runtime from traffic
Inputs you need:
- Daily eligible users (or sessions) entering the experiment: T
- Allocation: typically 50/50, so each variant gets T/2 per day
- Baseline conversion rate p
- MDE in absolute terms Δ
Steps:
- Compute p(1-p).
- Compute n_per_variant ≈ 16 * p(1-p) / Δ^2.
- Compute days needed: days ≈ n_per_variant / (T/2).
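Those three steps fold into one small runtime estimator (a sketch assuming a 50/50 split and the quick rule above; the example numbers match the signup case that follows).

```python
def estimated_runtime_days(p_base, delta_abs, daily_traffic):
    """Days needed for a 50/50 A/B test, using the quick sample size rule."""
    n_per_variant = 16 * p_base * (1 - p_base) / delta_abs ** 2
    per_variant_per_day = daily_traffic / 2
    return n_per_variant / per_variant_per_day

# 4% baseline, +0.4 pp MDE, 100,000 eligible visitors/day.
print(f"{estimated_runtime_days(0.04, 0.004, 100_000):.2f} days")  # ≈ 0.77
```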
Worked example: signup conversion
Suppose:
- T = 100,000 eligible visitors/day
- Baseline p = 0.04 (4%)
- You care about Δ = 0.004 (0.4 percentage points; 4.0% → 4.4%)
Compute:
- p(1-p) = 0.04 * 0.96 = 0.0384
- n_per_variant ≈ 16 * 0.0384 / (0.004)^2 = 0.6144 / 0.000016 = 38,400
- Per-variant traffic/day = 100,000 / 2 = 50,000
- Runtime ≈ 38,400 / 50,000 ≈ 0.77 days
So this is feasible quickly (though in practice you may still run longer to cover weekday/weekend patterns and operational constraints).
Worked example: tiny lift on a low baseline
Suppose:
- T = 50,000 visitors/day
- p = 0.01 (1%)
- You want to detect Δ = 0.0005 (0.05 percentage points; 1.00% → 1.05%), which is a 5% relative lift
Compute:
- p(1-p) = 0.01 * 0.99 = 0.0099
- n_per_variant ≈ 16 * 0.0099 / (0.0005)^2 = 0.1584 / 0.00000025 = 633,600
- Per-variant traffic/day = 25,000
- Runtime ≈ 633,600 / 25,000 ≈ 25.3 days
This is the same core lesson: small absolute effects require a lot of data, even if the relative lift sounds meaningful.
Why very small effects demand huge samples (and what that means for decision speed)
The “square law” is the key intuition: because uncertainty shrinks roughly with the square root of sample size, detecting an effect that is 10× smaller requires about 100× more users.
Consequences for product and marketing decisions:
- Timeliness tradeoff: If you need an answer in 1 week, you may have to accept a larger MDE (only detect bigger wins/losses).
- Portfolio thinking: If you can only detect large effects quickly, prioritize experiments that plausibly create large effects (new value props, pricing, onboarding changes) over micro-optimizations.
- Decision thresholds: If the business value of a tiny lift is real (e.g., at massive scale), you still need to budget the time and opportunity cost to measure it.
Quick scaling rules you can use in meetings
- If you reduce MDE from 1.0 pp to 0.5 pp, expect ~4× the sample size and ~4× the runtime (if traffic is fixed).
- If traffic doubles, runtime halves (sample size requirement stays the same).
- If you move from 50/50 to 80/20 allocation, expect longer runtime (sometimes substantially) for the same MDE.
Practical pitfalls: underpowered vs overpowered tests
Underpowered tests: noisy outcomes and “random winners”
An underpowered test is one where the sample size is too small for the MDE you care about. Typical symptoms:
- Results swing dramatically day to day.
- Confidence intervals are wide relative to the business impact.
- You see many “promising” lifts that disappear on rerun.
Practical guidance:
- Before launching, write down the MDE you would act on and sanity-check runtime with a back-of-the-envelope calculation.
- If the estimated runtime is longer than you can tolerate, don’t “just run it anyway.” Instead, change something: increase traffic eligibility, simplify variants, accept a larger MDE, or pick a higher-signal metric.
Overpowered tests: statistically detectable but practically irrelevant differences
At very large scale, you can detect extremely tiny differences. The risk is not that the math is wrong; it’s that the organization overreacts to effects that are too small to matter (or are within the range of operational variability).
Common failure modes:
- Shipping complexity for a 0.02% relative lift that is smaller than normal week-to-week fluctuation.
- Arguing over “significant” differences that are economically negligible.
- Optimizing local metrics at the expense of maintainability or user experience.
Practical guidance:
- Use an action threshold: decide in advance the smallest effect worth shipping, not just worth detecting.
- Pair statistical detectability with a business impact check (e.g., expected incremental conversions or revenue per week).
- When you have massive samples, focus on effect size and operational cost, not just detectability.
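One way to operationalize that pairing is a tiny gate that checks the observed lift against a pre-registered action threshold and estimates its rough weekly value; the 0.1 pp threshold and $2-per-conversion figure below are made-up placeholders.

```python
def worth_shipping(observed_delta_abs, daily_traffic,
                   min_delta_abs=0.001, value_per_conversion=2.0):
    """Business-impact check: is the lift above the pre-set action threshold,
    and what is it roughly worth per week if rolled out to all traffic?"""
    weekly_value = observed_delta_abs * daily_traffic * 7 * value_per_conversion
    return observed_delta_abs >= min_delta_abs, weekly_value

# Example: a "significant" +0.02 pp lift measured on 500,000 users/day.
ship, value = worth_shipping(0.0002, 500_000)
print(ship, f"${value:,.0f}/week")  # False $1,400/week
```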
Scenario drills: estimate feasible detectable lift and decide if it’s worth running
Use these drills to practice the planning loop: (1) pick a time budget, (2) compute feasible sample size from traffic, (3) infer the MDE you can detect, (4) decide whether that MDE is worth it.
Drill 1: “We have 14 days—what lift can we detect?”
Situation:
- Eligible traffic: T = 20,000 users/day
- Baseline conversion: p = 0.05
- Time budget: 14 days
- Split: 50/50
Step 1: compute per-variant sample available
- Per-variant users/day = 10,000
- n_per_variant_available = 14 * 10,000 = 140,000
Step 2: invert the approximation to solve for Δ
From n ≈ 16 * p(1-p) / Δ^2, rearrange:
Δ ≈ sqrt(16 * p(1-p) / n)
Compute:
- p(1-p) = 0.05 * 0.95 = 0.0475
- Δ ≈ sqrt(16 * 0.0475 / 140,000) = sqrt(0.76 / 140,000) = sqrt(0.0000054286) ≈ 0.00233
Interpretation: In ~14 days, you can detect about 0.23 percentage points absolute (5.00% → 5.23%), which is about 4.6% relative.
Decision question: Would you ship for a ~4–5% relative lift? If yes, the test is feasible. If you only care about 1% relative lift, it’s likely not worth running under this time constraint.
Drill 2: “Is this micro-optimization measurable this month?”
Situation:
- T = 60,000 users/day
- p = 0.20
- You want to detect a 1% relative lift (20.00% → 20.20%)
- So Δ = 0.002 absolute
Step 1: estimate required per-variant sample
- p(1-p) = 0.2 * 0.8 = 0.16
- n_per_variant ≈ 16 * 0.16 / (0.002)^2 = 2.56 / 0.000004 = 640,000
Step 2: convert to runtime
- Per-variant users/day = 30,000
- Days ≈ 640,000 / 30,000 ≈ 21.3 days
Decision question: If you can afford ~3 weeks and the change is low-risk, it may be feasible. If you need an answer in under a week, you either need a larger MDE (bigger change) or more traffic.
Drill 3: “Low baseline, big relative lift—still hard?”
Situation:
T = 40,000users/dayp = 0.005(0.5%)- You want a 10% relative lift (0.50% → 0.55%)
Δ = 0.0005absolute
Estimate sample size:
- p(1-p) ≈ 0.005 * 0.995 ≈ 0.004975
- n_per_variant ≈ 16 * 0.004975 / (0.0005)^2 = 0.0796 / 0.00000025 ≈ 318,400
Runtime:
- Per-variant users/day = 20,000
- Days ≈ 318,400 / 20,000 ≈ 15.9 days
Interpretation: Even a “big” 10% relative lift can take ~2+ weeks when the baseline is very low, because the absolute change is tiny.
Drill 4: “Should we run this test at all?” (worth-it checklist)
Before you invest time:
- Feasibility: Using traffic and a time budget, what absolute Δ can you detect?
- Actionability: Would you ship for that Δ after accounting for engineering, design, and risk?
- Expected value: Roughly estimate upside: incremental_conversions/day ≈ T * Δ. If that number is small, the test may not justify the effort.
- Opportunity cost: Could you instead run a higher-impact experiment that targets a larger effect?
Template you can reuse (fill-in-the-blanks)
Given baseline p = ___, traffic T/day = ___, and time budget D days = ___ (50/50 split):
1) n_per_variant_available = D * (T/2)
2) Δ_feasible ≈ sqrt(16 * p(1-p) / n_per_variant_available)
3) Relative lift feasible ≈ Δ_feasible / p
4) Is that lift worth acting on? If not, change MDE, increase traffic, or pick a bigger intervention.
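A minimal sketch of that template as code, assuming a 50/50 split and the same 16 * p(1-p) approximation used throughout:

```python
import math

def feasible_lift(p_base, daily_traffic, days_budget):
    """Steps 1-3 of the template: the absolute and relative MDE you can
    detect given traffic and a time budget, assuming a 50/50 split."""
    n_available = days_budget * (daily_traffic / 2)                    # step 1
    delta_abs = math.sqrt(16 * p_base * (1 - p_base) / n_available)    # step 2
    return delta_abs, delta_abs / p_base                               # step 3

# Drill 1 numbers: 5% baseline, 20,000 users/day, 14-day budget.
delta, rel = feasible_lift(0.05, 20_000, 14)
print(f"Δ_feasible ≈ {delta:.4f} absolute (~{rel:.1%} relative)")  # ≈ 0.0023 (~4.7%)
# Step 4 is the judgment call: is that lift worth acting on?
```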