Sample Size Intuition for A/B Testing: Power, Detectable Effects, and Runtime Expectations

Chapter 8

Estimated reading time: 9 minutes


Power as “reliably detecting a meaningful effect”

In A/B testing, power is the probability that your experiment will flag a real effect of a chosen size (or larger) as “detectable” by your decision rule. Put plainly: if the true lift is actually worth caring about, power tells you how often your test will successfully notice it instead of missing it due to noise.

Power is easiest to internalize with a repeated-world thought experiment:

  • Assume the true effect is exactly your minimum detectable effect (MDE) (the smallest lift you’d still act on).
  • Run the same A/B test many times with the same sample size.
  • Power is the fraction of those runs where you would correctly call a difference.

Common planning targets are 80% power (you detect the MDE 8 times out of 10) or 90% power (9 times out of 10). Higher power reduces missed wins, but costs more sample size and time.

Power vs. “confidence” in planning language

Teams often mix terms. For planning, you typically choose:

  • Confidence level (how strict you are about false alarms; e.g., 95%).
  • Power (how often you want to catch a real MDE; e.g., 80%).
  • MDE (the smallest effect you care to detect).

Those three choices largely determine required sample size.
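
To see how those three choices turn into a concrete number, here is a minimal sketch using statsmodels (assuming it is installed); the 5% baseline and +0.5 pp MDE are illustrative placeholders, not values from a specific test:

```python
# Minimal sketch: confidence level, power, and MDE together determine per-variant sample size.
# Assumes statsmodels is available; the baseline and MDE values are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05     # baseline conversion rate
mde_abs = 0.005     # absolute MDE: detect 5.0% -> 5.5%
alpha = 0.05        # 95% confidence level (two-sided)
power = 0.80        # 80% power

effect = proportion_effectsize(baseline + mde_abs, baseline)  # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
)
print(round(n_per_variant))  # roughly 31,000 users per variant for these inputs
```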


What drives required sample size (and why)

Sample size is basically “how much averaging you need” to make signal stand out from noise. The main drivers are:

1) Baseline rate (for conversion metrics)

If your metric is a conversion rate, the baseline conversion probability p sets the variability. A rough rule: baselines near 50% are the noisiest per user, while very low or very high baselines are less variable per user. The catch is that at low baselines, any realistic absolute change is tiny, so detecting a meaningful lift can still require a lot of data.

For a binary conversion, the per-user variance is approximately p(1-p). This appears in most sample size formulas, so baseline matters.
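
A quick numeric check of that claim (a tiny sketch, not tied to any particular test):

```python
# Per-user variance of a binary conversion is p * (1 - p); it peaks at a 50% baseline.
for p in [0.01, 0.05, 0.20, 0.50, 0.80, 0.99]:
    print(f"baseline {p:.0%}: per-user variance {p * (1 - p):.4f}")
```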

2) Variance (for continuous metrics)

For revenue, time-on-page, or other continuous outcomes, higher variance means you need more users to average out noise. If you can reduce variance (e.g., by better metric definition, trimming extreme outliers, or using variance reduction methods), you reduce required sample size.
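
As a rough sketch (my own illustration, not a method prescribed in this chapter): for a continuous metric, a rule analogous to the conversion-rate approximation used later is n_per_variant ≈ 16 * σ² / Δ², where σ is the metric's standard deviation and the "16" comes from the same ~95% confidence / ~80% power planning explained below. The simulated revenue distribution and the 1% trim are made up purely to show how reducing variance shrinks the required sample:

```python
import numpy as np

# Analogous quick rule for a continuous metric: n_per_variant ≈ 16 * variance / MDE^2
def n_per_variant_continuous(sigma, delta):
    return 16 * sigma**2 / delta**2

rng = np.random.default_rng(0)
# Hypothetical revenue-per-user data: ~90% zeros plus a heavy right tail (illustrative only).
revenue = rng.exponential(scale=20.0, size=100_000) * (rng.random(100_000) < 0.10)

delta = 0.50                         # smallest average-revenue lift worth detecting
sigma_raw = revenue.std()
sigma_capped = np.clip(revenue, 0, np.quantile(revenue, 0.99)).std()  # winsorize the top 1%

print(round(n_per_variant_continuous(sigma_raw, delta)))     # larger n: more noise to average out
print(round(n_per_variant_continuous(sigma_capped, delta)))  # smaller n: variance was reduced
```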

3) Minimum effect size (MDE): absolute vs relative

MDE is the effect you plan to be able to detect. Smaller MDEs require much larger samples. The most important intuition:

  • Required sample size grows roughly with 1/(effect size)².
  • Halving the MDE (e.g., going from a 2% relative lift to a 1% relative lift) requires about 4× the sample size.

Be explicit whether your MDE is:

  • Absolute: +0.2 percentage points (e.g., 3.0% → 3.2%).
  • Relative: +6.7% lift (e.g., 3.0% → 3.2%).

For conversion rates, planning is easiest in absolute terms; translate the result to a relative lift afterward for business discussion.
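
A tiny helper (the function names are mine, purely illustrative) to keep the two framings consistent, using the 3.0% → 3.2% example above:

```python
def relative_lift(baseline, delta_abs):
    # e.g., baseline 0.030 with +0.002 absolute -> ~6.7% relative lift
    return delta_abs / baseline

def absolute_delta(baseline, lift_rel):
    # e.g., baseline 0.030 with a 6.7% relative lift -> ~+0.002 absolute
    return baseline * lift_rel

print(f"{relative_lift(0.030, 0.002):.1%}")   # 6.7%
print(f"{absolute_delta(0.030, 0.067):.4f}")  # 0.0020
```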

4) Allocation (A/B split)

For a fixed total sample size, the most statistically efficient split is usually close to 50/50 because it minimizes the variance of the difference between groups. If you skew allocation (e.g., 90/10), you typically need more total users to achieve the same power.

Skewed splits can still be justified (risk management, ramping), but expect longer runtime for the same detectability.
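
A back-of-the-envelope sketch of that penalty: the variance of the difference scales with 1/n_A + 1/n_B, which for a fixed total N and split fraction f works out to 1/(N * f * (1 - f)), so the total-sample penalty versus 50/50 is 0.25 / (f * (1 - f)):

```python
# Total-sample penalty of an imbalanced split, for the same power and MDE.
def total_sample_penalty(f):
    # Penalty relative to a 50/50 split: 0.25 / (f * (1 - f))
    return 0.25 / (f * (1 - f))

for f in [0.50, 0.80, 0.90, 0.95]:
    print(f"{f:.0%}/{1 - f:.0%} split -> ~{total_sample_penalty(f):.2f}x total users vs 50/50")
```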

5) Desired confidence and power targets

Stricter false-alarm control (higher confidence) and higher power both increase required sample size. Intuition: you’re demanding stronger evidence and fewer misses, so you need more data to satisfy both.

Rule-of-thumb summary table

Change you make | Effect on required sample size | Why
--- | --- | ---
Lower MDE | Increases sharply (~1/MDE²) | Harder to separate a small signal from noise
Higher variance | Increases | More noise to average out
Baseline conversion near 50% | Increases (for conversion metrics) | Higher Bernoulli variance
More imbalanced split | Increases | Less information from the smaller group
Higher confidence / higher power | Increases | Stricter decision requirements

Back-of-the-envelope runtime estimates (without a calculator)

You often need a fast “is this feasible?” estimate before doing a formal power calculation. The goal is not precision; it’s to avoid launching tests that can’t possibly answer the question in time.

A practical approximation for conversion rate tests

For a two-group A/B test on a conversion rate, a commonly used approximation for per-variant sample size is:

n_per_variant ≈ 16 * p(1-p) / (Δ)^2

Where:

  • p = baseline conversion rate (as a probability, e.g., 0.05)
  • Δ = absolute MDE (e.g., 0.005 for +0.5 percentage points)

This “16” constant roughly corresponds to planning around 95% confidence and ~80% power. It’s not exact, but it’s good for intuition and quick feasibility checks.
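
As a small sketch, the same approximation as a reusable function (the function name is mine):

```python
# Back-of-the-envelope per-variant sample size for a conversion-rate A/B test.
# The "16" roughly encodes 95% confidence and ~80% power (see the text above).
def n_per_variant(p, delta_abs):
    return 16 * p * (1 - p) / delta_abs**2

print(round(n_per_variant(0.05, 0.005)))  # ≈ 30,400 per variant for the example values above
```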

Step-by-step: estimate runtime from traffic

Inputs you need:

  • Daily eligible users (or sessions) entering the experiment: T
  • Allocation: typically 50/50, so each variant gets T/2 per day
  • Baseline conversion rate p
  • MDE in absolute terms Δ

Steps:

  1. Compute p(1-p).
  2. Compute n_per_variant ≈ 16 * p(1-p) / Δ^2.
  3. Compute days needed: days ≈ n_per_variant / (T/2).
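
The same three steps as a small self-contained sketch (names are mine, not from a library):

```python
# Steps 1–3: per-variant sample size from the quick rule, then runtime in days.
def runtime_estimate(p, delta_abs, daily_traffic, split=0.5):
    n_needed = 16 * p * (1 - p) / delta_abs**2      # steps 1 and 2
    per_variant_per_day = daily_traffic * split     # each arm's share of daily traffic
    days = n_needed / per_variant_per_day           # step 3
    return n_needed, days
```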

Worked example: signup conversion

Suppose:

  • T = 100,000 eligible visitors/day
  • Baseline p = 0.04 (4%)
  • You care about Δ = 0.004 (0.4 percentage points; 4.0% → 4.4%)

Compute:

  • p(1-p) = 0.04 * 0.96 = 0.0384
  • n_per_variant ≈ 16 * 0.0384 / (0.004)^2 = 0.6144 / 0.000016 ≈ 38,400
  • Per-variant traffic/day = 100,000/2 = 50,000
  • Runtime ≈ 38,400 / 50,000 ≈ 0.77 days

So this is feasible quickly (though in practice you may still run longer to cover weekday/weekend patterns and operational constraints).
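
Plugging these inputs into the steps (or the runtime_estimate sketch above) reproduces the estimate:

```python
n = 16 * 0.04 * 0.96 / 0.004**2   # ≈ 38,400 users per variant
days = n / (100_000 / 2)          # ≈ 0.77 days at 50,000 users per variant per day
print(round(n), round(days, 2))
```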

Worked example: tiny lift on a low baseline

Suppose:

  • T = 50,000 visitors/day
  • p = 0.01 (1%)
  • You want to detect Δ = 0.0005 (0.05 percentage points; 1.00% → 1.05%), which is a 5% relative lift

Compute:

  • p(1-p) = 0.01 * 0.99 = 0.0099
  • n_per_variant ≈ 16 * 0.0099 / (0.0005)^2 = 0.1584 / 0.00000025 ≈ 633,600
  • Per-variant traffic/day = 25,000
  • Runtime ≈ 633,600 / 25,000 ≈ 25.3 days

This is the same core lesson: small absolute effects require a lot of data, even if the relative lift sounds meaningful.

Why very small effects demand huge samples (and what that means for decision speed)

The “square law” is the key intuition: because uncertainty shrinks roughly with the square root of sample size, detecting an effect that is 10× smaller requires about 100× more users.

Consequences for product and marketing decisions:

  • Timeliness tradeoff: If you need an answer in 1 week, you may have to accept a larger MDE (only detect bigger wins/losses).
  • Portfolio thinking: If you can only detect large effects quickly, prioritize experiments that plausibly create large effects (new value props, pricing, onboarding changes) over micro-optimizations.
  • Decision thresholds: If the business value of a tiny lift is real (e.g., at massive scale), you still need to budget the time and opportunity cost to measure it.

Quick scaling rules you can use in meetings

  • If you reduce MDE from 1.0 pp to 0.5 pp, expect ~4× the sample size and ~4× the runtime (if traffic is fixed).
  • If traffic doubles, runtime halves (sample size requirement stays the same).
  • If you move from 50/50 to 80/20 allocation, expect longer runtime (sometimes substantially) for the same MDE.
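
These rules can be sanity-checked directly against the quick approximation (a small sketch; the 5% baseline and 10,000 users/day are placeholders):

```python
# Sanity-checking the scaling rules with the quick approximation n ≈ 16 * p(1-p) / Δ².
p, daily = 0.05, 10_000   # illustrative baseline and daily eligible traffic

def n_per_variant(delta_abs):
    return 16 * p * (1 - p) / delta_abs**2

# Halving the MDE (1.0 pp -> 0.5 pp) roughly quadruples sample size and runtime at fixed traffic.
print(n_per_variant(0.005) / n_per_variant(0.010))   # 4.0

# Doubling traffic halves runtime; the required sample size itself is unchanged.
def days_needed(traffic):
    return n_per_variant(0.005) / (traffic * 0.5)
print(days_needed(daily) / days_needed(2 * daily))   # 2.0

# 50/50 -> 80/20: total-sample (and runtime) penalty is 0.25 / (0.8 * 0.2) ≈ 1.56x.
print(0.25 / (0.8 * 0.2))                            # 1.5625
```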

Practical pitfalls: underpowered vs overpowered tests

Underpowered tests: noisy outcomes and “random winners”

An underpowered test is one where the sample size is too small for the MDE you care about. Typical symptoms:

  • Results swing dramatically day to day.
  • Confidence intervals are wide relative to the business impact.
  • You see many “promising” lifts that disappear on rerun.

Practical guidance:

  • Before launching, write down the MDE you would act on and sanity-check runtime with a back-of-the-envelope calculation.
  • If the estimated runtime is longer than you can tolerate, don’t “just run it anyway.” Instead, change something: increase traffic eligibility, simplify variants, accept a larger MDE, or pick a higher-signal metric.

Overpowered tests: statistically detectable but practically irrelevant differences

At very large scale, you can detect extremely tiny differences. The risk is not that the math is wrong; it’s that the organization overreacts to effects that are too small to matter (or are within the range of operational variability).

Common failure modes:

  • Shipping complexity for a 0.02% relative lift that is smaller than normal week-to-week fluctuation.
  • Arguing over “significant” differences that are economically negligible.
  • Optimizing local metrics at the expense of maintainability or user experience.

Practical guidance:

  • Use an action threshold: decide in advance the smallest effect worth shipping, not just worth detecting.
  • Pair statistical detectability with a business impact check (e.g., expected incremental conversions or revenue per week).
  • When you have massive samples, focus on effect size and operational cost, not just detectability.

Scenario drills: estimate feasible detectable lift and decide if it’s worth running

Use these drills to practice the planning loop: (1) pick a time budget, (2) compute feasible sample size from traffic, (3) infer the MDE you can detect, (4) decide whether that MDE is worth it.

Drill 1: “We have 14 days—what lift can we detect?”

Situation:

  • Eligible traffic: T = 20,000 users/day
  • Baseline conversion: p = 0.05
  • Time budget: 14 days
  • Split: 50/50

Step 1: compute per-variant sample available

  • Per-variant users/day = 10,000
  • n_per_variant_available = 14 * 10,000 = 140,000

Step 2: invert the approximation to solve for Δ

From n ≈ 16 * p(1-p) / Δ^2, rearrange:

Δ ≈ sqrt(16 * p(1-p) / n)

Compute:

  • p(1-p) = 0.05 * 0.95 = 0.0475
  • Δ ≈ sqrt(16 * 0.0475 / 140,000) = sqrt(0.76 / 140,000) = sqrt(0.0000054286) ≈ 0.00233

Interpretation: In ~14 days, you can detect about 0.23 percentage points absolute (5.00% → 5.23%), which is about 4.6% relative.

Decision question: Would you ship for a ~4–5% relative lift? If yes, the test is feasible. If you only care about 1% relative lift, it’s likely not worth running under this time constraint.
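
Drill 1 as a few lines of code (a sketch using the same inversion):

```python
import math

# Drill 1: given traffic and a 14-day budget, what absolute lift is detectable?
p, daily, days = 0.05, 20_000, 14
n_available = days * (daily / 2)                    # 140,000 users per variant
delta = math.sqrt(16 * p * (1 - p) / n_available)   # invert n ≈ 16 * p(1-p) / Δ²
print(f"{delta:.4f} absolute, {delta / p:.1%} relative")  # matches the ~0.23 pp / ~4–5% figures above
```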

Drill 2: “Is this micro-optimization measurable this month?”

Situation:

  • T = 60,000 users/day
  • p = 0.20
  • You want to detect a 1% relative lift (20.00% → 20.20%)
  • So Δ = 0.002 absolute

Step 1: estimate required per-variant sample

  • p(1-p) = 0.2 * 0.8 = 0.16
  • n_per_variant ≈ 16 * 0.16 / (0.002)^2 = 2.56 / 0.000004 = 640,000

Step 2: convert to runtime

  • Per-variant users/day = 30,000
  • Days ≈ 640,000 / 30,000 ≈ 21.3 days

Decision question: If you can afford ~3 weeks and the change is low-risk, it may be feasible. If you need an answer in under a week, you either need a larger MDE (bigger change) or more traffic.

Drill 3: “Low baseline, big relative lift—still hard?”

Situation:

  • T = 40,000 users/day
  • p = 0.005 (0.5%)
  • You want a 10% relative lift (0.50% → 0.55%)
  • Δ = 0.0005 absolute

Estimate sample size:

  • p(1-p) ≈ 0.005 * 0.995 ≈ 0.004975
  • n_per_variant ≈ 16 * 0.004975 / (0.0005)^2 = 0.0796 / 0.00000025 ≈ 318,400

Runtime:

  • Per-variant users/day = 20,000
  • Days ≈ 318,400 / 20,000 ≈ 15.9 days

Interpretation: Even a “big” 10% relative lift can take ~2+ weeks when the baseline is very low, because the absolute change is tiny.

Drill 4: “Should we run this test at all?” (worth-it checklist)

Before you invest time:

  • Feasibility: Using traffic and a time budget, what absolute Δ can you detect?
  • Actionability: Would you ship for that Δ after accounting for engineering, design, and risk?
  • Expected value: Roughly estimate upside: incremental_conversions/day ≈ T * Δ. If that number is small, the test may not justify the effort.
  • Opportunity cost: Could you instead run a higher-impact experiment that targets a larger effect?

Template you can reuse (fill-in-the-blanks)

Given baseline p = ___, traffic T/day = ___, and time budget D days = ___ (50/50 split):
1) n_per_variant_available = D * (T/2)
2) Δ_feasible ≈ sqrt(16 * p(1-p) / n_per_variant_available)
3) Relative lift feasible ≈ Δ_feasible / p
4) Is that lift worth acting on? If not, change MDE, increase traffic, or pick a bigger intervention.
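
The same template as a reusable sketch (function and argument names are mine); shown here with the Drill 1 inputs:

```python
import math

def feasibility(p, daily_traffic, days, split=0.5):
    n_available = days * daily_traffic * split                   # 1) per-variant sample available
    delta_feasible = math.sqrt(16 * p * (1 - p) / n_available)   # 2) smallest detectable absolute lift
    return delta_feasible, delta_feasible / p                    # 3) absolute and relative versions

delta_abs, lift_rel = feasibility(p=0.05, daily_traffic=20_000, days=14)
print(f"{delta_abs:.4f} absolute, {lift_rel:.1%} relative")      # 4) decide whether that lift is worth acting on
```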

Now answer the exercise about the content:

In A/B test planning, what is the key practical tradeoff when you choose a higher power target (e.g., 90% instead of 80%) while keeping the MDE and confidence level the same?


Answer: Higher power means you want to catch a real effect (like the MDE) more often. That usually requires more data, increasing the required sample size and often the runtime, holding confidence and MDE constant.

Next chapter

Variance Reduction Basics for A/B Testing: Getting Clearer Results Faster
