Variance Reduction Basics for A/B Testing: Getting Clearer Results Faster

Chapter 9

Estimated reading time: 8 minutes

1) Why variance reduction helps (and when it’s safe)

Variance reduction methods aim to reduce noise in your estimate without changing the true underlying treatment effect. In practice, that means you get:

  • Narrower confidence intervals for the same sample size, or
  • Higher power (more sensitivity) for the same runtime, or
  • Shorter runtime to reach the same precision.

Conceptually, many A/B outcomes are noisy because users differ a lot (baseline engagement, spend propensity, device constraints, geography, etc.). If you can explain some of that user-to-user variation using information that is not caused by the treatment, you can subtract it out and leave a cleaner comparison.

When variance reduction is safe

  • Safe: using covariates measured before randomization (or otherwise unaffected by treatment), like pre-period activity, device type, country, acquisition channel at signup.
  • Usually safe: stratifying/blocking on stable attributes (platform, geo) that exist at assignment time.
  • Not safe: using variables that can be influenced by treatment (post-assignment behavior) as “controls.” That can bias the estimate.

A helpful rule: variance reduction should change precision, not the estimand. If your method changes what effect you are estimating (e.g., conditioning on post-treatment engagement), you are no longer just reducing variance—you may be introducing bias.

2) Stratification and blocking: improving balance by design

Stratification (often called blocking) means you split traffic into meaningful groups (strata) and randomize within each group. This ensures treatment and control are balanced on key attributes that strongly affect the metric.

Why it reduces variance

If platform or geography drives large differences in outcomes, then random imbalances (e.g., slightly more iOS users in treatment) can add noise. Stratification forces balance within each stratum, reducing that source of variability.


Common strata in product and marketing tests

  • Platform: iOS / Android / Web
  • Geography: country, region, language
  • User type: new vs returning, free vs paid
  • Acquisition channel: paid search vs organic (if known at assignment)

Step-by-step: implementing stratified randomization

  1. Pick 1–3 high-impact attributes that are known at assignment time and strongly related to the outcome.
  2. Define strata (e.g., {iOS, Android, Web} × {US, non-US}). Keep the number of strata manageable.
  3. Within each stratum, randomize users into A/B at the desired split (e.g., 50/50).
  4. Analyze with stratum-aware aggregation: compute the treatment effect within each stratum, then combine using weights proportional to stratum size (or pre-specified weights).
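
A minimal Python sketch of steps 3 and 4, assuming pandas and NumPy; the column names (user_id, platform, purchases) and the toy data are hypothetical placeholders, not a prescribed schema:

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(42)

  def stratified_assign(df: pd.DataFrame, stratum_col: str) -> pd.Series:
      """Step 3: randomize to A/B within each stratum, forcing a near-exact
      50/50 split inside every stratum (permuted labels, not plain coin flips)."""
      variant = pd.Series(index=df.index, dtype=object)
      for _, idx in df.groupby(stratum_col).groups.items():
          labels = np.array(["treatment", "control"] * (len(idx) // 2 + 1))[: len(idx)]
          variant.loc[idx] = rng.permutation(labels)
      return variant

  def stratified_effect(df: pd.DataFrame, stratum_col: str, metric_col: str) -> float:
      """Step 4: per-stratum treatment-minus-control difference, combined
      with weights proportional to stratum size."""
      effect, total = 0.0, len(df)
      for _, g in df.groupby(stratum_col):
          diff = (g.loc[g["variant"] == "treatment", metric_col].mean()
                  - g.loc[g["variant"] == "control", metric_col].mean())
          effect += diff * len(g) / total
      return effect

  # Hypothetical usage with a toy user table
  df = pd.DataFrame({
      "user_id": range(8),
      "platform": ["iOS", "iOS", "Android", "Android", "Web", "Web", "Web", "Web"],
      "purchases": [2, 0, 1, 3, 0, 0, 5, 1],
  })
  df["variant"] = stratified_assign(df, stratum_col="platform")
  print(stratified_effect(df, stratum_col="platform", metric_col="purchases"))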

Practical notes

  • Too many strata can backfire: tiny strata create operational complexity and unstable within-stratum estimates.
  • Stratify on strong predictors: if an attribute barely relates to the metric, it won’t help much.
  • Blocking vs post-hoc adjustment: blocking improves balance by design; post-hoc adjustment tries to correct imbalance after the fact. Both can help, but blocking is conceptually cleaner.

3) Covariates and pre-period data: adjusting outcomes using baseline behavior

Covariate adjustment uses additional variables (covariates) to explain part of the outcome variation. The most common and powerful covariate is pre-period behavior: what the user did before the experiment started.

Intuition

Many metrics are strongly correlated over time. A user who spent a lot last week is more likely to spend this week. If you compare raw spend during the experiment, you’re mixing treatment effects with baseline differences. If you adjust for baseline spend, you remove predictable variation and isolate the incremental effect more precisely.

Example: baseline adjustment for a continuous metric

Suppose your outcome is Y = purchases during the experiment. You also have X = purchases in the 14 days before the experiment. If X and Y are correlated, adjusting for X reduces residual variance.

Step-by-step: practical covariate adjustment workflow

  1. Choose covariates that are pre-treatment (measured before assignment) and plausibly predictive of the outcome.
  2. Check correlation between covariate(s) and outcome. Higher correlation typically means more variance reduction.
  3. Fit an adjustment model (often a simple linear regression is enough): Y ~ Treatment + X (and optionally other pre-treatment covariates); a code sketch follows this list.
  4. Use the treatment coefficient as the adjusted treatment effect estimate.
  5. Pre-specify the adjustment in your analysis plan to avoid “shopping” for covariates.
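
A minimal sketch of steps 3 and 4, assuming pandas and statsmodels are available; the column names (y, x_pre, treatment) and the simulated numbers are purely illustrative:

  import numpy as np
  import pandas as pd
  import statsmodels.formula.api as smf

  # Hypothetical data: y = purchases during the experiment,
  # x_pre = purchases in the 14 days before, treatment = 0/1 assignment.
  rng = np.random.default_rng(0)
  n = 10_000
  x_pre = rng.gamma(shape=2.0, scale=5.0, size=n)
  treatment = rng.integers(0, 2, size=n)
  y = 0.6 * x_pre + 0.5 * treatment + rng.normal(0, 3, size=n)
  df = pd.DataFrame({"y": y, "x_pre": x_pre, "treatment": treatment})

  # Unadjusted: equivalent to a plain difference in means
  unadjusted = smf.ols("y ~ treatment", data=df).fit()

  # Adjusted: include the pre-period covariate
  adjusted = smf.ols("y ~ treatment + x_pre", data=df).fit()

  # The coefficient on treatment is the effect estimate; its standard error
  # is typically smaller in the adjusted model when x_pre predicts y.
  print(unadjusted.params["treatment"], unadjusted.bse["treatment"])
  print(adjusted.params["treatment"], adjusted.bse["treatment"])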

What covariates work well

  • Pre-period version of the same metric (best when available)
  • Stable user attributes: tenure, plan type, device class
  • Historical engagement: sessions, clicks, watch time in a pre-window

4) Ratio metrics: why they can be noisy, alternatives, and cautions

Ratio metrics are common in product and marketing (e.g., conversion rate = conversions / visitors, ARPU = revenue / users, CTR = clicks / impressions). Ratios are intuitive, but they can be noisy because both numerator and denominator vary, and because user-level denominators can be small or zero.

Why ratios can inflate variance

  • Small denominators create extreme values (e.g., revenue per session when sessions = 1).
  • Heavy tails in the numerator (e.g., a few big purchasers) can dominate.
  • Correlation between numerator and denominator complicates variance; naive approaches can be unstable.

Two common formulations (and why they differ)

  • Ratio of sums: (Σ numerator) / (Σ denominator). Often more stable; aligns with the “overall rate” across all traffic.
  • Mean of individual ratios: mean(numerator_i / denominator_i). Can be very noisy if denominators vary; weights users equally regardless of exposure.

These are not the same estimand. For example, mean(revenue_i / sessions_i) answers “average user’s revenue per session,” while Σrevenue / Σsessions answers “overall revenue per session across all sessions.” Choose the one that matches the business question.
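
A tiny numeric sketch of how far the two formulations can diverge; the revenue and session numbers are made up for illustration:

  import numpy as np

  # Hypothetical per-user data: revenue and sessions during the experiment
  revenue = np.array([0.0, 5.0, 0.0, 120.0, 2.0])
  sessions = np.array([1, 2, 4, 30, 1])

  # Ratio of sums: "overall revenue per session across all sessions"
  ratio_of_sums = revenue.sum() / sessions.sum()    # 127 / 38 = 3.34

  # Mean of individual ratios: "average user's revenue per session";
  # the 1-session users count as much as the 30-session user.
  mean_of_ratios = np.mean(revenue / sessions)      # mean(0, 2.5, 0, 4, 2) = 1.70

  print(ratio_of_sums, mean_of_ratios)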

Variance-reduction-friendly alternatives

  • Prefer ratio of sums when it matches the goal (more stable aggregation).
  • Model the numerator with the denominator as an offset/exposure (e.g., count models) when appropriate; see the sketch after this list.
  • Use regression adjustment: treat the numerator as outcome and include denominator (or exposure) as a covariate, if that aligns with the estimand.
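
A rough sketch of the offset idea, assuming statsmodels: a Poisson model for the numerator (clicks) with log(exposure) as an offset, so the coefficient on treatment is read on the rate scale. The variable names and simulated numbers are illustrative:

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm

  # Hypothetical data: clicks (numerator) and impressions (exposure)
  rng = np.random.default_rng(1)
  n = 5_000
  impressions = rng.integers(1, 50, size=n)
  treatment = rng.integers(0, 2, size=n)
  base_rate = 0.05 * np.exp(0.1 * treatment)   # treatment lifts the click rate ~10%
  clicks = rng.poisson(base_rate * impressions)

  X = sm.add_constant(pd.DataFrame({"treatment": treatment}))
  model = sm.GLM(clicks, X,
                 family=sm.families.Poisson(),
                 offset=np.log(impressions)).fit()

  # exp(coefficient) estimates the multiplicative effect on the click-through rate
  print(np.exp(model.params["treatment"]))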

Caution: don’t “fix” ratio noise by changing the question

Switching from a mean-of-ratios to a ratio-of-sums can reduce variance, but it also changes what you’re estimating. Treat that as a metric definition decision, not merely a statistical trick.

5) CUPED-style intuition: using pre-experiment signal to subtract noise

CUPED (Controlled-experiment Using Pre-Experiment Data) is a widely used variance reduction approach. At a high level, it creates an adjusted outcome by subtracting the part of the outcome that is predictable from pre-period behavior.

Core idea in one line

If X is a pre-period metric correlated with the experiment-period metric Y, define an adjusted metric:

Y_adjusted = Y - θ (X - mean(X))

This keeps the average outcome the same (because mean(X - mean(X)) = 0) but reduces variance when X and Y are correlated.
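
For a sense of scale: with the variance-minimizing choice θ = Cov(X, Y) / Var(X), the adjusted metric satisfies Var(Y_adjusted) = (1 - ρ²) · Var(Y), where ρ is the correlation between X and Y. A pre-period covariate with ρ = 0.7 therefore removes roughly half the variance.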

Step-by-step conceptual walk-through

  1. Pick a pre-period covariate X: ideally the same metric as Y measured before the experiment (e.g., pre-period revenue).
  2. Measure correlation: if users with higher X tend to have higher Y, then X explains some of the noise in Y.
  3. Estimate θ (how much Y moves with X): think of θ as the slope in a simple line predicting Y from X. In practice, θ is estimated from the data (commonly using pooled data across variants, since X is pre-treatment).
  4. Compute adjusted outcomes: for each user, subtract θ (X - mean(X)). Users above-average in baseline get adjusted downward; below-average get adjusted upward.
  5. Run the usual A/B comparison on Y_adjusted instead of Y. The treatment effect estimate targets the same underlying effect, but with lower variance.
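
A minimal sketch of this walk-through, assuming pandas and NumPy; the column names and simulated data are illustrative. θ is estimated on pooled data across variants, which is safe because X is pre-treatment:

  import numpy as np
  import pandas as pd

  def cuped_adjust(df: pd.DataFrame, y_col: str, x_col: str) -> pd.Series:
      """Return the adjusted outcome Y - theta * (X - mean(X)), with
      theta = Cov(X, Y) / Var(X) estimated on all users pooled."""
      x = df[x_col].to_numpy(dtype=float)
      y = df[y_col].to_numpy(dtype=float)
      theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
      return pd.Series(y - theta * (x - x.mean()), index=df.index)

  # Hypothetical data: pre-period revenue strongly predicts experiment revenue
  rng = np.random.default_rng(7)
  n = 20_000
  pre_rev = rng.gamma(2.0, 10.0, size=n)
  variant = rng.integers(0, 2, size=n)
  exp_rev = 0.8 * pre_rev + 1.0 * variant + rng.normal(0, 5, size=n)
  df = pd.DataFrame({"pre_rev": pre_rev, "exp_rev": exp_rev, "variant": variant})

  df["exp_rev_cuped"] = cuped_adjust(df, y_col="exp_rev", x_col="pre_rev")

  # Same A/B comparison, just on the adjusted metric: the difference in means
  # targets the same effect but comes with a smaller standard error.
  for col in ["exp_rev", "exp_rev_cuped"]:
      t = df.loc[df["variant"] == 1, col]
      c = df.loc[df["variant"] == 0, col]
      diff = t.mean() - c.mean()
      se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
      print(f"{col}: effect = {diff:.3f}, SE = {se:.3f}")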

Why it works (intuition)

Imagine Y is made of two parts: a predictable baseline component plus random noise plus any treatment effect. CUPED removes part of the predictable baseline component, leaving less leftover randomness to obscure the treatment effect.

Practical tips

  • Use a stable pre-period window (e.g., 1–4 weeks) that reflects typical behavior and is not contaminated by the experiment.
  • Expect bigger gains when correlation is high: if X barely predicts Y, CUPED won’t help much.
  • Pre-specify X and the window to avoid cherry-picking the best-looking adjustment.

6) Warnings: overfitting and leakage (post-treatment covariates)

Variance reduction is powerful, but it’s easy to misuse. Two common failure modes are overfitting and leakage.

Leakage: controlling for post-treatment variables

Leakage happens when you adjust for a variable that is affected by the treatment. This can remove part of the treatment effect (or introduce spurious effects), biasing the estimate.

  • Bad covariate example: “number of sessions during the experiment” as a covariate when the treatment changes engagement. Adjusting for it can mask the true effect on revenue.
  • Bad stratification example: blocking on a label assigned after exposure (e.g., “clicked new feature”)—this is post-treatment selection.

Rule: if the treatment can change it, don’t control for it.

Overfitting: too many covariates or flexible models

Adding many covariates (or using highly flexible models) can create instability, especially with smaller samples or sparse segments. Even when covariates are pre-treatment, an overly complex adjustment can:

  • Increase variance due to estimation noise in the adjustment model
  • Create sensitivity to outliers and rare categories
  • Encourage “analysis degrees of freedom” (trying many models until something looks good)

Practical safeguards

  • Keep adjustment simple: start with 1–3 strong pre-treatment covariates (often just pre-period Y).
  • Pre-register the adjustment: specify covariates, windows, and model form before looking at results.
  • Use the same adjustment for all variants: estimate adjustment parameters (such as θ) on pooled data, so the correction is identical for treatment and control rather than tuned to each group’s outcomes.
  • Audit covariates: for each candidate covariate, explicitly answer: “Can the treatment affect this variable?” If yes, exclude it.

Now answer the exercise about the content:

Which approach is NOT safe for variance reduction in an A/B test because it can bias the treatment effect estimate?

Variance reduction should improve precision without changing the estimand. Controlling for post-treatment variables can introduce leakage and remove part of the true treatment effect, biasing the estimate.

Next chapter

Common Traps in A/B Testing: Peeking, Optional Stopping, and Repeated Looks
