1) Why variance reduction helps (and when it’s safe)
Variance reduction methods aim to reduce noise in your estimate without changing the true underlying treatment effect. In practice, that means you get:
- Narrower confidence intervals for the same sample size, or
- Higher power (more sensitivity) for the same runtime, or
- Shorter runtime to reach the same precision.
Conceptually, many A/B outcomes are noisy because users differ a lot (baseline engagement, spend propensity, device constraints, geography, etc.). If you can explain some of that user-to-user variation using information that is not caused by the treatment, you can subtract it out and leave a cleaner comparison.
When variance reduction is safe
- Safe: using covariates measured before randomization (or otherwise unaffected by treatment), like pre-period activity, device type, country, acquisition channel at signup.
- Usually safe: stratifying/blocking on stable attributes (platform, geo) that exist at assignment time.
- Not safe: using variables that can be influenced by treatment (post-assignment behavior) as “controls.” That can bias the estimate.
A helpful rule: variance reduction should change precision, not the estimand. If your method changes what effect you are estimating (e.g., conditioning on post-treatment engagement), you are no longer just reducing variance—you may be introducing bias.
2) Stratification and blocking: improving balance by design
Stratification (often called blocking) means you split traffic into meaningful groups (strata) and randomize within each group. This ensures treatment and control are balanced on key attributes that strongly affect the metric.
Why it reduces variance
If platform or geography drives large differences in outcomes, then random imbalances (e.g., slightly more iOS users in treatment) can add noise. Stratification forces balance within each stratum, reducing that source of variability.
Common strata in product and marketing tests
- Platform: iOS / Android / Web
- Geography: country, region, language
- User type: new vs returning, free vs paid
- Acquisition channel: paid search vs organic (if known at assignment)
Step-by-step: implementing stratified randomization
- Pick 1–3 high-impact attributes that are known at assignment time and strongly related to the outcome.
- Define strata (e.g., {iOS, Android, Web} × {US, non-US}). Keep the number of strata manageable.
- Within each stratum, randomize users into A/B at the desired split (e.g., 50/50).
- Analyze with stratum-aware aggregation: compute the treatment effect within each stratum, then combine using weights proportional to stratum size (or pre-specified weights).
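A minimal sketch of steps 2–4 in Python with pandas and NumPy (an assumed stack; the column names `platform`, `us`, and `converted` and both helper functions are hypothetical):

```python
import numpy as np
import pandas as pd

def stratified_assign(users: pd.DataFrame, strata_cols, p_treat: float = 0.5, seed: int = 42) -> pd.DataFrame:
    """Randomize users to treatment independently within each stratum."""
    rng = np.random.default_rng(seed)
    users = users.copy()
    users["treated"] = False
    for _, idx in users.groupby(strata_cols).groups.items():
        users.loc[idx, "treated"] = rng.random(len(idx)) < p_treat
    return users

def stratified_effect(users: pd.DataFrame, strata_cols, outcome: str) -> float:
    """Treatment-control difference within each stratum, combined with
    weights proportional to stratum size."""
    effects, weights = [], []
    for _, g in users.groupby(strata_cols):
        diff = g.loc[g["treated"], outcome].mean() - g.loc[~g["treated"], outcome].mean()
        effects.append(diff)
        weights.append(len(g))
    return float(np.average(effects, weights=weights))

# Hypothetical usage with made-up data; strata = platform x US/non-US.
rng = np.random.default_rng(0)
users = pd.DataFrame({
    "platform": rng.choice(["ios", "android", "web"], size=5_000),
    "us": rng.random(5_000) < 0.6,
    "converted": rng.random(5_000) < 0.10,
})
users = stratified_assign(users, ["platform", "us"])
print(stratified_effect(users, ["platform", "us"], "converted"))
```

The weighted combination in `stratified_effect` keeps the overall estimand intact while preventing chance imbalance across strata from leaking into the comparison.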
Practical notes
- Too many strata can backfire: tiny strata create operational complexity and unstable within-stratum estimates.
- Stratify on strong predictors: if an attribute barely relates to the metric, it won’t help much.
- Blocking vs post-hoc adjustment: blocking improves balance by design; post-hoc adjustment tries to correct imbalance after the fact. Both can help, but blocking is conceptually cleaner.
3) Covariates and pre-period data: adjusting outcomes using baseline behavior
Covariate adjustment uses additional variables (covariates) to explain part of the outcome variation. The most common and powerful covariate is pre-period behavior: what the user did before the experiment started.
Intuition
Many metrics are strongly correlated over time. A user who spent a lot last week is more likely to spend this week. If you compare raw spend during the experiment, you’re mixing treatment effects with baseline differences. If you adjust for baseline spend, you remove predictable variation and isolate the incremental effect more precisely.
Example: baseline adjustment for a continuous metric
Suppose your outcome is Y = purchases during the experiment. You also have X = purchases in the 14 days before the experiment. If X and Y are correlated, adjusting for X reduces residual variance.
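To make this concrete, here is a small simulation (Python with NumPy; all numbers are made up) in which a persistent purchase propensity links pre-period purchases X to experiment-period purchases Y; adjusting Y by its regression on X shrinks the variance by roughly a factor of 1 − corr(X, Y)²:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical users whose purchase propensity persists over time, so
# pre-period purchases X are correlated with experiment-period purchases Y.
propensity = rng.gamma(shape=2.0, scale=2.0, size=n)
x = rng.poisson(propensity)   # purchases in the 14 days before the experiment
y = rng.poisson(propensity)   # purchases during the experiment

theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope of Y on X
y_adj = y - theta * (x - x.mean())               # baseline-adjusted outcome

rho = np.corrcoef(x, y)[0, 1]
print(f"corr(X, Y)      = {rho:.2f}")
print(f"var(Y)          = {y.var():.2f}")
print(f"var(Y_adjusted) = {y_adj.var():.2f}")    # roughly var(Y) * (1 - rho^2)
```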
Step-by-step: practical covariate adjustment workflow
- Choose covariates that are pre-treatment (measured before assignment) and plausibly predictive of the outcome.
- Check correlation between covariate(s) and outcome. Higher correlation typically means more variance reduction.
- Fit an adjustment model (often a simple linear regression is enough):
  Y ~ Treatment + X (and optionally other pre-treatment covariates); see the sketch after this list.
- Use the treatment coefficient as the adjusted treatment effect estimate.
- Pre-specify the adjustment in your analysis plan to avoid “shopping” for covariates.
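A minimal sketch of the fit step on synthetic data, using statsmodels' formula interface (one common choice, not the only one); the column names `y`, `treatment`, and `x_pre` are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000
x_pre = rng.poisson(3.0, size=n)            # pre-period purchases
treatment = rng.integers(0, 2, size=n)      # 50/50 random assignment
y = 0.9 * x_pre + 0.3 * treatment + rng.normal(0, 1.5, size=n)   # true effect = 0.3
df = pd.DataFrame({"y": y, "treatment": treatment, "x_pre": x_pre})

# Y ~ Treatment + X: the coefficient on `treatment` is the adjusted estimate.
adjusted = smf.ols("y ~ treatment + x_pre", data=df).fit()
unadjusted = smf.ols("y ~ treatment", data=df).fit()

print("unadjusted:", unadjusted.params["treatment"], "SE", unadjusted.bse["treatment"])
print("adjusted:  ", adjusted.params["treatment"], "SE", adjusted.bse["treatment"])
# Both estimates target the same effect; the adjusted one has a smaller SE.
```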
What covariates work well
- Pre-period version of the same metric (best when available)
- Stable user attributes: tenure, plan type, device class
- Historical engagement: sessions, clicks, watch time in a pre-window
4) Ratio metrics: why they can be noisy, alternatives, and cautions
Ratio metrics are common in product and marketing (e.g., conversion rate = conversions / visitors, ARPU = revenue / users, CTR = clicks / impressions). Ratios are intuitive, but they can be noisy because both numerator and denominator vary, and because user-level denominators can be small or zero.
Why ratios can inflate variance
- Small denominators create extreme values (e.g., revenue per session when sessions = 1).
- Heavy tails in the numerator (e.g., a few big purchasers) can dominate.
- Correlation between numerator and denominator complicates variance; naive approaches can be unstable.
Two common formulations (and why they differ)
| Formulation | Definition | Typical behavior |
|---|---|---|
| Ratio of sums | (Σ numerator) / (Σ denominator) | Often more stable; aligns with “overall rate” across traffic |
| Mean of individual ratios | mean(numerator_i / denominator_i) | Can be very noisy if denominators vary; weights users equally regardless of exposure |
These are not the same estimand. For example, mean(revenue_i / sessions_i) answers “average user’s revenue per session,” while Σrevenue / Σsessions answers “overall revenue per session across all sessions.” Choose the one that matches the business question.
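A tiny numeric sketch (Python, made-up numbers) of how the two formulations diverge when one heavy user dominates the totals:

```python
import numpy as np

# Hypothetical per-user data: revenue and session counts for five users.
revenue  = np.array([0.0, 5.0, 3.0, 120.0, 2.0])
sessions = np.array([1,   4,   2,   30,    1])

ratio_of_sums  = revenue.sum() / sessions.sum()   # overall revenue per session
mean_of_ratios = np.mean(revenue / sessions)      # average user's revenue per session

print(ratio_of_sums)    # weights every session equally, so heavy users count more
print(mean_of_ratios)   # weights every user equally, so small denominators add noise
```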
Variance-reduction-friendly alternatives
- Prefer ratio of sums when it matches the goal (more stable aggregation).
- Model the numerator with the denominator as an offset/exposure (e.g., count models) when appropriate; see the sketch after this list.
- Use regression adjustment: treat the numerator as outcome and include denominator (or exposure) as a covariate, if that aligns with the estimand.
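For the offset/exposure bullet above, a minimal sketch using a Poisson GLM in statsmodels on synthetic data (one possible modeling choice; the column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5_000
sessions = rng.integers(1, 20, size=n)          # exposure (denominator)
treatment = rng.integers(0, 2, size=n)
rate = 0.10 * np.exp(0.05 * treatment)          # true clicks-per-session rate
clicks = rng.poisson(rate * sessions)           # numerator
df = pd.DataFrame({"clicks": clicks, "treatment": treatment, "sessions": sessions})

# Model the numerator with sessions as exposure (the log link adds log(sessions)
# as an offset); exp(coef on treatment) is the lift in clicks per session.
model = smf.glm("clicks ~ treatment", data=df,
                family=sm.families.Poisson(),
                exposure=df["sessions"]).fit()
print(np.exp(model.params["treatment"]))        # close to exp(0.05) ≈ 1.05
```

Note that this targets a per-session (ratio-of-sums style) estimand, so it should only be used when that matches the business question.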
Caution: don’t “fix” ratio noise by changing the question
Switching from a mean-of-ratios to a ratio-of-sums can reduce variance, but it also changes what you’re estimating. Treat that as a metric definition decision, not merely a statistical trick.
5) CUPED-style intuition: using pre-experiment signal to subtract noise
CUPED (Controlled-experiment Using Pre-Experiment Data) is a widely used variance reduction approach. At a high level, it creates an adjusted outcome by subtracting the part of the outcome that is predictable from pre-period behavior.
Core idea in one line
If X is a pre-period metric correlated with the experiment-period metric Y, define an adjusted metric:
Y_adjusted = Y - θ (X - mean(X))
This keeps the average outcome the same (because the mean of X - mean(X) is zero) but reduces variance when X and Y are correlated.
Step-by-step conceptual walk-through
- Pick a pre-period covariate X: ideally the same metric as Y measured before the experiment (e.g., pre-period revenue).
- Measure correlation: if users with higher X tend to have higher Y, then X explains some of the noise in Y.
- Estimate θ (how much Y moves with X): think of θ as the slope in a simple line predicting Y from X. In practice, θ is estimated from the data (commonly using pooled data across variants, since X is pre-treatment).
- Compute adjusted outcomes: for each user, subtract θ (X - mean(X)). Users above the baseline average get adjusted downward; users below it get adjusted upward.
- Run the usual A/B comparison on Y_adjusted instead of Y. The treatment effect estimate targets the same underlying effect, but with lower variance; a sketch follows this list.
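A minimal end-to-end sketch (Python with NumPy, simulated data): estimate θ on pooled data, build Y_adjusted, and compare the raw and CUPED comparisons:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical data: a persistent spend level links pre-period revenue X
# to experiment-period revenue Y; the true treatment lift is 1.0.
baseline = rng.gamma(2.0, 10.0, size=n)
x = baseline + rng.normal(0, 5, size=n)            # pre-period revenue
treated = rng.integers(0, 2, size=n).astype(bool)
y = baseline + rng.normal(0, 5, size=n) + 1.0 * treated

# Theta from pooled data: safe because X is measured before assignment.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())

def diff_and_se(outcome, treated):
    """Treatment-control difference in means and its standard error."""
    t, c = outcome[treated], outcome[~treated]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return diff, se

print("raw:  ", diff_and_se(y, treated))
print("CUPED:", diff_and_se(y_cuped, treated))     # same effect, smaller SE
```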
Why it works (intuition)
Imagine Y is made of three parts: a predictable baseline component, random noise, and any treatment effect. CUPED removes part of the predictable baseline component, leaving less leftover randomness to obscure the treatment effect.
Practical tips
- Use a stable pre-period window (e.g., 1–4 weeks) that reflects typical behavior and is not contaminated by the experiment.
- Expect bigger gains when correlation is high: if X barely predicts Y, CUPED won't help much.
- Pre-specify X and the window to avoid cherry-picking the best-looking adjustment.
6) Warnings: overfitting and leakage (post-treatment covariates)
Variance reduction is powerful, but it’s easy to misuse. Two common failure modes are overfitting and leakage.
Leakage: controlling for post-treatment variables
Leakage happens when you adjust for a variable that is affected by the treatment. This can remove part of the treatment effect (or introduce spurious effects), biasing the estimate.
- Bad covariate example: “number of sessions during the experiment” as a covariate when the treatment changes engagement. Adjusting for it can mask the true effect on revenue.
- Bad stratification example: blocking on a label assigned after exposure (e.g., “clicked new feature”)—this is post-treatment selection.
Rule: if the treatment can change it, don’t control for it.
Overfitting: too many covariates or flexible models
Adding many covariates (or using highly flexible models) can create instability, especially with smaller samples or sparse segments. Even when covariates are pre-treatment, an overly complex adjustment can:
- Increase variance due to estimation noise in the adjustment model
- Create sensitivity to outliers and rare categories
- Encourage “analysis degrees of freedom” (trying many models until something looks good)
Practical safeguards
- Keep adjustment simple: start with 1–3 strong pre-treatment covariates (often just pre-period Y).
- Pre-register the adjustment: specify covariates, windows, and model form before looking at results.
- Use the same adjustment for all variants: estimate adjustment parameters in one consistent way for every group (e.g., on pooled data), rather than in a way that depends on each group's outcomes.
- Audit covariates: for each candidate covariate, explicitly answer: “Can the treatment affect this variable?” If yes, exclude it.