Hypothesis testing as a decision procedure
In A/B testing, a hypothesis test is not a “truth detector”; it is a decision rule for what you will do (ship, iterate, stop) under uncertainty. You choose an acceptable chance of making a wrong decision, then use the observed data to decide whether the evidence is strong enough to act.
Think of it as a controlled way to answer: “Is the observed lift large enough, and measured precisely enough, that we should treat it as real for decision-making?”
(1) Null and alternative hypotheses in business language
Translate statistical statements into product/marketing decisions
Write hypotheses in terms of the metric you care about (conversion rate, revenue per visitor, retention, etc.) and the direction that matters.
- Null hypothesis (H0): “The change does not improve the metric” (often: no difference). Business meaning: assume the new variant is not better until evidence says otherwise.
- Alternative hypothesis (H1): “The change improves the metric” (or “is different”). Business meaning: we will ship if evidence supports improvement.
One-sided vs two-sided in business terms
- One-sided (directional): H1 says “B is better than A.” Use when only improvement would trigger action and a decrease would be treated similarly to “no ship.”
- Two-sided (non-directional): H1 says “B is different from A.” Use when either an increase or decrease matters (e.g., risk monitoring, guardrails, or when you genuinely would act on either direction).
Example (conversion rate):
- H0: p_B - p_A = 0 (no change)
- H1 (one-sided): p_B - p_A > 0 (B improves conversion)
Example (cost per acquisition, where lower is better):
- H0: CPA_B - CPA_A = 0
- H1 (one-sided): CPA_B - CPA_A < 0
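As a minimal sketch, the conversion-rate hypotheses map directly onto a one-sided two-proportion z-test. The function name and the sample counts below are illustrative, not part of any fixed API:

```python
import math
from scipy.stats import norm

def one_sided_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test.
    H0: p_B - p_A = 0  vs  H1: p_B - p_A > 0.
    conv_* are conversion counts, n_* are visitor counts (illustrative)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return norm.sf(z)                                        # one-sided p-value

# e.g., 500/10,000 conversions for A vs 560/10,000 for B
print(one_sided_z_test(500, 10_000, 560, 10_000))
```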
(2) Type I and Type II errors and how they map to product risk
Every hypothesis test trades off two kinds of mistakes. The key is to map them to business consequences.
Type I error (false positive)
Definition: You ship (reject H0) even though the change does not truly help (or is worse).
Probability control: The significance level alpha is the maximum long-run rate of Type I errors you are willing to accept (e.g., 5%).
Product/marketing risk mapping:
- Shipping a change that hurts revenue or retention
- Rolling out a misleading “win” that later regresses
- Wasting engineering and opportunity cost on a non-improvement
Type II error (false negative)
Definition: You do not ship (fail to reject H0) even though the change truly helps.
Probability control: beta is the Type II error rate; power = 1 - beta is the chance you detect a real effect of the size you care about.
Product/marketing risk mapping:
- Missing a real uplift (lost revenue, slower growth)
- Discarding a good idea and moving on prematurely
- Under-investing in improvements because “tests never win”
Choosing which error is more costly
Different contexts imply different tolerances:
- High downside risk (payments, pricing, trust/safety): prioritize avoiding Type I errors → consider a smaller alpha and stronger decision thresholds.
- High upside/low downside (small UI tweaks with guardrails): you may accept a higher Type I rate to move faster, but compensate with rollout monitoring.
| Decision | Reality: no improvement | Reality: improvement exists |
|---|---|---|
| Ship / declare win | Type I error (false positive) | Correct decision |
| Don’t ship / no win | Correct decision | Type II error (false negative) |
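To make the table concrete, a quick Monte Carlo sketch (assuming a two-sided z-test; all names and numbers are illustrative) shows that alpha is the long-run false-positive rate when H0 is true, and power is the detection rate when a real lift exists:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rejection_rate(p_a, p_b, n=10_000, alpha=0.05, trials=2_000):
    """How often does a two-sided z-test reject H0 across simulated tests?
    With p_a == p_b this estimates the Type I rate (~alpha);
    with p_b > p_a it estimates power (1 - beta)."""
    a = rng.binomial(n, p_a, trials)                  # conversions in A
    b = rng.binomial(n, p_b, trials)                  # conversions in B
    pool = (a + b) / (2 * n)                          # pooled rate under H0
    se = np.sqrt(pool * (1 - pool) * (2 / n))
    z = ((b - a) / n) / se
    return np.mean(2 * norm.sf(np.abs(z)) < alpha)

print(rejection_rate(0.05, 0.05))    # ~0.05: Type I error rate
print(rejection_rate(0.05, 0.055))   # power against a +0.5pp lift
```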
(3) Relationship between p-values and confidence intervals
A p-value and a confidence interval are two views of the same evidence, as long as they are built from the same model and assumptions.
What a p-value is (in decision terms)
The p-value answers: if there were truly no difference (H0), how likely would it be to see results at least as extreme as the ones observed? A small p-value means the observed data would be unlikely under H0, so you may choose to reject H0.
How confidence intervals connect to hypothesis tests
For many common A/B test setups:
- A two-sided hypothesis test at level alpha rejects H0 (no difference) iff the (1 − alpha) confidence interval for the effect does not include 0.
- A one-sided test at level alpha corresponds to checking whether a (1 − 2·alpha) two-sided interval excludes 0 in the favorable direction, or using a one-sided interval bound.
Practical interpretation: the interval tells you both (a) whether “no effect” is plausible and (b) the range of effects consistent with the data. The p-value compresses that into a single number for the “is there evidence against H0?” question.
Step-by-step: using the interval to make the same call as a test
- Compute the estimated effect (e.g., lift = p_B - p_A).
- Compute the confidence interval for that effect at the chosen level (e.g., 95%).
- If the interval excludes 0, the result is “statistically significant” at the matching alpha (two-sided).
- Use the interval endpoints to assess whether the effect is large enough to matter (next section).
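A small sketch of these steps for a conversion lift, using a Wald interval as one common approximation (the counts and function name are made up for illustration):

```python
import math
from scipy.stats import norm

def lift_ci(conv_a, n_a, conv_b, n_b, level=0.95):
    """Wald confidence interval for lift = p_B - p_A (a common approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - level) / 2)               # e.g., ~1.96 for 95%
    lift = p_b - p_a
    return lift - z * se, lift + z * se

lo, hi = lift_ci(500, 10_000, 560, 10_000)
print(f"95% CI for lift: [{lo:+.4f}, {hi:+.4f}]")
print("significant at alpha=0.05 (two-sided):", lo > 0 or hi < 0)
```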
(4) Significance vs practical importance: avoiding “statistically significant but trivial”
Statistical significance answers “is it likely non-zero?” Practical importance answers “is it worth doing?” You need both.
Why trivial effects become significant
With large sample sizes, even tiny differences can produce very small p-values. That does not mean the change is valuable.
Define a minimum effect that matters
Before running the test, define a minimum practical effect (sometimes called a minimum detectable effect, or MDE, in planning, or a business threshold in decision rules): the smallest lift that justifies shipping given costs, risks, and opportunity cost.
Examples of practical thresholds:
- Conversion rate: “We only ship if lift is at least +0.3 percentage points.”
- Revenue per visitor: “We need at least +$0.02 per visitor to cover increased incentives.”
- Retention: “We need at least +0.5 percentage points of absolute day-7 retention to justify the added complexity.”
Use the confidence interval to enforce practical importance
Instead of asking only whether 0 is excluded, ask whether the interval supports effects above your threshold.
- Weak practical evidence: interval excludes 0 but includes values below the threshold (could be real but too small).
- Strong practical evidence: interval is entirely above the threshold (consistent with meaningful uplift).
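One way to encode this check in code; the labels mirror the two cases above, and the threshold and interval values are hypothetical:

```python
def practical_evidence(ci_low, threshold):
    """Classify evidence from the CI's lower bound against a practical threshold.
    Illustrative sketch; labels follow the two cases above."""
    if ci_low > threshold:
        return "strong practical evidence (interval entirely above threshold)"
    if ci_low > 0:
        return "weak practical evidence (significant, but could be too small)"
    return "not significant (interval includes 0)"

# e.g., CI lower bound +0.01pp vs a +0.3pp threshold (as proportions)
print(practical_evidence(0.0001, 0.003))   # weak: excludes 0, below threshold
```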
(5) Setting a clear decision rule before looking at results
A good decision rule is explicit, pre-committed, and ties statistical evidence to business action. This reduces “moving the goalposts” after seeing the data.
A template decision rule
Define:
- Primary metric (what you optimize)
- Direction (increase/decrease)
- Alpha (Type I tolerance)
- Practical threshold (minimum effect worth shipping)
- Guardrails (metrics that must not degrade beyond a limit)
Example rule (ship if meaningful and safe):
- Ship variant B if: (a) the estimated effect on conversion is > +0.3pp, and (b) the 95% confidence interval for p_B - p_A is entirely above 0, and (c) guardrail metrics (e.g., refund rate, latency) do not worsen beyond pre-set limits.
Step-by-step checklist (pre-results)
- Write H0/H1 in business language (what “no improvement” means).
- Pick alpha based on downside risk of a false win.
- Set the practical threshold (minimum worthwhile effect).
- Choose the evidence criterion: p-value cutoff and/or interval rule.
- Write the ship/no-ship rule in one sentence.
- Decide what you will do in ambiguous cases (e.g., iterate, run longer, or treat as no-ship).
Ambiguous-case policy examples:
- If interval includes 0: do not ship; consider redesign or more data only if the upper bound is promising.
- If interval excludes 0 but does not clear the practical threshold: treat as “not worth shipping,” unless there are strategic reasons (e.g., reduces support cost) captured by another metric.
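Pulled together, the template rule plus the ambiguous-case policy can be pre-registered as a single function. This is a sketch under the assumptions above; guardrails are omitted and the names and labels are placeholders you would fix before the test:

```python
def decision(ci_low, ci_high, point, threshold):
    """Pre-registered rule: template ship condition + ambiguous-case policy."""
    if ci_low > 0 and point > threshold:
        return "ship"
    if ci_low > 0:
        return "no-ship: real but likely below the practical threshold"
    if ci_high > threshold:
        return "no-ship: consider more data (upper bound is promising)"
    return "no-ship"

print(decision(ci_low=0.0005, ci_high=0.0065, point=0.0035, threshold=0.003))
```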
(6) Examples: how different alpha levels change conclusions (and why it matters)
Example A: same data, different alpha → different decision
Suppose your test yields a p-value of 0.04 for a two-sided test of “no difference.”
- With alpha = 0.05: 0.04 < 0.05 → reject H0 → “statistically significant.”
- With alpha = 0.01: 0.04 > 0.01 → fail to reject H0 → “not statistically significant.”
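The same comparison in code, with the p-value taken from the example above:

```python
p_value = 0.04
for alpha in (0.05, 0.01):
    verdict = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {verdict}")
# alpha = 0.05: reject H0
# alpha = 0.01: fail to reject H0
```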
Why this matters: choosing a stricter alpha reduces false wins but increases the chance you miss real improvements. If shipping a false win is costly (e.g., pricing), you may require alpha = 0.01. If iteration speed is critical and downside is limited, 0.05 might be acceptable.
Example B: confidence interval view of the same situation
Assume the estimated lift is +0.20pp and the 95% confidence interval is [+0.01pp, +0.39pp].
- Because the interval excludes 0, it aligns with “significant at 0.05” (two-sided).
- But if your practical threshold is +0.30pp, the interval includes values below that threshold, so the data do not strongly support a practically meaningful lift.
Decision implication: you might choose “do not ship” even though the result is statistically significant, because the likely lift could be too small to justify the change.
Example C: stricter alpha can prevent a risky rollout
Imagine a high-risk change (e.g., checkout flow). You set alpha = 0.01. The test yields p-value 0.02 and an interval that barely excludes 0 at 95% but would include 0 at 99%.
- At alpha = 0.05, you might ship.
- At alpha = 0.01, you do not ship because the evidence is not strong enough given the downside risk.
Operational takeaway: alpha is a policy choice about risk tolerance, not a universal constant.
Example D: how alpha interacts with a “ship if effect > threshold” rule
Suppose your rule is: ship if the effect is > +0.3pp and the interval excludes 0. You observe:
- Estimated lift: +0.35pp
- 95% CI: [+0.05pp, +0.65pp]
- 99% CI: [−0.02pp, +0.72pp]
Then:
- Under a 95% rule (roughly alpha = 0.05, two-sided), you pass “excludes 0” and the point estimate clears the threshold → likely ship (assuming guardrails pass).
- Under a 99% rule (roughly alpha = 0.01), the interval includes 0 → you would not ship, even though the point estimate is above the threshold.
Why it matters: stricter alpha demands more certainty that the improvement is real, which is appropriate when the cost of a false win is high. If you want to keep the same alpha but enforce practical importance more strongly, you can instead require the interval to be entirely above the practical threshold (not just above 0).
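In code, the three rules mentioned in this example differ only in which bound is compared to what. The numbers come from the example above, expressed as proportions:

```python
lift, threshold = 0.0035, 0.003       # +0.35pp estimate, +0.3pp threshold
ci95 = (0.0005, 0.0065)               # [+0.05pp, +0.65pp]
ci99 = (-0.0002, 0.0072)              # [-0.02pp, +0.72pp]

print(ci95[0] > 0 and lift > threshold)  # True  -> ship under the 95% rule
print(ci99[0] > 0 and lift > threshold)  # False -> 99% interval includes 0
print(ci95[0] > threshold)               # False -> stricter practical-importance rule
```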
```python
# Example pre-registered decision rule (two-sided alpha = 0.05)
def ship(ci_lower_95, point_estimate, practical_threshold, guardrails_ok):
    # Ship only if the effect is credibly positive, big enough to matter, and safe.
    return ci_lower_95 > 0 and point_estimate > practical_threshold and guardrails_ok
```
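For example, plugging in the Example D numbers (as proportions) with guardrails passing:

```python
print(ship(ci_lower_95=0.0005, point_estimate=0.0035,
           practical_threshold=0.003, guardrails_ok=True))   # True -> ship
```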