Hypothesis Tests in A/B Testing: p-values, Error Rates, and Decision Rules

Chapter 7

Estimated reading time: 9 minutes

Hypothesis testing as a decision procedure

In A/B testing, a hypothesis test is not a “truth detector”; it is a decision rule for what you will do (ship, iterate, stop) under uncertainty. You choose an acceptable chance of making a wrong decision, then use the observed data to decide whether the evidence is strong enough to act.

Think of it as a controlled way to answer: “Is the observed lift large enough, and measured precisely enough, that we should treat it as real for decision-making?”

(1) Null and alternative hypotheses in business language

Translate statistical statements into product/marketing decisions

Write hypotheses in terms of the metric you care about (conversion rate, revenue per visitor, retention, etc.) and the direction that matters.

  • Null hypothesis (H0): “The change does not improve the metric” (often: no difference). Business meaning: assume the new variant is not better until evidence says otherwise.
  • Alternative hypothesis (H1): “The change improves the metric” (or “is different”). Business meaning: we will ship if evidence supports improvement.

One-sided vs two-sided in business terms

  • One-sided (directional): H1 says “B is better than A.” Use when only an improvement would trigger action, and a decrease would be handled the same way as “no ship.”
  • Two-sided (non-directional): H1 says “B is different from A.” Use when either an increase or decrease matters (e.g., risk monitoring, guardrails, or when you genuinely would act on either direction).

Example (conversion rate):

  • H0: p_B - p_A = 0 (no change)
  • H1 (one-sided): p_B - p_A > 0 (B improves conversion)

Example (cost per acquisition, where lower is better):

  • H0: CPA_B - CPA_A = 0
  • H1 (one-sided): CPA_B - CPA_A < 0
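
As a minimal sketch of how the conversion-rate hypotheses above become a test statistic, here is a pooled two-proportion z-test using only the Python standard library (the counts are hypothetical, for illustration only):

from math import sqrt
from statistics import NormalDist

# Hypothetical counts, for illustration only
conversions_a, visitors_a = 480, 10_000   # control (A)
conversions_b, visitors_b = 540, 10_000   # variant (B)

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b

# Pooled standard error under H0: p_B - p_A = 0
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))

z = (p_b - p_a) / se
p_value = 1 - NormalDist().cdf(z)  # one-sided: H1 is p_B - p_A > 0
print(f"lift = {p_b - p_a:+.4f}, z = {z:.2f}, one-sided p = {p_value:.4f}")

For the CPA example, where lower is better, the same structure applies with the sign of the alternative flipped.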

(2) Type I and Type II errors and how they map to product risk

Every hypothesis test trades off two kinds of mistakes. The key is to map them to business consequences.

Type I error (false positive)

Definition: You ship (reject H0) even though the change does not truly help (or is worse).

Probability control: The significance level alpha is the maximum long-run rate of Type I errors you are willing to accept (e.g., 5%).

Product/marketing risk mapping:

  • Shipping a change that hurts revenue or retention
  • Rolling out a misleading “win” that later regresses
  • Wasting engineering and opportunity cost on a non-improvement

Type II error (false negative)

Definition: You do not ship (fail to reject H0) even though the change truly helps.

Probability control: beta is the Type II error rate; power = 1 - beta is the chance you detect a real effect of the size you care about.

Product/marketing risk mapping:

  • Missing a real uplift (lost revenue, slower growth)
  • Discarding a good idea and moving on prematurely
  • Under-investing in improvements because “tests never win”
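
To make beta and power concrete, here is a rough normal-approximation power sketch (the baseline rate, lift, and sample size are illustrative assumptions, not numbers from the text):

from math import sqrt
from statistics import NormalDist

def approx_power_one_sided(p_base, lift, n_per_arm, alpha=0.05):
    # Normal-approximation power for a one-sided two-proportion test.
    se = sqrt(2 * p_base * (1 - p_base) / n_per_arm)  # SE of the lift near p_base
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(lift / se - z_crit)

# Power to detect a +0.3pp lift on a 5% baseline with 20,000 users per arm
print(f"power = {approx_power_one_sided(0.05, 0.003, 20_000):.2f}")

Here power comes out around 0.39 (beta around 0.61): most runs of this test would miss a real +0.3pp lift, which is exactly the Type II risk described above.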

Choosing which error is more costly

Different contexts imply different tolerances:

  • High downside risk (payments, pricing, trust/safety): prioritize avoiding Type I errors → consider smaller alpha and stronger decision thresholds.
  • High upside/low downside (small UI tweaks with guardrails): you may accept a higher Type I rate to move faster, but compensate with rollout monitoring.

  Decision             | Reality: no improvement        | Reality: improvement exists
  Ship / declare win   | Type I error (false positive)  | Correct decision
  Don’t ship / no win  | Correct decision               | Type II error (false negative)
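
One way to internalize alpha as a long-run rate is to simulate A/A tests, where any declared win is by definition a false positive; a sketch with made-up traffic numbers:

import random
from math import sqrt
from statistics import NormalDist

random.seed(7)
alpha, p_true, n, trials = 0.05, 0.05, 2_000, 1_000
false_positives = 0

for _ in range(trials):
    # Both arms share the same true rate: any "win" is a Type I error.
    conv_a = sum(random.random() < p_true for _ in range(n))
    conv_b = sum(random.random() < p_true for _ in range(n))
    p_a, p_b = conv_a / n, conv_b / n
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    false_positives += p_value < alpha

print(f"false positive rate = {false_positives / trials:.3f}")

With a valid test, the printed rate should hover near the chosen alpha of 0.05.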

(3) Relationship between p-values and confidence intervals

A p-value and a confidence interval are two views of the same evidence, as long as they are built from the same model and assumptions.

What a p-value is (in decision terms)

The p-value answers: if there were truly no difference (H0), how likely would it be to see results at least as extreme as those observed? Small p-values mean the observed data would be surprising under H0, so you may choose to reject H0.

How confidence intervals connect to hypothesis tests

For many common A/B test setups:

  • A two-sided hypothesis test at level alpha rejects H0 (no difference) exactly when the (1 − alpha) confidence interval for the effect does not include 0.
  • A one-sided test at level alpha corresponds to checking whether a (1 − 2alpha) two-sided interval excludes 0 in the favorable direction, or using a one-sided interval bound.

Practical interpretation: the interval tells you both (a) whether “no effect” is plausible and (b) the range of effects consistent with the data. The p-value compresses that into a single number for the “is there evidence against H0?” question.

Step-by-step: using the interval to make the same call as a test

  1. Compute the estimated effect (e.g., lift = p_B - p_A).
  2. Compute the confidence interval for that effect at the chosen level (e.g., 95%).
  3. If the interval excludes 0, the result is “statistically significant” at the matching alpha (two-sided).
  4. Use the interval endpoints to assess whether the effect is large enough to matter (next section).
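
A sketch of that checklist in code, using a Wald (normal-approximation) interval for the difference in proportions; the counts are hypothetical:

from math import sqrt
from statistics import NormalDist

def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    # Step 1: estimated effect
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a
    # Step 2: Wald (unpooled) confidence interval for the difference
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return lift, lift - z * se, lift + z * se

lift, lo, hi = lift_with_ci(480, 10_000, 540, 10_000)
# Step 3: excluding 0 matches "significant at alpha = 0.05" (two-sided)
print(f"lift = {lift:+.4f}, 95% CI = [{lo:+.4f}, {hi:+.4f}], significant = {lo > 0 or hi < 0}")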

(4) Significance vs practical importance: avoiding “statistically significant but trivial”

Statistical significance answers “is it likely non-zero?” Practical importance answers “is it worth doing?” You need both.

Why trivial effects become significant

With large sample sizes, even tiny differences can produce very small p-values. That does not mean the change is valuable.
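
A quick illustration with made-up numbers: at five million visitors per arm, a +0.05pp lift on a 10% baseline comes out “significant” even though it may be commercially worthless:

from math import sqrt
from statistics import NormalDist

p_a, p_b, n = 0.1000, 0.1005, 5_000_000  # +0.05pp lift, huge samples
se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(z))
print(f"p = {p_value:.4f}")  # about 0.008: significant, yet the lift may be worthless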

Define a minimum effect that matters

Before running the test, define a minimum practical effect (sometimes called a minimum detectable effect, or MDE, in planning, or a business threshold in decision rules): the smallest lift that justifies shipping given costs, risks, and opportunity cost.

Examples of practical thresholds:

  • Conversion rate: “We only ship if lift is at least +0.3 percentage points.”
  • Revenue per visitor: “We need at least +$0.02 per visitor to cover increased incentives.”
  • Retention: “We need +0.5pp absolute day-7 retention to justify complexity.”

Use the confidence interval to enforce practical importance

Instead of asking only whether 0 is excluded, ask whether the interval supports effects above your threshold; a short sketch follows the list below.

  • Weak practical evidence: interval excludes 0 but includes values below the threshold (could be real but too small).
  • Strong practical evidence: interval is entirely above the threshold (consistent with meaningful uplift).
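
A minimal way to encode that distinction (the function name and labels are illustrative, not a standard API):

def practical_evidence(ci_low, ci_high, threshold):
    # Classify a confidence interval for a "higher is better" effect
    # against a pre-set practical threshold.
    if ci_low > threshold:
        return "strong: entire interval clears the threshold"
    if ci_low > 0:
        return "weak: real but possibly too small to matter"
    return "insufficient: 'no effect' is still plausible"

# An interval of [+0.01pp, +0.39pp] against a +0.30pp threshold, as proportions
print(practical_evidence(0.0001, 0.0039, 0.003))  # weak practical evidence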

(5) Setting a clear decision rule before looking at results

A good decision rule is explicit, pre-committed, and ties statistical evidence to business action. This reduces “moving the goalposts” after seeing the data.

A template decision rule

Define:

  • Primary metric (what you optimize)
  • Direction (increase/decrease)
  • Alpha (Type I tolerance)
  • Practical threshold (minimum effect worth shipping)
  • Guardrails (metrics that must not degrade beyond a limit)

Example rule (ship if meaningful and safe):

  • Ship variant B if: (a) estimated effect on conversion is > +0.3pp, and (b) the 95% confidence interval for p_B - p_A is entirely above 0, and (c) guardrail metrics (e.g., refund rate, latency) do not worsen beyond pre-set limits.

Step-by-step checklist (pre-results)

  1. Write H0/H1 in business language (what “no improvement” means).
  2. Pick alpha based on downside risk of a false win.
  3. Set the practical threshold (minimum worthwhile effect).
  4. Choose the evidence criterion: p-value cutoff and/or interval rule.
  5. Write the ship/no-ship rule in one sentence.
  6. Decide what you will do in ambiguous cases (e.g., iterate, run longer, or treat as no-ship).

Ambiguous-case policy examples (combined into a single sketch after the list):

  • If interval includes 0: do not ship; consider redesign or more data only if the upper bound is promising.
  • If interval excludes 0 but does not clear the practical threshold: treat as “not worth shipping,” unless there are strategic reasons (e.g., reduces support cost) captured by another metric.
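
Putting the evidence criterion, the practical threshold, guardrails, and the ambiguous-case policy together as one pre-registered function (a sketch: the outcome labels and the “promising upper bound” cutoff are assumptions, and the ship condition uses the stricter “interval entirely above threshold” variant):

def decide(ci_low, ci_high, threshold, guardrails_ok, promising_upper=None):
    # Pre-registered three-outcome rule for a "higher is better" metric.
    if not guardrails_ok:
        return "no-ship: guardrail breached"
    if ci_low > threshold:
        return "ship: meaningful and statistically supported"
    if ci_low > 0:
        return "no-ship: real but below the practical threshold"
    if promising_upper is not None and ci_high > promising_upper:
        return "iterate: inconclusive but upper bound is promising"
    return "no-ship: no evidence of improvement"

print(decide(ci_low=-0.0002, ci_high=0.0072, threshold=0.003,
             guardrails_ok=True, promising_upper=0.006))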

(6) Examples: how different alpha levels change conclusions (and why it matters)

Example A: same data, different alpha → different decision

Suppose your test yields a p-value of 0.04 for a two-sided test of “no difference.”

  • With alpha = 0.05: 0.04 < 0.05 → reject H0 → “statistically significant.”
  • With alpha = 0.01: 0.04 > 0.01 → fail to reject H0 → “not statistically significant.”

Why this matters: choosing a stricter alpha reduces false wins but increases the chance you miss real improvements. If shipping a false win is costly (e.g., pricing), you may require alpha = 0.01. If iteration speed is critical and downside is limited, 0.05 might be acceptable.

Example B: confidence interval view of the same situation

Assume the estimated lift is +0.20pp and the 95% confidence interval is [+0.01pp, +0.39pp].

  • Because the interval excludes 0, it aligns with “significant at 0.05” (two-sided).
  • But if your practical threshold is +0.30pp, the interval includes values below that threshold, so the data do not strongly support a practically meaningful lift.

Decision implication: you might choose “do not ship” even though the result is statistically significant, because the likely lift could be too small to justify the change.

Example C: stricter alpha can prevent a risky rollout

Imagine a high-risk change (e.g., checkout flow). You set alpha = 0.01. The test yields p-value 0.02 and an interval that barely excludes 0 at 95% but would include 0 at 99%.

  • At alpha = 0.05, you might ship.
  • At alpha = 0.01, you do not ship because the evidence is not strong enough given the downside risk.

Operational takeaway: alpha is a policy choice about risk tolerance, not a universal constant.

Example D: how alpha interacts with a “ship if effect > threshold” rule

Suppose your rule is: ship if the effect is > +0.3pp and the interval excludes 0. You observe:

  • Estimated lift: +0.35pp
  • 95% CI: [+0.05pp, +0.65pp]
  • 99% CI: [−0.02pp, +0.72pp]

Then:

  • Under a 95% rule (roughly alpha=0.05 two-sided), you pass “excludes 0” and the point estimate clears the threshold → likely ship (assuming guardrails pass).
  • Under a 99% rule (roughly alpha=0.01), the interval includes 0 → you would not ship, even though the point estimate is above the threshold.

Why it matters: stricter alpha demands more certainty that the improvement is real, which is appropriate when the cost of a false win is high. If you want to keep the same alpha but enforce practical importance more strongly, you can instead require the interval to be entirely above the practical threshold (not just above 0).
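
To approximately reproduce Example D’s intervals, one can back out the standard error from the stated 95% half-width (an assumption; the text’s endpoints are rounded, so the implied 99% interval will not match them exactly):

from statistics import NormalDist

estimate = 0.35          # +0.35pp, in percentage points
half_width_95 = 0.30     # from the stated 95% CI [+0.05, +0.65]
se = half_width_95 / NormalDist().inv_cdf(0.975)  # about 0.153pp

for conf in (0.95, 0.99):
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    lo, hi = estimate - z * se, estimate + z * se
    print(f"{conf:.0%} CI = [{lo:+.2f}pp, {hi:+.2f}pp], excludes 0 = {lo > 0}")

The 95% line excludes 0 and the 99% line does not, reproducing the two decisions above.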

# Example pre-registered decision rule (two-sided alpha = 0.05)
def should_ship(ci_lower_95, point_estimate, practical_threshold, guardrails_ok):
    # Ship only if the 95% CI excludes 0 in the favorable direction,
    # the point estimate clears the practical threshold, and guardrails pass.
    return ci_lower_95 > 0 and point_estimate > practical_threshold and guardrails_ok

Now answer the exercise about the content:

Which decision best reflects how to combine statistical significance and practical importance when choosing whether to ship an A/B test variant?


Answer: A good decision rule ties business action to both evidence and value: exclude 0 at the chosen alpha (statistical significance) and ensure the effect is meaningfully large using a practical threshold, while also checking guardrails.

Next chapter

Sample Size Intuition for A/B Testing: Power, Detectable Effects, and Runtime Expectations
