Hypothesis testing as a structured way to weigh evidence
Hypothesis testing is a decision framework for evaluating whether observed data are sufficiently incompatible with a default explanation. It does not “prove” a claim; it quantifies how surprising the data would be if the default explanation were true, and then applies a pre-chosen decision rule.
Core ingredients
- Null hypothesis (H0): the default model, often “no effect” or “no difference” (e.g., a mean equals a benchmark, two groups have equal proportions).
- Alternative hypothesis (H1 or HA): what you would consider if the null seems incompatible with the data (e.g., the mean is higher, the groups differ).
- Test statistic: a single number computed from the sample that measures how far the data are from what H0 predicts, in standardized units (e.g., z, t).
- p-value: the probability, assuming H0 is true, of observing a test statistic at least as extreme as the one you got (in the direction(s) specified by HA).
- Decision threshold (α): a pre-set cutoff (commonly 0.05) for how small a p-value must be to label results “statistically significant.”
What a p-value is (and is not)
Correct interpretation: A p-value is a measure of compatibility between the observed data and the null model. Smaller p-values indicate that data at least as extreme as those observed would occur less often if H0 were true.
Common misinterpretations to avoid:
- Not “the probability that H0 is true.” Hypothesis tests do not output P(H0 | data).
- Not “the probability the results happened by chance.” Chance is already built into the probability statement under H0.
- Not “the effect is big.” Statistical significance can occur for tiny effects with large samples.
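To make the definition concrete, here is a minimal simulation sketch with an assumed setup (H0 says the population mean is 0, samples of size 30, and an observed t statistic of 2.1; none of these numbers come from the text). It approximates the p-value as the fraction of datasets generated under H0 whose statistic is at least as extreme as the observed one, and checks it against the analytic value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sim, t_obs = 30, 100_000, 2.1   # assumed sample size and observed statistic

# Generate many datasets under H0 (mean 0) and compute each dataset's t statistic.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
t_null = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# p-value: fraction of null statistics at least as extreme as the observed one (two-sided).
p_sim = np.mean(np.abs(t_null) >= abs(t_obs))
p_exact = 2 * stats.t.sf(abs(t_obs), df=n - 1)
print(f"simulated p ≈ {p_sim:.3f}, analytic p ≈ {p_exact:.3f}")
```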
One-sided vs two-sided alternatives
- Two-sided tests ask whether the parameter differs in either direction (e.g., mean is not equal to 50).
- One-sided tests ask whether it differs in a specific direction (e.g., mean is greater than 50). Use one-sided tests only when the opposite direction would not be acted upon and was not of interest before seeing data.
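As a small numeric illustration (assuming a z statistic of 1.8 has already been computed; the number is arbitrary), the one-sided p-value in the pre-specified direction is half the two-sided p-value:

```python
from scipy import stats

z = 1.8                                   # assumed, already-computed z statistic
p_one_sided = stats.norm.sf(z)            # HA: parameter is greater than the null value
p_two_sided = 2 * stats.norm.sf(abs(z))   # HA: parameter differs in either direction
print(round(p_one_sided, 3), round(p_two_sided, 3))   # roughly 0.036 vs 0.072
```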
Step-by-step workflow for a hypothesis test
- State the question in parameter form (mean difference, proportion difference, etc.).
- Write H0 and HA (choose one- vs two-sided intentionally).
- Choose α (decision threshold) before looking at results.
- Check assumptions relevant to the test (e.g., independence, approximate normality for a t-test, adequate counts for a proportion z-test).
- Compute the test statistic: (estimate − null value) ÷ standard error under H0.
- Compute the p-value from the reference distribution (t or normal, depending on the test and assumptions).
- Make a decision: if p ≤ α, “reject H0”; otherwise “fail to reject H0.”
- Write an interpretation in context: what the decision implies and what it does not imply. Include an effect size estimate (and ideally a confidence interval) to address magnitude.
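For a z-type test, the computational steps above can be condensed into a short helper. This is only a sketch: it assumes the estimate, the null value, and the standard error under H0 have already been worked out, and the numbers in the usage line are purely hypothetical.

```python
from scipy import stats

def z_test(estimate, null_value, se_h0, alpha=0.05, two_sided=True):
    """Return (test statistic, p-value, decision) for a z-type test."""
    z = (estimate - null_value) / se_h0     # standardized distance from what H0 predicts
    if two_sided:
        p = 2 * stats.norm.sf(abs(z))       # extreme in either direction counts
    else:
        p = stats.norm.sf(z)                # HA: parameter > null value
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    return z, p, decision

# Hypothetical numbers, only to show the call pattern:
print(z_test(estimate=0.104, null_value=0.09, se_h0=0.009))
```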
Decision errors: Type I, Type II, and power
Type I error (false positive)
A Type I error occurs when you reject H0 even though H0 is true. The probability of a Type I error is controlled by α (e.g., α = 0.05 means that if H0 were true and you repeated the study many times, about 5% of tests would reject H0 just due to sampling variability).
Practical consequence example: A company rolls out a costly new process because a test suggests it improves conversion, but in reality it does not. The cost of the rollout is the cost of a false positive.
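A quick simulation sketch of the 5% claim above, under an assumed setting where H0: μ = 0 is true by construction: the long-run rejection rate lands near α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sim = 0.05, 40, 20_000

rejections = 0
for _ in range(n_sim):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)   # H0 is true by construction
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    rejections += (p <= alpha)

print(f"Type I error rate ≈ {rejections / n_sim:.3f}")  # close to 0.05
```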
Type II error (false negative)
A Type II error occurs when you fail to reject H0 even though HA is true (a real effect exists). Its probability is often denoted β.
Practical consequence example: A hospital fails to detect that a new protocol reduces complications, so it does not adopt it. The missed improvement is the cost of a false negative.
Power (conceptual introduction)
Power is the probability of correctly rejecting H0 when a specific alternative is true: power = 1 − β. Higher power means you are more likely to detect an effect of a given size.
Power increases when:
- Sample size increases (standard errors shrink).
- True effect size is larger (signal is stronger relative to noise).
- Variability is lower (measurements are more consistent).
- α is larger (easier to reject H0, but increases Type I error risk).
- One-sided tests are used appropriately (more power in the specified direction, but only justified when direction is pre-specified and meaningful).
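A simulation sketch of the first two points, using assumed numbers (true effects of 0.2 and 0.5 standard deviations, samples of 20 and 80, one-sided test of H0: μ = 0 against μ > 0): power rises with both sample size and effect size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulated_power(true_mean, n, alpha=0.05, sigma=1.0, n_sim=5_000):
    """Fraction of simulated studies that reject H0: mu = 0 in favor of mu > 0."""
    samples = rng.normal(loc=true_mean, scale=sigma, size=(n_sim, n))
    t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
    p = stats.t.sf(t, df=n - 1)             # one-sided p-values
    return np.mean(p <= alpha)

for true_mean in (0.2, 0.5):                # smaller vs larger true effect (in SD units)
    for n in (20, 80):                      # smaller vs larger sample size
        print(f"effect = {true_mean}, n = {n}, power ≈ {simulated_power(true_mean, n):.2f}")
```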
Example 1: One-sample test for a mean (interpretation-focused)
Scenario: A bottling machine is supposed to fill 500 mL on average. You sample n = 25 bottles and observe a sample mean of 503 mL with sample standard deviation s = 6 mL. You want to check whether the machine is overfilling.
1) Set up hypotheses
- H0: μ = 500
- HA: μ > 500 (one-sided; overfilling is the concern)
2) Choose α
Let α = 0.05.
3) Test statistic (one-sample t)
Use a t-test because the population standard deviation is unknown and n is moderate.
t = (x̄ − μ0) / (s / √n) = (503 − 500) / (6 / √25) = 3 / (6/5) = 2.5
Degrees of freedom: df = n − 1 = 24.
4) p-value
The p-value is P(T ≥ 2.5) for a t distribution with 24 df, which is approximately 0.01 (the exact value depends on your calculator or software).
5) Decision and interpretation
- Decision: Since p ≈ 0.01 ≤ 0.05, reject H0.
- Interpretation: If the true mean fill were 500 mL, getting a sample mean as high as 503 mL (or higher) with this level of variability and sample size would be uncommon (about 1% of the time). This provides evidence that the machine’s mean fill is above 500 mL.
- What this does not say: It does not say there is a 99% chance the machine is overfilling; it says the observed data are not very compatible with μ = 500.
Practical note
Even if statistically significant, you should compare the estimated overfill (about 3 mL) to operational tolerances and cost implications (see the section on practical vs statistical significance).
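For readers following along in software, the example can be reproduced from the summary statistics alone; this scipy sketch should match the values above up to rounding.

```python
import math
from scipy import stats

n, xbar, s, mu0 = 25, 503.0, 6.0, 500.0
t = (xbar - mu0) / (s / math.sqrt(n))         # = 2.5
p = stats.t.sf(t, df=n - 1)                   # one-sided: P(T >= t) with 24 df
print(f"t = {t:.2f}, one-sided p ≈ {p:.4f}")  # p ≈ 0.01
```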
Example 2: Two-sample comparison of proportions (A/B test)
Scenario: An e-commerce team tests a new checkout design. In the control group, n1 = 1000 users see the old design and x1 = 80 purchase (8.0%). In the treatment group, n2 = 1000 users see the new design and x2 = 100 purchase (10.0%). Is there evidence the conversion rate differs?
1) Set up hypotheses
- H0: p1 = p2
- HA: p1 ≠ p2 (two-sided; any change matters)
2) Choose α
Let α = 0.05.
3) Compute sample proportions
p̂1 = 80/1000 = 0.08
p̂2 = 100/1000 = 0.10
4) Test statistic (two-proportion z-test)
Under H0, the standard approach uses a pooled proportion:
p̂_pool = (x1 + x2) / (n1 + n2) = (80 + 100) / 2000 = 0.09
Standard error under H0:
SE0 = √( p̂_pool(1 − p̂_pool)(1/n1 + 1/n2) )
Plug in values:
SE0 = √(0.09 * 0.91 * (1/1000 + 1/1000)) ≈ √(0.0819 * 0.002) ≈ √0.0001638 ≈ 0.0128
z statistic:
z = (p̂2 − p̂1) / SE0 = (0.10 − 0.08) / 0.0128 ≈ 1.56
5) p-value
Two-sided p-value: p = 2·P(Z ≥ 1.56) ≈ 2·0.059 ≈ 0.118.
6) Decision and interpretation
- Decision: Since p ≈ 0.118 > 0.05, fail to reject H0.
- Interpretation: The observed 2 percentage point difference (10% vs 8%) is not especially unusual under a “no difference” model given these sample sizes; the data are fairly compatible with equal conversion rates. This does not show the new design has no effect—it shows the study did not provide strong evidence of a difference at α = 0.05.
- Next step thinking: If a 2-point lift would be valuable, you might need more data (higher power) to detect that magnitude reliably.
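The calculation can be reproduced directly from the counts; this sketch mirrors the pooled-standard-error approach used above and should agree with the quoted values up to rounding.

```python
import math
from scipy import stats

x1, n1 = 80, 1000    # control: old design
x2, n2 = 100, 1000   # treatment: new design

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z = (p2 - p1) / se0
p_two_sided = 2 * stats.norm.sf(abs(z))
print(f"z ≈ {z:.2f}, two-sided p ≈ {p_two_sided:.3f}")   # about 1.56 and 0.118
```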
Writing results without overstating them
Useful phrasing templates
- When rejecting H0: “Assuming [null model], results this extreme would be unlikely (p = ...). This provides evidence consistent with [alternative direction], with an estimated effect of ...”
- When failing to reject H0: “The data do not provide strong evidence against [null model] at α = .... The estimate is ..., but uncertainty is large enough that both small negative and positive effects remain plausible.”
Avoid these statements
- “We proved the null is false.”
- “There is a 95% chance the alternative is true.”
- “No significance means no effect.”
Practical significance vs statistical significance
Statistical significance answers: “Are the data incompatible with H0 beyond what we’d expect from random sampling variability?”
Practical significance answers: “Is the effect large enough to matter for decisions, costs, safety, or user experience?”
Why they can disagree
- Large samples can make tiny effects statistically significant: With enough data, even a 0.2% conversion lift may yield p < 0.05, yet be too small to justify engineering effort.
- Small samples can miss meaningful effects: A clinically important reduction in complications may not reach p < 0.05 if the study is underpowered.
Decision-focused checklist
- Define a minimum practically important difference (MPID) before analyzing data (e.g., “at least +1.5 percentage points conversion” or “at least −5 mmHg blood pressure”).
- Report an effect size (difference in means, risk difference, relative risk, etc.), not only a p-value.
- Use uncertainty to judge plausibility: Ask whether effects at or above the MPID are plausible given the data (often via a confidence interval, even when the hypothesis test is the headline).
- Consider costs of errors: If Type I errors are expensive (false alarms), choose a smaller α; if Type II errors are expensive (missed improvements), plan for higher power (often via larger n).
Mini-illustration: statistically significant but practically small
Suppose an A/B test with very large samples finds a conversion increase from 8.00% to 8.10% with p = 0.001. The data are strongly incompatible with “no difference,” but the lift is only 0.10 percentage points. Whether to ship depends on expected revenue impact, implementation cost, and potential side effects—not on p alone.
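The same phenomenon can be checked numerically. The sketch below assumes, purely for illustration, two groups of 2,000,000 users each (the illustration above does not state sample sizes): with groups that large, even a 0.10 percentage-point lift yields a very small p-value.

```python
import math
from scipy import stats

n1 = n2 = 2_000_000            # assumed group sizes, chosen only for illustration
p1, p2 = 0.0800, 0.0810        # 8.00% vs 8.10% conversion

p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
se0 = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se0
p_value = 2 * stats.norm.sf(abs(z))
print(f"z ≈ {z:.2f}, p ≈ {p_value:.4f}")   # statistically significant, lift still tiny
```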
Mini-illustration: practically meaningful but not statistically significant
Suppose a pilot study suggests a 2 percentage point conversion lift (8% to 10%) but yields p = 0.12. The estimate could still be valuable; the evidence is simply not strong enough at α = 0.05. A sensible response might be to run a longer test or reduce measurement noise rather than concluding “no effect.”
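A simulation sketch of that reasoning, assuming the true rates really are 8% and 10%: with 1000 users per arm, a two-sided test at α = 0.05 detects the lift only a minority of the time, while a substantially larger sample detects it far more reliably.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_power(n_per_arm, p_control=0.08, p_treatment=0.10,
                    alpha=0.05, n_sim=10_000):
    """Fraction of simulated A/B tests that reject H0: p1 = p2 (two-sided z-test)."""
    x1 = rng.binomial(n_per_arm, p_control, size=n_sim)
    x2 = rng.binomial(n_per_arm, p_treatment, size=n_sim)
    p1, p2 = x1 / n_per_arm, x2 / n_per_arm
    p_pool = (x1 + x2) / (2 * n_per_arm)
    se0 = np.sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
    z = (p2 - p1) / se0
    return np.mean(2 * stats.norm.sf(np.abs(z)) <= alpha)

for n in (1000, 4000):
    print(f"n per arm = {n}, power ≈ {simulated_power(n):.2f}")
```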
Common pitfalls and how to avoid them
- Changing α after seeing results: Decide thresholds in advance to keep Type I error control meaningful.
- Multiple testing without adjustment: Testing many metrics or segments inflates false positives. Plan primary outcomes and limit exploratory claims.
- Stopping early based on p-values: Repeated looks at the data can inflate Type I error unless using appropriate sequential methods.
- Equating “non-significant” with “equivalent”: To argue two options are practically the same, you need an equivalence or non-inferiority framework with a pre-defined margin, not a standard “fail to reject” result.