1) A standardized results narrative (a decision-focused reporting framework)
A/B test results are only useful when they translate into a decision that is defensible under uncertainty. A standardized narrative keeps every readout comparable across teams and prevents “statistical output dumps” that lack context.
Template: the one-page narrative
- Hypothesis: what change is expected, for whom, and why it should move the primary metric.
- Setup: population, variants, allocation, start/end times, exposure definition, and any planned exclusions.
- Metric definitions: primary metric, guardrails, and how they are computed (denominators matter).
- Data quality checks: SRM check, logging completeness, exposure consistency, and outlier handling.
- Effect size: point estimate of lift (absolute and/or relative) and the business translation (e.g., revenue per 1M sessions).
- Uncertainty: interval estimate, probability of harm, and whether results clear decision thresholds.
- Decision: ship/iterate/stop, plus what to do next (rollout plan, follow-up tests, monitoring).
Step-by-step: how to write the narrative from your analysis outputs
- Start with the decision question: “Should we ship variant B to 100%?” or “Should we iterate and re-test?” This prevents drifting into “Is p<0.05?” as the goal.
- State the hypothesis in measurable terms: include the mechanism and the expected direction. Example: “Reducing checkout form fields will increase completed purchases by reducing friction.”
- Describe the setup with operational precision: who was eligible, what counted as exposure, and what time window outcomes were measured in (e.g., conversion within 24h of first exposure).
- Define metrics in one line each: include numerator/denominator and unit of analysis (user, session, account). If there is any ambiguity, add a short formula.
- Summarize data quality checks: report pass/fail and the magnitude of any issues. If a check fails, state the mitigation or why the test is invalid.
- Report effect size first, then uncertainty: lead with “B increased X by Y” and immediately qualify with an interval and risk of harm.
- Make the decision explicit: tie it to pre-agreed thresholds (minimum lift, guardrails, and acceptable risk).
- List follow-ups: what you will monitor post-launch and what you will validate longer-term.
Metric definition examples (copy/paste friendly)
- Primary: Purchase conversion rate = purchasers / exposed users.
- Guardrail: Refund rate = refunds / purchases (measured within 14 days).
- Guardrail: Page load p95 (ms) on checkout page among exposed sessions.
- Secondary diagnostic: Add-to-cart rate = add-to-cart users / exposed users.
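To make these definitions unambiguous, it can help to show the computation next to them. Below is a minimal sketch in pandas, assuming a hypothetical user-level table with one row per exposed user; the column names (`variant`, `purchased`, `refunded`, `added_to_cart`) are illustrative, not a prescribed schema. The latency guardrail needs session-level timing data and is omitted here.

```python
import pandas as pd

def compute_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the metrics above per variant from a hypothetical user-level table.

    Assumed (illustrative) columns:
      variant       - 'control' or 'variant_b'
      purchased     - 1 if the user purchased within the outcome window, else 0
      refunded      - 1 if that purchase was refunded within 14 days, else 0
      added_to_cart - 1 if the user added any item to cart, else 0
    """
    g = df.groupby("variant")
    exposed = g.size()                    # exposed users (primary denominator)
    purchases = g["purchased"].sum()
    refunds = g["refunded"].sum()
    carts = g["added_to_cart"].sum()
    return pd.DataFrame({
        "exposed_users": exposed,
        "purchase_conversion": purchases / exposed,  # purchasers / exposed users
        "refund_rate": refunds / purchases,          # refunds / purchases (not / exposed)
        "add_to_cart_rate": carts / exposed,         # diagnostic
    })
```

Note that the refund-rate denominator is purchases, not exposed users, matching the definition above; denominators are exactly where metric disputes tend to start.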
Data quality checks to include (minimal set)
- Sample Ratio Mismatch (SRM): observed allocation vs intended allocation; report p-value or thresholded flag.
- Exposure integrity: users should see only one variant; report the cross-over rate.
- Logging completeness: missing events rate by variant; key funnel events present.
- Population consistency: eligibility rules stable; no major traffic source shifts that differentially affect variants.
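For the SRM check above, a common approach is a chi-square goodness-of-fit test of observed assignment counts against the intended split. A minimal sketch follows; the 0.001 flagging threshold is a common convention, not a universal rule.

```python
from scipy.stats import chisquare

def srm_check(n_control: int, n_treatment: int,
              expected_split=(0.5, 0.5), alpha: float = 0.001) -> dict:
    """Flag a Sample Ratio Mismatch via a chi-square goodness-of-fit test."""
    total = n_control + n_treatment
    expected = [total * expected_split[0], total * expected_split[1]]
    stat, p_value = chisquare(f_obs=[n_control, n_treatment], f_exp=expected)
    return {"p_value": p_value, "srm_flag": p_value < alpha}

# Example: an intended 50/50 split with a suspicious observed imbalance.
print(srm_check(100_000, 98_000))  # tiny p-value -> flag; investigate before trusting results
```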
2) Expressing uncertainty clearly (and avoiding overclaiming)
Decision-makers need uncertainty translated into risk. The goal is not to hide complexity, but to express it in a way that supports action: “What could go wrong if we ship?” and “How confident are we that this is worth it?”
Use intervals as ranges of plausible impact
Report an interval for the effect on the primary metric and translate it into business units. Avoid implying certainty from a single point estimate.
- Good: “Lift is +1.2% relative (95% interval: +0.1% to +2.3%), which at a ~5% baseline conversion corresponds to roughly +5 to +115 additional purchases per 100k exposed users.”
- Avoid: “Variant B increases purchases by 1.2%.” (without uncertainty)
Report probability of harm (risk framing)
Intervals can be reframed into a simple risk statement: the chance the true effect is negative or violates a guardrail. This is especially useful when the interval crosses zero.
- Primary harm risk: P(lift < 0).
- Guardrail risk: P(refund lift > +0.2pp) or P(latency p95 > +50ms).
When you cannot compute a probability directly in your standard tooling, approximate it from the interval using a normal approximation, or at minimum state whether the interval includes harmful values.
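As a rough sketch of that approximation: back out the standard error implied by a 95% interval, then ask how much probability mass lies on the harmful side of the threshold. The numbers below reuse the illustrative lift and refund figures from this page; this is an approximation, not exact inference.

```python
from scipy.stats import norm

def prob_beyond_threshold(point, ci_low, ci_high, threshold=0.0, harm_is_below=True):
    """Approximate the probability that the true effect is on the harmful side of a threshold."""
    se = (ci_high - ci_low) / (2 * 1.96)      # SE implied by a 95% interval
    z = (threshold - point) / se
    return norm.cdf(z) if harm_is_below else norm.sf(z)

# Primary harm risk P(lift < 0), for a +1.2% lift with a +0.1% to +2.3% interval: ~1.6%
print(prob_beyond_threshold(1.2, 0.1, 2.3, threshold=0.0, harm_is_below=True))

# Guardrail risk P(refund lift > +0.2pp), for a +0.05pp lift with a -0.02pp to +0.12pp interval
print(prob_beyond_threshold(0.05, -0.02, 0.12, threshold=0.2, harm_is_below=False))
```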
Separate “statistical uncertainty” from “decision uncertainty”
- Statistical uncertainty: sampling variability (captured by intervals).
- Decision uncertainty: whether the plausible range includes outcomes that are unacceptable or not worth the effort (captured by thresholds and risk tolerance).
Language that prevents overclaiming
| Situation | Prefer | Avoid |
|---|---|---|
| Interval mostly positive | “Results suggest an improvement; plausible lift range is …” | “Proves B is better” |
| Interval crosses zero | “Inconclusive; could be small harm or small benefit” | “No effect” |
| Guardrail risk present | “Potential trade-off; shipping requires mitigation/monitoring” | “Safe” |
| Small but precise lift | “Statistically clear but may be below business threshold” | “Significant, so ship” |
Step-by-step: turning stats into a risk statement
- Write the decision threshold: e.g., “Ship if primary lift ≥ +1.0% and P(harm) ≤ 10% and no guardrail exceeds threshold.”
- Extract the interval: e.g., +0.1% to +2.3%.
- Check threshold coverage: does the lower bound clear +1.0%? If not, the upside exists but is not guaranteed.
- Compute/approximate P(harm): if interval crosses 0, harm risk is non-trivial; quantify if possible.
- Translate to business units: show best-case/base/worst-case within the interval.
- State the action with the risk: “Ship to 25% with monitoring” vs “Iterate and re-test.”
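A compact sketch of these steps with illustrative numbers: a +1.2% relative lift with a +0.1% to +2.3% interval, the +1.0% MPL and 10% risk tolerance above, an assumed ~4.8% baseline conversion, and 1M exposed users per month. The normal approximation mirrors the one shown earlier.

```python
from scipy.stats import norm

def risk_statement(rel_lift, ci_low, ci_high, mpl, max_p_harm,
                   baseline_rate, exposed_per_period):
    """Lifts are relative percentages (1.2 means +1.2%); baseline_rate is the control conversion."""
    se = (ci_high - ci_low) / (2 * 1.96)           # SE implied by a 95% interval
    p_harm = norm.cdf((0.0 - rel_lift) / se)       # P(true lift < 0), normal approximation
    extra = lambda lift: round(exposed_per_period * baseline_rate * lift / 100)
    return {
        "p_harm": round(p_harm, 3),
        "point_clears_mpl": rel_lift >= mpl,
        "lower_bound_clears_mpl": ci_low >= mpl,
        "risk_ok": p_harm <= max_p_harm,
        "extra_purchases_worst_base_best": (extra(ci_low), extra(rel_lift), extra(ci_high)),
    }

# Point estimate clears MPL but the lower bound does not:
# "ship to 25% with monitoring" or "iterate" rather than a full ship.
print(risk_statement(1.2, 0.1, 2.3, mpl=1.0, max_p_harm=0.10,
                     baseline_rate=0.048, exposed_per_period=1_000_000))
```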
3) Decision tables: ship / iterate / stop (with lift + guardrails)
A decision table makes the implicit explicit. It aligns stakeholders on what “good enough” means before debating interpretations.
Example decision criteria (customize to your org)
- Primary threshold: minimum practical lift (MPL) = +1.0% relative on conversion.
- Risk tolerance: P(primary harm) ≤ 10%.
- Guardrails: refund rate increase ≤ +0.2pp; latency p95 increase ≤ +50ms.
- Operational: no data quality failures; no major launch conflicts.
Decision table (single experiment)
| Outcome | Primary metric | Guardrails | Uncertainty / risk | Decision | Next step |
|---|---|---|---|---|---|
| Clear win | Point estimate ≥ MPL and interval mostly above 0 | All within limits | P(harm) low; downside bounded | Ship | Roll out gradually; monitor key metrics |
| Promising but uncertain | Point estimate ≥ MPL but interval includes < MPL or crosses 0 | Within limits | Meaningful chance effect is small/negative | Iterate | Run follow-up test, improve variant, or extend runtime if pre-planned |
| Trade-off | Primary improves | One guardrail near/over limit | Risk of harm on guardrail | Iterate or Stop | Mitigate guardrail impact; consider segmented rollout |
| Clear loss | Point estimate negative and interval mostly below 0 | May also worsen | High confidence of harm | Stop | Revert; document learnings |
| Invalid test | Any | Any | Data quality failure (e.g., SRM) | No decision | Fix instrumentation/design; re-run |
Step-by-step: building your decision table for a team
- Pick MPL for the primary metric (the smallest lift worth shipping).
- Define guardrail thresholds (what level of degradation is unacceptable).
- Choose a risk tolerance (e.g., max acceptable probability of harm).
- Write the “ship/iterate/stop” rules in one page and get stakeholder sign-off before running tests.
- Use the same table for every readout to reduce debate and bias.
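One way to keep the table consistent across readouts is to encode the rules once and reuse them. The sketch below is one possible encoding of the table above; teams may choose stricter or looser conditions, especially for what counts as a "clear win".

```python
def decide(point, ci_low, ci_high, mpl, guardrails_ok, data_quality_ok):
    """Map one readout onto the decision table above.

    point / ci_low / ci_high / mpl are lifts in the same units (e.g., relative %);
    guardrails_ok and data_quality_ok are booleans from your checks.
    """
    if not data_quality_ok:
        return "invalid test: no decision; fix instrumentation/design and re-run"
    if point < 0 and ci_high <= 0:
        return "clear loss: stop, revert, document learnings"
    if not guardrails_ok:
        return "trade-off: iterate or stop; mitigate guardrail impact before shipping"
    if point >= mpl and ci_low >= mpl:
        return "clear win: ship; roll out gradually and monitor"
    if point >= mpl:
        return "promising but uncertain: iterate (follow-up test or improved variant)"
    return "below minimum practical lift: iterate or stop"

# Example: point estimate clears MPL but the lower bound does not.
print(decide(point=1.2, ci_low=0.1, ci_high=2.3, mpl=1.0,
             guardrails_ok=True, data_quality_ok=True))
```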
4) Executive-ready summaries vs detailed appendices
Different audiences need different levels of detail. A good readout has a short, decision-ready summary and an appendix that allows auditability.
Executive-ready summary (example format)
Decision: ITERATE (do not ship to 100% yet) | Proposed action: ship to 10% behind a flag + run follow-up test
- What we tested: Variant B simplifies checkout (removes 2 optional fields) for all new users.
- Primary result: Conversion lift +1.2% (95% interval: +0.1% to +2.3%).
- Risk: Non-trivial chance the true lift is below the +1.0% MPL; small chance of harm (interval includes near-zero values).
- Guardrails: Refund rate +0.05pp (within +0.2pp limit); latency +8ms (within +50ms limit).
- Recommendation: Iterate on form validation UX and re-test to confirm lift clears MPL; optionally roll out to 10% to capture upside while monitoring refunds.
- Expected impact if true lift holds: +X purchases/month (range based on interval).
Detailed appendix (what to include)
- Experiment metadata: IDs, dates, targeting, allocation, exposure event, analysis window.
- Population counts: eligible, exposed, analyzed; exclusions with reasons.
- Data quality: SRM results, cross-over rate, missing events, sanity checks.
- Metric definitions: exact SQL/logic references; unit of analysis; deduping rules.
- Results tables: baseline rates, absolute/relative lifts, intervals, p-values if used, and guardrails.
- Sensitivity checks: alternative windows, removing outliers, segment splits (clearly labeled exploratory vs planned).
- Operational notes: incidents, outages, concurrent launches, marketing campaigns.
Appendix results table example
| Metric | Control | Variant | Abs diff | Rel lift | Interval (95%) | Decision threshold |
|---|---|---|---|---|---|---|
| Purchase conversion | 4.80% | 4.86% | +0.06pp | +1.2% | [+0.01pp, +0.11pp] | MPL: +0.05pp |
| Refund rate | 1.10% | 1.15% | +0.05pp | +4.5% | [-0.02pp, +0.12pp] | Max: +0.20pp |
| Checkout p95 latency | 820ms | 828ms | +8ms | +1.0% | [-5ms, +21ms] | Max: +50ms |
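For rate metrics, the Abs diff / Rel lift / Interval columns can be derived from raw counts. The sketch below uses a normal (Wald) interval on the difference in proportions, with hypothetical counts chosen to roughly match the conversion row above; your experimentation platform may use a different interval method.

```python
import math

def rate_diff_summary(successes_c, n_c, successes_v, n_v, z=1.96):
    """Absolute/relative lift and a 95% Wald interval for a difference in proportions."""
    p_c, p_v = successes_c / n_c, successes_v / n_v
    diff = p_v - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "abs_diff_pp": 100 * diff,
        "rel_lift_pct": 100 * diff / p_c,
        "ci95_pp": (100 * (diff - z * se), 100 * (diff + z * se)),
    }

# Hypothetical counts chosen to roughly match the conversion row above (4.80% vs 4.86%).
print(rate_diff_summary(successes_c=62_400, n_c=1_300_000,
                        successes_v=63_180, n_v=1_300_000))
```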
5) Post-decision follow-ups: monitoring and longer-term validation
Shipping is not the end of the experiment lifecycle. Post-decision follow-ups protect against regressions, novelty fade, and unanticipated interactions with other changes.
Two layers of follow-up
- Regression monitoring (short-term): detect if the shipped change breaks key metrics or drifts outside expected ranges.
- Longer-term impact validation: confirm the effect persists and does not shift to downstream metrics (e.g., retention, refunds, support tickets).
Step-by-step: a practical monitoring plan after shipping
- Define a monitoring window: e.g., first 72 hours (high alert) + first 2 weeks (stabilization).
- Choose a small set of “pager metrics”: primary metric proxy (faster signal), top guardrails, and system health.
- Set alert thresholds: based on historical variability and acceptable degradation (not “any change”).
- Roll out gradually: 10% → 25% → 50% → 100% with hold points to review metrics.
- Keep a holdout when feasible: a small persistent control can validate that the effect continues under real-world conditions.
- Schedule a post-launch review: compare realized impact vs expected interval; document deltas and hypotheses.
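For the alert-threshold step above, one simple approach is to derive bounds from recent historical variability rather than alerting on any movement. A sketch with illustrative daily refund-rate values; the window length and z-multiplier are assumptions to tune, not recommendations.

```python
import statistics

def alert_bounds(history, z=3.0):
    """Alert bounds from recent daily values: mean +/- z * standard deviation."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - z * sd, mean + z * sd

def check_today(value, history, z=3.0):
    lower, upper = alert_bounds(history, z)
    if value < lower or value > upper:
        return f"ALERT: {value:.4f} outside [{lower:.4f}, {upper:.4f}]"
    return "ok"

# Illustrative: 14 days of daily refund rates, then today's post-launch value.
history = [0.0110, 0.0108, 0.0113, 0.0111, 0.0109, 0.0112, 0.0110,
           0.0114, 0.0107, 0.0111, 0.0110, 0.0112, 0.0109, 0.0111]
print(check_today(0.0131, history))
```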
Validating longer-term impact (examples)
- Marketing: short-term CTR lift but check downstream conversion quality (refunds, churn, LTV proxy).
- Product: activation lift but verify retention cohorts and support contact rate.
- Revenue: conversion lift but validate average order value and margin are not harmed.
6) Capstone: complete experiment readout using a running example
Running example: an e-commerce checkout team tested a simplified checkout form (Variant B) against the current form (Control). Goal: increase purchase conversion without increasing refunds or slowing checkout.
Complete experiment readout (copy/paste and adapt)
Hypothesis
Simplifying the checkout form by removing two optional fields will reduce friction for new users, increasing purchase conversion. We expect no meaningful increase in refund rate and no meaningful latency impact.
Setup
- Population: new users (first-time purchasers) on web checkout.
- Variants: Control = current form; Variant B = simplified form (2 optional fields removed, same payment options).
- Allocation: 50/50 user-level randomization.
- Exposure definition: first checkout page view during the experiment window.
- Outcome window: purchase within 24 hours of first exposure; refunds measured within 14 days of purchase.
- Runtime: 14 days (pre-planned).
Metric definitions
- Primary: Purchase conversion rate = purchasers / exposed users.
- Guardrail 1: Refund rate = refunds within 14 days / purchases.
- Guardrail 2: Checkout page latency p95 (ms) among exposed sessions.
- Diagnostic: Add-to-cart rate = add-to-cart users / exposed users.
Data quality checks
- SRM: Passed (observed split 49.9% / 50.1%; no SRM flag).
- Cross-over: <0.1% of users saw both variants (acceptable; excluded from analysis).
- Logging: event completeness within normal range; no variant-specific drops detected.
- Incidents: one 20-minute checkout latency incident affected both variants equally (not variant-specific).
Results: effect size and uncertainty
| Metric | Control | Variant B | Effect | Uncertainty summary |
|---|---|---|---|---|
| Purchase conversion | 4.80% | 4.86% | +0.06pp (+1.2% rel) | 95% interval: +0.01pp to +0.11pp |
| Refund rate | 1.10% | 1.15% | +0.05pp | 95% interval: -0.02pp to +0.12pp |
| Checkout latency p95 | 820ms | 828ms | +8ms | 95% interval: -5ms to +21ms |
| Add-to-cart | 12.4% | 12.6% | +0.2pp | Directional support; not decision-driving |
Decision thresholds (pre-agreed)
- MPL (primary): at least +0.05pp absolute (≈ +1.0% relative) on purchase conversion.
- Risk tolerance: acceptable if the plausible downside does not include meaningful harm; prefer P(primary harm) ≤ 10%.
- Guardrails: refund increase ≤ +0.20pp; latency p95 increase ≤ +50ms.
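As a quick check of the primary result against these thresholds, the standard error implied by the reported 95% interval can be used to approximate the probability of harm (a normal approximation, not necessarily the method your platform uses):

```python
from scipy.stats import norm

point, ci_low, ci_high = 0.06, 0.01, 0.11   # absolute conversion lift in pp (from the table above)
mpl = 0.05                                  # minimum practical lift in pp

se = (ci_high - ci_low) / (2 * 1.96)        # SE implied by the 95% interval
p_harm = norm.cdf((0.0 - point) / se)       # approx. P(true lift < 0): about 1%, under the 10% tolerance

print(f"P(lift < 0) ~= {p_harm:.1%}")
print(f"lower bound clears MPL: {ci_low >= mpl}")  # False -> iterate rather than ship to 100%
```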
Decision
Iterate (do not ship to 100% yet). The primary metric is positive and clears MPL on the point estimate (+0.06pp), but the lower bound of the interval (+0.01pp) suggests the true lift could be below the minimum practical threshold. Guardrails are within limits, so a limited rollout is feasible if the business values capturing upside while reducing uncertainty.
Recommendation and next actions
- Action: Roll out Variant B to 10–25% behind a feature flag with monitoring on purchase conversion, refund rate, and latency.
- Follow-up experiment: Run a second test with an improved version (e.g., clearer inline validation and optional-field disclosure) aimed at increasing the lower bound above MPL.
- Monitoring plan: 72-hour high-alert dashboard + 2-week stabilization review; alert if refund rate rises > +0.15pp or latency p95 rises > +40ms vs baseline.
- Longer-term validation: evaluate 30-day repeat purchase rate and support contacts for checkout issues to ensure quality is not degraded.
Appendix (what would be attached in the experiment doc)
- Counts by variant (eligible, exposed, analyzed) and exclusion reasons.
- Full metric computation notes and query references.
- Segment breakdowns labeled as exploratory (e.g., device type, traffic source) with caution against over-interpretation.
- Timeline of incidents and concurrent launches.