1) What “peeking” is and how it inflates false positives
Peeking means checking the experiment results (p-values, confidence intervals, “significance” badges, dashboards) before the planned end of the test and letting those interim looks influence decisions (stop, ship, roll back, change targeting, change metrics, etc.). The trap is not the act of looking—it’s acting as if each look were the only look.
Why it creates more false positives: if you repeatedly ask “Is it significant yet?” you are effectively running multiple tests on the same accumulating data. Even when there is no real effect, random noise will sometimes produce a “significant” result. With enough looks, you will eventually see a lucky fluctuation and may stop right then, mistakenly declaring a win.
Intuition with a simple thought experiment
Imagine flipping a fair coin, so the true effect is exactly zero. If you flip it in batches and after each batch ask "Is heads unusually high right now?", sometimes it will be, purely by chance. If you stop the first time it looks unusually high, you will systematically over-report "heads is better," even though nothing changed.
In A/B testing terms, repeated looks increase the chance that at least one look crosses your decision threshold by chance. This is why “We checked every morning and it became significant on day 6” is not the same as “We ran for 14 days and evaluated once at the end.”
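To see the inflation directly, here is a minimal Monte Carlo sketch (Python with NumPy and SciPy; the batch size, number of looks, and simulation count are arbitrary illustration values, not recommendations). It simulates a fair coin with zero true effect, runs a two-sided test after every daily batch, and counts how often at least one look crosses p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_SIMS = 2000    # simulated "experiments", all with zero true effect
N_LOOKS = 14     # one look per day for two weeks
BATCH = 500      # flips added between looks
ALPHA = 0.05

stopped_on_a_fluke = 0
for _ in range(N_SIMS):
    heads, flips = 0, 0
    for _ in range(N_LOOKS):
        heads += rng.binomial(BATCH, 0.5)        # fair coin: no real effect
        flips += BATCH
        # "Is heads unusually far from 50% right now?" (two-sided exact test)
        if stats.binomtest(heads, flips, 0.5).pvalue < ALPHA:
            stopped_on_a_fluke += 1              # we would have declared a win
            break

print(f"Nominal false-positive rate: {ALPHA:.0%}")
print(f"Rate with 14 peeks and stop-on-significance: {stopped_on_a_fluke / N_SIMS:.1%}")
# With these settings the second number typically lands well above 5%
# (on the order of 20%), even though nothing changed.
```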
2) Optional stopping explained with a concrete example timeline of daily checks
Optional stopping is a specific form of peeking: you do not commit to a fixed stopping time or sample size, and instead stop based on the data (e.g., stop when p<0.05, stop when the lift looks big enough, stop when it turns negative, stop when the boss asks).
Concrete timeline: daily checks that change the error rate
Suppose you planned “about two weeks,” but you also check every day and are tempted to stop early if it looks good.
- Day 1: Dashboard shows +8% lift, p=0.18. You continue.
- Day 2: +6% lift, p=0.11. You continue.
- Day 3: +9% lift, p=0.07. You continue.
- Day 4: +7% lift, p=0.049. You stop and ship.
This feels reasonable, but it is a rule change: you effectively ran “test every day until p<0.05.” Under no true effect, that rule produces more false positives than the nominal 5% you think you’re using, because you gave yourself multiple chances to get lucky.
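A rough back-of-the-envelope number shows the scale of the problem: if the 14 daily looks were independent tests at the 5% level, the chance of at least one false positive would be 1 - (1 - 0.05)^14 ≈ 0.51. In reality, consecutive looks share most of their data and are strongly correlated, so the true inflation is smaller than that, but simulations like the sketch in section 1 typically put it at several times the nominal 5% for two weeks of daily checks. Either way, the "p<0.05" badge on day 4 is not delivering the 5% error rate it advertises.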
Two additional ways optional stopping sneaks in
- Stopping for “promising” effect size: “Stop when lift is >5% and p<0.1.” This is still data-dependent stopping.
- Stopping for “no movement”: “Stop early because it’s flat.” If you stop early when results look unexciting, you bias your set of completed tests toward dramatic-looking outcomes (the boring ones get cut short).
3) Practical solutions: fixed-horizon testing, pre-set checkpoints, and sequential methods (conceptually)
A) Fixed-horizon testing (simplest operationally)
Fixed-horizon means you decide the evaluation time (or sample size) up front and make the primary decision once, at the end.
Step-by-step
- Step 1: Pre-commit runtime or sample size. Example: “Run 14 full days” or “Run until 200,000 users are exposed.”
- Step 2: Define the one primary decision point. Example: “Make the ship/no-ship call on Monday 10:00 after day 14 completes.”
- Step 3: Allow monitoring for safety only. You can watch guardrails (errors, latency, complaints) and stop for harm, but do not declare success early based on the primary metric.
- Step 4: At the end, compute the final decision using the planned rule. Treat earlier looks as informational, not confirmatory.
When fixed-horizon is a good fit: most product/marketing tests where you can tolerate waiting for the planned duration and where early shipping is not worth the statistical risk.
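As a sketch of what pre-commitment can look like in code (Python with SciPy; the baseline rate, minimum detectable effect, power, and end-of-test counts are illustrative assumptions): the sample size is fixed before launch, and the primary metric is tested exactly once.

```python
import math
from scipy.stats import norm

# --- Before launch: pre-commit the sample size ---------------------------
ALPHA = 0.05          # two-sided false-positive rate for the single final test
POWER = 0.80          # chance of detecting the minimum effect if it is real
P_BASELINE = 0.10     # assumed control conversion rate (illustrative)
MDE = 0.01            # minimum detectable absolute lift (10% -> 11%)

p1, p2 = P_BASELINE, P_BASELINE + MDE
z_alpha = norm.ppf(1 - ALPHA / 2)
z_beta = norm.ppf(POWER)
p_bar = (p1 + p2) / 2
n_per_arm = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
              + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / MDE ** 2
print(f"Pre-committed sample size: {math.ceil(n_per_arm)} users per arm")

# --- At the planned end: one confirmatory two-proportion z-test ----------
def final_ztest(conv_a, n_a, conv_b, n_b):
    """Single, pre-planned two-sided z-test on conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))   # two-sided p-value

# Example final call with made-up end-of-test counts:
p_value = final_ztest(conv_a=1480, n_a=14800, conv_b=1640, n_b=14800)
print(f"Final p-value (evaluated once): {p_value:.4f}")
print("Ship" if p_value < ALPHA else "No ship")
```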
B) Pre-set checkpoints (grouped looks with controlled decision rules)
If you must allow earlier decisions, use pre-set checkpoints: a small number of planned interim analyses with a pre-defined rule for what counts as “enough evidence” at each checkpoint.
Step-by-step
- Step 1: Choose checkpoints before launch. Example: after 25%, 50%, 75%, and 100% of planned sample.
- Step 2: Decide what actions are allowed at checkpoints. Example: “Stop early only for strong evidence of benefit” and “stop anytime for clear harm on guardrails.”
- Step 3: Use a method that accounts for multiple looks. Conceptually, you are “spending” your false-positive budget across looks, so early stopping requires stronger evidence than the final look.
- Step 4: Document outcomes at each checkpoint. Record whether you looked, what you saw, and why you continued/stopped.
Key idea: the more often you plan to look, the stricter the early thresholds must be to keep overall false positives under control.
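One deliberately simple way to implement "stricter early thresholds" is to split the 5% budget across the planned looks, Bonferroni-style, as in the sketch below (Python with SciPy; the spending split is an illustrative choice). Real group-sequential designs, such as O'Brien–Fleming or alpha-spending boundaries, account for the correlation between looks and are less conservative, but the structure is the same: small alpha slices early mean high evidence bars early.

```python
from scipy.stats import norm

TOTAL_ALPHA = 0.05
# Planned looks at 25%, 50%, 75%, and 100% of the committed sample.
# Spend a small slice of the budget early and keep most for the final look.
ALPHA_SPEND = [0.005, 0.005, 0.01, 0.03]     # sums to TOTAL_ALPHA
assert abs(sum(ALPHA_SPEND) - TOTAL_ALPHA) < 1e-12

# Convert each slice to a two-sided z threshold for that checkpoint.
# Because the slices simply sum to 5% (a union bound), the overall
# false-positive rate stays at or below 5% regardless of how the looks
# are correlated: conservative, but safe.
z_thresholds = [norm.ppf(1 - a / 2) for a in ALPHA_SPEND]

def checkpoint_decision(look_index: int, z_observed: float) -> str:
    """Pre-registered rule: stop for success only if this look's boundary is crossed."""
    if abs(z_observed) >= z_thresholds[look_index]:
        return "stop: boundary crossed"
    if look_index == len(z_thresholds) - 1:
        return "final look: no effect declared"
    return "continue to next checkpoint"

for i, z in enumerate(z_thresholds):
    print(f"Checkpoint {i + 1}: need |z| >= {z:.2f} "
          f"(alpha slice {ALPHA_SPEND[i]:.3f})")
# Note the early thresholds (about 2.8) are much stricter than the
# single-look 1.96.
```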
C) Sequential methods (conceptual overview, no heavy math)
Sequential methods are designed for continuous or frequent monitoring while maintaining valid error rates. Instead of pretending each look is independent, they explicitly model the fact that data arrives over time and you may stop early.
Conceptually, sequential approaches provide one of these structures:
- Boundaries: At any time, you can stop for success if evidence crosses an “upper boundary,” stop for harm if it crosses a “lower boundary,” otherwise continue. Early boundaries are typically harder to cross than later ones.
- Error “spending”: You allocate how much false-positive risk you are willing to spend at each look. Spending little early means you require very strong early evidence.
- Always-valid inference: Some approaches produce p-values or intervals that remain valid under continuous monitoring, so you can look often without inflating false positives (as long as you follow the method’s rules).
Operational takeaway: if your organization expects frequent monitoring and early stops, use tooling or statistical support that implements sequential decision rules rather than ad-hoc “stop when it’s significant.”
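To make the boundary idea concrete without heavy math, here is a sketch of Wald's sequential probability ratio test (SPRT) for a deliberately simplified setting: each exposed user converts or not, and a fixed baseline rate is tested against a fixed lifted rate. Production sequential tooling (mSPRT, group-sequential boundaries, always-valid confidence sequences) is more sophisticated, but the stop/continue structure is the same. The rates, error targets, and simulated data stream are illustrative assumptions.

```python
import math
import random

# Simple-vs-simple hypotheses about the treatment conversion rate (illustrative).
P0 = 0.10          # H0: conversion rate is at the baseline
P1 = 0.12          # H1: conversion rate is at the hoped-for lifted level
ALPHA = 0.05       # tolerated false-positive rate
BETA = 0.20        # tolerated false-negative rate

# Wald's boundaries on the accumulated log-likelihood ratio.
UPPER = math.log((1 - BETA) / ALPHA)   # cross this: stop, evidence for H1
LOWER = math.log(BETA / (1 - ALPHA))   # cross this: stop, evidence for H0

def sprt(outcomes):
    """Feed one user outcome (0 or 1) at a time; stop when a boundary is crossed."""
    llr = 0.0
    n = 0
    for x in outcomes:
        n += 1
        # Per-observation log-likelihood ratio of H1 vs H0.
        llr += x * math.log(P1 / P0) + (1 - x) * math.log((1 - P1) / (1 - P0))
        if llr >= UPPER:
            return n, "stop: evidence for the lifted rate"
        if llr <= LOWER:
            return n, "stop: evidence for the baseline rate"
    return n, "still between boundaries: keep collecting data"

# Illustrative stream of outcomes drawn from the baseline rate (no real lift).
random.seed(1)
stream = (1 if random.random() < P0 else 0 for _ in range(50_000))
n_used, decision = sprt(stream)
print(f"After {n_used} users: {decision}")
```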
4) How to communicate interim results responsibly (directional vs confirmatory)
Interim results are often necessary for coordination, but they must be framed correctly. A useful distinction:
- Directional (exploratory) interim read: “What direction is it trending? Any obvious issues?” This is for situational awareness, not for declaring a winner.
- Confirmatory decision: “Are we shipping based on a pre-specified rule that maintains valid error rates?” This is the official call.
Practical communication templates
Directional update (allowed even under fixed-horizon)
- “As of day 5/14, the primary metric is trending +1.2% (unstable; not decision-grade). We are not making a ship decision until the planned end date.”
- “No guardrail issues detected; error rate and latency are within thresholds.”
Checkpoint update (if you planned interim checkpoints)
- “Checkpoint 2/4 reached. Evidence does not meet the pre-set early-stop boundary; continuing to next checkpoint.”
- “Guardrail boundary crossed (harm). Stopping per protocol.”
What to avoid saying
- “It’s significant today, so we’re done” (unless you are using a sequential/checkpoint rule that permits it).
- “It was significant yesterday but not today” (a symptom of repeated looks without a plan; it confuses stakeholders and encourages cherry-picking).
- “Let’s extend it a few more days until it becomes significant” (classic optional stopping).
5) Logging and process controls that prevent accidental rule changes
Many peeking problems are not malicious; they come from ambiguous ownership, flexible dashboards, and undocumented decisions. Process controls reduce “silent” rule changes.
A) Pre-registration (lightweight, internal)
Create a short experiment plan document before launch and treat it as the source of truth. Include:
- Primary metric and decision rule
- Planned runtime/sample size
- Whether interim looks are allowed, and if so, when
- Guardrails and stop-for-harm criteria
- Any segmentation you will (and will not) use for decisions
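A lightweight plan needs no special tooling; a structured file checked in next to the experiment config is enough. The sketch below shows one possible shape (Python; the field names and values are illustrative placeholders, not a standard).

```python
# Illustrative pre-registration record, e.g. committed to the repo before launch.
# Field names and values are hypothetical examples, not a required schema.
EXPERIMENT_PLAN = {
    "experiment_id": "checkout-button-color-v2",
    "primary_metric": "checkout_conversion_rate",
    "decision_rule": "two-sided test at alpha=0.05 at the single final look",
    "planned_runtime_days": 14,
    "planned_exposures_per_arm": 200_000,
    "interim_looks": "none for the primary metric; guardrails monitored daily",
    "guardrails": ["error_rate", "p95_latency_ms", "support_tickets"],
    "stop_for_harm": "any guardrail breaches its pre-set threshold",
    "decision_segments": "all exposed users only; segment cuts are exploratory",
    "owner": "single named decision owner",
}
```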
B) Audit-friendly logging
Log every interim look and every decision. Minimum viable log fields:
- Timestamp of look
- Who looked / who approved actions
- Current sample size/exposures
- Snapshot of key metrics (primary + guardrails)
- Action taken (continue/stop/rollback) and reason
- Protocol deviations (if any) and justification
If your analytics stack supports it, store immutable “analysis snapshots” so later you can reproduce exactly what was seen at the time.
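Capturing these fields takes very little code. The sketch below (Python; field and file names are illustrative) mirrors the list above and appends one JSON line per look, which keeps the history easy to diff and replay later.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class InterimLookRecord:
    """One row per interim look or decision; append-only by convention."""
    experiment_id: str
    looked_by: str                  # who looked / who approved the action
    exposures_per_arm: dict         # e.g. {"control": 52_310, "treatment": 52_118}
    metric_snapshot: dict           # primary metric + guardrails at this moment
    action: str                     # "continue" | "stop" | "rollback"
    reason: str
    protocol_deviation: str = ""    # empty if none; otherwise what and why
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def append_look(record: InterimLookRecord, path: str = "experiment_audit_log.jsonl"):
    """Append one JSON line per look so the log is easy to diff and reproduce."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```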
C) Access and dashboard controls
- Separate dashboards: one for guardrails (safety monitoring) and one for decision metrics (locked until checkpoints/end).
- Role-based access: limit who can see confirmatory p-values or decision indicators mid-test.
- Change control: metric definitions, filters, and attribution windows should be versioned; changing them mid-test should require explicit approval and documentation.
D) Operational guardrails against “moving goalposts”
- Do not change the primary metric midstream. If you discover the metric is flawed, stop and restart with a new protocol.
- Do not add “just one more segment” as a decision basis midstream. If you explore segments, label it exploratory and plan a follow-up confirmatory test.
- Do not extend runtime solely because results are close to the threshold. Extensions must be pre-specified or handled via a sequential/checkpoint plan.
6) Checklist for maintaining valid error rates when multiple looks are unavoidable
Use this checklist when you expect repeated looks (daily monitoring, launch pressure, high-stakes changes) and still want statistically valid decisions.
| Item | What to do | Why it matters |
|---|---|---|
| Decide the monitoring model | Choose fixed-horizon, pre-set checkpoints, or a sequential method before launch | Prevents ad-hoc stopping rules that inflate false positives |
| Limit the number of planned looks | Keep interim analyses to a small, scheduled set (e.g., 2–4) | Fewer looks means less adjustment needed and less temptation to cherry-pick |
| Pre-define allowed actions at each look | Specify “stop for harm anytime,” “stop for win only if boundary met,” otherwise continue | Separates safety monitoring from success claims |
| Use appropriate decision thresholds for multiple looks | Adopt checkpoint/sequential boundaries rather than reusing the same threshold each time | Controls overall false-positive rate across looks |
| Lock primary metric and analysis settings | Freeze metric definition, attribution window, inclusion/exclusion rules | Avoids midstream rule changes that invalidate inference |
| Define who can stop the test | Assign a single decision owner (or small committee) and escalation path | Prevents inconsistent decisions driven by whoever checks the dashboard |
| Log every look and decision | Record timestamps, snapshots, and rationale; store reproducible outputs | Creates accountability and makes deviations visible |
| Label interim updates correctly | Use “directional” language until the confirmatory decision point | Prevents stakeholders from treating interim noise as proof |
| Plan what happens if results are ambiguous | Pre-specify: continue to next checkpoint, run full duration, or stop for futility (if supported) | Reduces temptation to extend/stop opportunistically |
| Handle protocol deviations explicitly | If you must deviate, document it and treat results as exploratory; consider rerun | Maintains credibility and avoids overconfident claims |