Common Traps in A/B Testing: Peeking, Optional Stopping, and Repeated Looks

Chapter 10

Estimated reading time: 9 minutes


1) What “peeking” is and how it inflates false positives

Peeking means checking the experiment results (p-values, confidence intervals, “significance” badges, dashboards) before the planned end of the test and letting those interim looks influence decisions (stop, ship, roll back, change targeting, change metrics, etc.). The trap is not the act of looking—it’s acting as if each look were the only look.

Why it creates more false positives: if you repeatedly ask “Is it significant yet?” you are effectively running multiple tests on the same accumulating data. Even when there is no real effect, random noise will sometimes produce a “significant” result. With enough looks, you will eventually see a lucky fluctuation and may stop right then, mistakenly declaring a win.

Intuition with a simple thought experiment

Imagine a fair coin, so the true deviation from 50% heads is exactly zero. If you flip it many times and after each batch you ask “Is heads unusually high right now?”, sometimes it will be. If you stop the first time it looks unusually high, you will systematically over-report “heads is better,” even though nothing changed.

In A/B testing terms, repeated looks increase the chance that at least one look crosses your decision threshold by chance. This is why “We checked every morning and it became significant on day 6” is not the same as “We ran for 14 days and evaluated once at the end.”
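
A minimal simulation of that coin thought experiment makes the inflation visible. The parameters below (14 batches, 200 flips per batch, a normal-approximation test) are illustrative choices, not part of the course material; the pattern is what matters.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def coin_experiment(batches=14, flips_per_batch=200):
    """Fair coin, so there is nothing to find. After each batch, ask:
    'Is heads unusually far from 50% so far?' (two-sided z-test on the running total)."""
    heads = flips = 0
    unusual_at_some_look = False
    for _ in range(batches):
        flips += flips_per_batch
        heads += rng.binomial(flips_per_batch, 0.5)
        z = (heads - 0.5 * flips) / np.sqrt(0.25 * flips)
        p = 2 * norm.sf(abs(z))
        if p < 0.05:
            unusual_at_some_look = True
    return p < 0.05, unusual_at_some_look  # (final look only, any of the 14 looks)

results = np.array([coin_experiment() for _ in range(10_000)])
print(f"Flagged at the single final look:    {results[:, 0].mean():.1%}")  # close to 5%
print(f"Flagged at some point while looking: {results[:, 1].mean():.1%}")  # well above 5%
```

Evaluating once at the end stays near the nominal 5%; asking “is it unusual yet?” at every batch does not.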

2) Optional stopping explained with a concrete example timeline of daily checks

Optional stopping is a specific form of peeking: you do not commit to a fixed stopping time or sample size, and instead stop based on the data (e.g., stop when p<0.05, stop when the lift looks big enough, stop when it turns negative, stop when the boss asks).


Concrete timeline: daily checks that change the error rate

Suppose you planned “about two weeks,” but you also check every day and are tempted to stop early if it looks good.

  • Day 1: Dashboard shows +8% lift, p=0.18. You continue.
  • Day 2: +6% lift, p=0.11. You continue.
  • Day 3: +9% lift, p=0.07. You continue.
  • Day 4: +7% lift, p=0.049. You stop and ship.

This feels reasonable, but it is a rule change: you effectively ran “test every day until p<0.05.” Under no true effect, that rule produces more false positives than the nominal 5% you think you’re using, because you gave yourself multiple chances to get lucky.
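
Here is a sketch of that exact rule under zero true effect, using illustrative numbers (5% baseline conversion, 2,000 users per arm per day, 14 days) and a pooled two-proportion z-test; none of these details come from the course text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def daily_stop_experiment(users_per_day=2_000, days=14, base_rate=0.05):
    """'Check every day, stop and ship at the first p < 0.05', with zero true effect.
    Returns the observed rate difference (B minus A) at ship time, or None if never shipped."""
    ca = cb = na = nb = 0
    for _ in range(days):
        na += users_per_day
        nb += users_per_day
        ca += rng.binomial(users_per_day, base_rate)
        cb += rng.binomial(users_per_day, base_rate)
        pooled = (ca + cb) / (na + nb)
        se = np.sqrt(pooled * (1 - pooled) * (1 / na + 1 / nb))
        z = (cb / nb - ca / na) / se if se > 0 else 0.0
        if 2 * stats.norm.sf(abs(z)) < 0.05:
            return cb / nb - ca / na  # the "lift" the team would report when shipping
    return None

diffs = [daily_stop_experiment() for _ in range(2_000)]
shipped = [d for d in diffs if d is not None]
print(f"Zero-effect changes shipped early: {len(shipped) / len(diffs):.1%}")
print(f"Average |rate difference| reported at ship time: {np.mean(np.abs(shipped)):.2%}")
```

Not only does the rule ship far more than 5% of no-effect changes, the lift reported at the moment of shipping is systematically exaggerated, because you stopped precisely when noise peaked.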

Two additional ways optional stopping sneaks in

  • Stopping for “promising” effect size: “Stop when lift is >5% and p<0.1.” This is still data-dependent stopping.
  • Stopping for “no movement”: “Stop early because it’s flat.” If you stop early when results look unexciting, you bias your set of completed tests toward dramatic-looking outcomes (the boring ones get cut short).

3) Practical solutions: fixed-horizon testing, pre-set checkpoints, and sequential methods (conceptually)

A) Fixed-horizon testing (simplest operationally)

Fixed-horizon means you decide the evaluation time (or sample size) up front and make the primary decision once, at the end.

Step-by-step

  • Step 1: Pre-commit runtime or sample size. Example: “Run 14 full days” or “Run until 200,000 users are exposed.”
  • Step 2: Define the one primary decision point. Example: “Make the ship/no-ship call on Monday 10:00 after day 14 completes.”
  • Step 3: Allow monitoring for safety only. You can watch guardrails (errors, latency, complaints) and stop for harm, but do not declare success early based on the primary metric.
  • Step 4: At the end, compute the final decision using the planned rule. Treat earlier looks as informational, not confirmatory.
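
As a sketch of what Step 4’s single confirmatory analysis might look like, here is a pooled two-proportion z-test run exactly once at the planned end. The counts are made up, and your stack may use a different test; the point is one pre-committed evaluation.

```python
from math import sqrt
from scipy.stats import norm

def fixed_horizon_decision(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """One confirmatory two-proportion z-test, run once after the planned runtime/sample size."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p = 2 * norm.sf(abs(z))
    ship = p < alpha and rate_b > rate_a
    return {"lift": rate_b - rate_a, "p_value": p, "ship": ship}

# Example: planned 14-day horizon reached, 100,000 users per arm (hypothetical counts)
print(fixed_horizon_decision(conv_a=5_000, n_a=100_000, conv_b=5_250, n_b=100_000))
```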

When fixed-horizon is a good fit: most product/marketing tests where you can tolerate waiting for the planned duration and where early shipping is not worth the statistical risk.

B) Pre-set checkpoints (grouped looks with controlled decision rules)

If you must allow earlier decisions, use pre-set checkpoints: a small number of planned interim analyses with a pre-defined rule for what counts as “enough evidence” at each checkpoint.

Step-by-step

  • Step 1: Choose checkpoints before launch. Example: after 25%, 50%, 75%, and 100% of planned sample.
  • Step 2: Decide what actions are allowed at checkpoints. Example: “Stop early only for strong evidence of benefit” and “stop anytime for clear harm on guardrails.”
  • Step 3: Use a method that accounts for multiple looks. Conceptually, you are “spending” your false-positive budget across looks, so early stopping requires stronger evidence than the final look.
  • Step 4: Document outcomes at each checkpoint. Record whether you looked, what you saw, and why you continued/stopped.

Key idea: the more often you plan to look, the stricter the early thresholds must be to keep overall false positives under control.
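
One way to see this is to simulate the z-statistics at equally spaced checkpoints under no true effect and calibrate a single, stricter per-look boundary (a Pocock-style constant threshold). This is a conceptual sketch, not a substitute for a proper group-sequential design tool.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Under no true effect, z-statistics at K equally spaced checkpoints behave like a
# cumulative sum of independent standard normals rescaled by sqrt(k).
K, sims = 4, 100_000
increments = rng.standard_normal((sims, K))
z_at_look = increments.cumsum(axis=1) / np.sqrt(np.arange(1, K + 1))
max_abs_z = np.abs(z_at_look).max(axis=1)

# Naive rule: reuse the single-look threshold (|z| > 1.96, i.e. p < 0.05) at every look
naive = np.mean(max_abs_z > norm.ppf(0.975))
print(f"Overall false-positive rate, naive 0.05 at each of {K} looks: {naive:.1%}")

# Pocock-style fix: one stricter constant threshold, calibrated so the overall rate is 5%
z_boundary = np.quantile(max_abs_z, 0.95)
print(f"Constant per-look |z| boundary for 5% overall: {z_boundary:.2f} "
      f"(per-look p < {2 * norm.sf(z_boundary):.4f})")
```

With four looks, the naive rule spends well over 5% in total; the calibrated boundary demands a noticeably smaller p-value at each look to stay within budget.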

C) Sequential methods (conceptual overview, no heavy math)

Sequential methods are designed for continuous or frequent monitoring while maintaining valid error rates. Instead of pretending each look is independent, they explicitly model the fact that data arrives over time and you may stop early.

Conceptually, sequential approaches provide one of these structures:

  • Boundaries: At any time, you can stop for success if evidence crosses an “upper boundary,” stop for harm if it crosses a “lower boundary,” otherwise continue. Early boundaries are typically harder to cross than later ones.
  • Error “spending”: You allocate how much false-positive risk you are willing to spend at each look. Spending little early means you require very strong early evidence.
  • Always-valid inference: Some approaches produce p-values or intervals that remain valid under continuous monitoring, so you can look often without inflating false positives (as long as you follow the method’s rules).

Operational takeaway: if your organization expects frequent monitoring and early stops, use tooling or statistical support that implements sequential decision rules rather than ad-hoc “stop when it’s significant.”
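
As a toy illustration of the boundary idea, here is Wald’s sequential probability ratio test (SPRT) for a single conversion rate with two simple hypotheses. Real A/B tooling typically applies group-sequential boundaries or always-valid confidence sequences to the difference between arms; the rates and error targets below are illustrative assumptions.

```python
import math
import random

random.seed(3)

def sprt(stream, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Wald's SPRT for a single conversion rate: H0 p=p0 vs H1 p=p1.
    Returns ('accept H1' | 'accept H0' | 'no decision', observations used)."""
    upper = math.log((1 - beta) / alpha)  # crossing above: stop, evidence for H1
    lower = math.log(beta / (1 - alpha))  # crossing below: stop, evidence for H0
    llr = 0.0
    for i, converted in enumerate(stream, start=1):
        # Add this observation's log-likelihood ratio log(P(x | p1) / P(x | p0))
        llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "no decision", i

# Simulated stream whose true conversion rate really is 6% (matches H1)
stream = (random.random() < 0.06 for _ in range(200_000))
print(sprt(stream))
```

The test can be checked after every single observation without inflating error rates, because the boundaries themselves encode the planned false-positive and false-negative budgets.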

4) How to communicate interim results responsibly (directional vs confirmatory)

Interim results are often necessary for coordination, but they must be framed correctly. A useful distinction:

  • Directional (exploratory) interim read: “What direction is it trending? Any obvious issues?” This is for situational awareness, not for declaring a winner.
  • Confirmatory decision: “Are we shipping based on a pre-specified rule that maintains valid error rates?” This is the official call.

Practical communication templates

Directional update (allowed even under fixed-horizon)

  • “As of day 5/14, the primary metric is trending +1.2% (unstable; not decision-grade). We are not making a ship decision until the planned end date.”
  • “No guardrail issues detected; error rate and latency are within thresholds.”

Checkpoint update (if you planned interim checkpoints)

  • “Checkpoint 2/4 reached. Evidence does not meet the pre-set early-stop boundary; continuing to next checkpoint.”
  • “Guardrail boundary crossed (harm). Stopping per protocol.”

What to avoid saying

  • “It’s significant today, so we’re done” (unless you are using a sequential/checkpoint rule that permits it).
  • “It was significant yesterday but not today” (a symptom of repeated looks without a plan; it confuses stakeholders and encourages cherry-picking).
  • “Let’s extend it a few more days until it becomes significant” (classic optional stopping).

5) Logging and process controls that prevent accidental rule changes

Many peeking problems are not malicious; they come from ambiguous ownership, flexible dashboards, and undocumented decisions. Process controls reduce “silent” rule changes.

A) Pre-registration (lightweight, internal)

Create a short experiment plan document before launch and treat it as the source of truth. Include:

  • Primary metric and decision rule
  • Planned runtime/sample size
  • Whether interim looks are allowed, and if so, when
  • Guardrails and stop-for-harm criteria
  • Any segmentation you will (and will not) use for decisions

B) Audit-friendly logging

Log every interim look and every decision. Minimum viable log fields:

  • Timestamp of look
  • Who looked / who approved actions
  • Current sample size/exposures
  • Snapshot of key metrics (primary + guardrails)
  • Action taken (continue/stop/rollback) and reason
  • Protocol deviations (if any) and justification

If your analytics stack supports it, store immutable “analysis snapshots” so later you can reproduce exactly what was seen at the time.
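
A minimal sketch of such a log, written as an append-only JSON-lines file; the field names, file path, and example values are illustrative, not a standard schema.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class LookRecord:
    """One interim look, covering the minimum viable log fields listed above."""
    experiment_id: str
    looked_by: str
    approved_by: str
    exposures_a: int
    exposures_b: int
    metrics_snapshot: dict        # primary metric + guardrails as seen at the time
    action: str                   # "continue" | "stop" | "rollback"
    reason: str
    protocol_deviation: str = ""  # empty if none
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def append_look(record: LookRecord, path: str = "experiment_looks.jsonl") -> None:
    """Append-only JSON-lines log: past looks are never edited, only added to."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

append_look(LookRecord(
    experiment_id="checkout_copy_v2",
    looked_by="analyst@example.com",
    approved_by="experiment_owner@example.com",
    exposures_a=41_203, exposures_b=41_377,
    metrics_snapshot={"conversion_lift": 0.012, "latency_p95_ms": 412, "error_rate": 0.002},
    action="continue",
    reason="Directional only; checkpoint 2 of 4 not yet reached",
))
```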

C) Access and dashboard controls

  • Separate dashboards: one for guardrails (safety monitoring) and one for decision metrics (locked until checkpoints/end).
  • Role-based access: limit who can see confirmatory p-values or decision indicators mid-test.
  • Change control: metric definitions, filters, and attribution windows should be versioned; changing them mid-test should require explicit approval and documentation.

D) Operational guardrails against “moving goalposts”

  • Do not change the primary metric midstream. If you discover the metric is flawed, stop and restart with a new protocol.
  • Do not add “just one more segment” as a decision basis midstream. If you explore segments, label it exploratory and plan a follow-up confirmatory test.
  • Do not extend runtime solely because results are close to the threshold. Extensions must be pre-specified or handled via a sequential/checkpoint plan.

6) Checklist for maintaining valid error rates when multiple looks are unavoidable

Use this checklist when you expect repeated looks (daily monitoring, launch pressure, high-stakes changes) and still want statistically valid decisions.

  • Decide the monitoring model. What to do: choose fixed-horizon, pre-set checkpoints, or a sequential method before launch. Why it matters: prevents ad-hoc stopping rules that inflate false positives.
  • Limit the number of planned looks. What to do: keep interim analyses to a small, scheduled set (e.g., 2–4). Why it matters: fewer looks means less adjustment needed and less temptation to cherry-pick.
  • Pre-define allowed actions at each look. What to do: specify “stop for harm anytime,” “stop for a win only if the boundary is met,” otherwise continue. Why it matters: separates safety monitoring from success claims.
  • Use appropriate decision thresholds for multiple looks. What to do: adopt checkpoint/sequential boundaries rather than reusing the same threshold each time. Why it matters: controls the overall false-positive rate across looks.
  • Lock the primary metric and analysis settings. What to do: freeze the metric definition, attribution window, and inclusion/exclusion rules. Why it matters: avoids midstream rule changes that invalidate inference.
  • Define who can stop the test. What to do: assign a single decision owner (or small committee) and an escalation path. Why it matters: prevents inconsistent decisions driven by whoever checks the dashboard.
  • Log every look and decision. What to do: record timestamps, snapshots, and rationale; store reproducible outputs. Why it matters: creates accountability and makes deviations visible.
  • Label interim updates correctly. What to do: use “directional” language until the confirmatory decision point. Why it matters: prevents stakeholders from treating interim noise as proof.
  • Plan what happens if results are ambiguous. What to do: pre-specify whether to continue to the next checkpoint, run the full duration, or stop for futility (if supported). Why it matters: reduces the temptation to extend or stop opportunistically.
  • Handle protocol deviations explicitly. What to do: if you must deviate, document it and treat the results as exploratory; consider a rerun. Why it matters: maintains credibility and avoids overconfident claims.

Now answer the exercise about the content:

Why does stopping an A/B test the first time the p-value drops below 0.05 during daily checks increase the false-positive rate?


Answer: Repeated interim looks are like running multiple tests on the same accumulating data. If you stop the first time p<0.05, you give yourself many chances to catch a lucky fluctuation, which increases false positives.

Next chapter

Novelty Effects, Seasonality, and Interference: When Results Don’t Generalize
