Metric Misuse and Multiple Comparisons in A/B Testing: Avoiding False Discoveries

Chapter 12

Estimated reading time: 9 minutes

1) Why looking at many metrics increases the chance of finding a “significant” result by luck

When you evaluate many metrics (or many cuts of the same metric), you are effectively running many statistical checks. Even if the variant has no real effect, random noise will occasionally produce an extreme-looking result. The more places you look, the more likely you are to find something that “pops.” This is the multiple comparisons problem (also called multiplicity).

Family-wise error: how false discoveries accumulate

If you use a 5% threshold for each metric, then each metric has a 5% chance of looking “significant” purely by chance when there is no true effect. Across many metrics, the chance that at least one looks significant grows quickly.

A useful approximation (assuming independence) is:

P(at least one false positive) = 1 - (1 - α)^m

Where α is the per-metric false positive rate (e.g., 0.05) and m is the number of metrics tested.

Number of metrics (m)    α per metric    Chance of ≥1 false positive
1                        0.05            5.0%
5                        0.05            22.6%
10                       0.05            40.1%
20                       0.05            64.2%
50                       0.05            92.3%
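
To see how quickly this compounds, the table values can be reproduced with a few lines of Python. This is a minimal sketch of the independence approximation above; only the m values and α = 0.05 come from the table.

```python
# Probability of at least one false positive across m independent tests,
# each evaluated at per-metric significance level alpha.
def family_wise_error(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

for m in [1, 5, 10, 20, 50]:
    print(f"m = {m:2d}: P(at least one false positive) = {family_wise_error(m):.1%}")
# m =  1: P(at least one false positive) = 5.0%
# ...
# m = 50: P(at least one false positive) = 92.3%
```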

In real dashboards, metrics are often correlated (e.g., clicks and sessions), so the exact numbers differ, but the core risk remains: broad searching increases the odds of a misleading “win.”

What “metric misuse” looks like in practice

  • Metric shopping: “The primary metric didn’t move, but this other metric improved—let’s call it a win.”
  • Over-instrumentation: tracking dozens of KPIs and treating any green arrow as evidence.
  • Redefining success after seeing results: changing the goalpost to match what happened.

2) Primary metric discipline and pre-committing to success criteria

The most effective defense against false discoveries is discipline: decide what “success” means before the experiment runs, and evaluate the test primarily on that basis.

What to pre-commit (a practical checklist)

  • Primary metric: the single metric that determines the decision (ship/iterate/stop).
  • Directionality: is improvement one-sided (only increases matter) or two-sided?
  • Minimum practical effect: the smallest change worth acting on (business-relevant threshold).
  • Decision rule: what must be true to ship (e.g., primary metric improves and guardrails are not harmed beyond tolerance).
  • Population: who is included (new users only? all traffic?) and how exclusions are handled.
  • Runtime / stopping plan: how long the test runs and what constitutes a valid readout.

Write it down as a “success contract”

Use a short, explicit statement that can be reviewed later:

We will ship Variant B if: (1) Primary metric = checkout conversion rate improves by ≥0.5% relative, (2) payment failure rate does not increase by >0.1 pp, (3) results are evaluated at the planned end date on the full eligible population.

This is less about bureaucracy and more about preventing hindsight-driven decisions.
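
If you want the contract to be auditable, it can also live as a tiny piece of code next to the experiment configuration. The sketch below encodes the example contract above; the field names, function, and thresholds are illustrative placeholders, not a standard API.

```python
from dataclasses import dataclass

@dataclass
class SuccessContract:
    primary_metric: str                # single decision metric
    min_relative_lift: float           # smallest change worth acting on
    guardrail_metric: str              # "must not degrade" constraint
    max_guardrail_increase_pp: float   # tolerance in percentage points

def ship_decision(contract: SuccessContract,
                  primary_lift: float,
                  guardrail_change_pp: float) -> bool:
    """Return True only if the pre-committed criteria are met."""
    primary_ok = primary_lift >= contract.min_relative_lift
    guardrail_ok = guardrail_change_pp <= contract.max_guardrail_increase_pp
    return primary_ok and guardrail_ok

contract = SuccessContract(
    primary_metric="checkout_conversion_rate",
    min_relative_lift=0.005,           # >= 0.5% relative improvement
    guardrail_metric="payment_failure_rate",
    max_guardrail_increase_pp=0.1,     # <= 0.1 pp increase tolerated
)
print(ship_decision(contract, primary_lift=0.007, guardrail_change_pp=0.05))  # True
```

Writing the rule as a function forces you to name the thresholds before the data arrive, which is exactly the point of the contract.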

3) Secondary metrics as supportive evidence, not decision drivers, unless controlled

Secondary metrics are valuable for understanding how a change works and for detecting unintended consequences. The problem arises when they quietly become alternative “ways to win.”

How to use secondary metrics correctly

  • Interpretation: treat them as context and diagnostics, not as the final decision lever.
  • Consistency checks: ask whether secondary movements align with the expected mechanism (e.g., if signups rise, do activation steps also rise?).
  • Guardrails: some secondary metrics are actually constraints (e.g., latency, unsubscribe rate). These can be pre-specified as “must not degrade” thresholds.

When secondary metrics can drive decisions

If you truly want multiple metrics to be decision-relevant, you need to control the overall false discovery risk. Common approaches include:

  • Bonferroni-style control: allocate a stricter threshold per metric (simple, conservative).
  • False discovery rate (FDR) control: control the expected proportion of false positives among the metrics you call significant (often more practical for many metrics).

Even without diving into formulas, the operational takeaway is: if multiple metrics can trigger “ship,” you must adjust your statistical criteria accordingly, or you will ship noise more often than you think.
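
For teams that do want to see it concretely, the sketch below applies a Bonferroni correction and the Benjamini-Hochberg (FDR) step-up procedure to a list of hypothetical p-values. The p-values are made up for the example; only the two correction ideas come from the list above.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only for p-values at or below alpha / m (family-wise control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank          # keep the largest rank that passes
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[i] = True
    return reject

# Hypothetical p-values for 6 decision metrics
p_values = [0.001, 0.012, 0.030, 0.045, 0.20, 0.60]
print(bonferroni(p_values))          # [True, False, False, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, False, False, False, False]
```

In practice you rarely need to hand-roll these; libraries such as statsmodels expose standard corrections (for example, statsmodels.stats.multitest.multipletests). The point of the sketch is the behavior: Bonferroni is stricter, BH admits more discoveries while still bounding the expected share of false ones.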

4) Segment analysis pitfalls: cherry-picking and post-hoc stories

Segment analysis (by device, country, acquisition channel, user tenure, etc.) is a common source of false discoveries because it multiplies the number of comparisons and invites storytelling.

Why segments are especially risky

  • Many cuts: 10 segments × 10 metrics = 100 opportunities for something to look exciting.
  • Smaller samples: segments often have fewer users, increasing variability and making extreme results more likely.
  • Human bias: once you see a surprising segment result, it’s easy to rationalize it (“Of course Android users loved it”).
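
A quick simulation makes the “many cuts” point concrete. The sketch below runs a null experiment (no true effect anywhere) across 10 segments × 10 metrics and counts how many cells clear a nominal 5% two-sided threshold; the sample size and seed are arbitrary choices for illustration.

```python
import random

random.seed(7)

def looks_significant(n=2000):
    """Simulate one segment-metric comparison where the true effect is zero.
    Returns True if the approximate z-statistic exceeds 1.96 in absolute value."""
    # Sum of n random +/-1 steps, scaled by sqrt(n), is approximately N(0, 1).
    z = sum(random.choice([-1, 1]) for _ in range(n)) / (n ** 0.5)
    return abs(z) > 1.96

segments, metrics = 10, 10
false_positives = sum(looks_significant() for _ in range(segments * metrics))
print(f"'Significant' cells out of {segments * metrics}: {false_positives}")
# Typically around 5 cells look significant even though nothing changed.
```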

Common segment mistakes

  • Cherry-picking: reporting only the best-looking segment and ignoring others.
  • Post-hoc segmentation: inventing a segment after seeing the data (e.g., “users who clicked the banner” when the banner is affected by the variant).
  • Conditioning on outcomes: defining segments using behaviors influenced by the treatment (can create misleading comparisons).

Pre-specify segments vs exploratory segments

Segments fall into two categories:

  • Pre-specified segments: chosen before the test because they matter strategically (e.g., new vs returning users). These can be analyzed with a plan and appropriate error control.
  • Exploratory segments: discovered during analysis. These are hypothesis-generating and should be treated as leads, not proof.

5) Practical approaches: limit hypotheses, interpret exploratory findings cautiously, and use validation tests

Approach A: Limit the number of hypotheses you test

Reduce multiplicity at the source by narrowing what you plan to learn in one experiment.

  • Step 1: Define one primary metric and a small set of guardrails.
  • Step 2: List secondary metrics explicitly and label them “diagnostic.”
  • Step 3: Pre-specify only the segments you are willing to act on.
  • Step 4: If you have many questions, split them across experiments or run follow-up tests.

Approach B: Use a tiered decision framework

Make it hard for a single noisy metric to drive a decision.

  • Tier 1 (Decision): primary metric + guardrails.
  • Tier 2 (Mechanism): a few secondary metrics that should move if the effect is real.
  • Tier 3 (Exploration): everything else; used to generate hypotheses.
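
One way to keep the tiers explicit is to record them alongside the experiment so reviewers (or tooling) can check which metrics are allowed to drive the call. A minimal sketch with placeholder metric names:

```python
# Illustrative tier map: only Tier 1 metrics can trigger a ship decision.
METRIC_TIERS = {
    "decision":    ["purchase_conversion", "refund_rate"],        # Tier 1: primary + guardrails
    "mechanism":   ["add_to_cart_rate", "signup_click_through"],  # Tier 2: should move if effect is real
    "exploration": ["time_on_site", "help_center_visits"],        # Tier 3: hypothesis generation only
}

def can_drive_decision(metric: str) -> bool:
    return metric in METRIC_TIERS["decision"]

print(can_drive_decision("time_on_site"))  # False: Tier 3, exploratory only
```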

Approach C: Treat exploratory findings as “leads” and validate them

If you discover something interesting post-hoc, the safest path is to validate with a new test designed around that hypothesis.

  • Step 1: Write the discovered claim as a clear hypothesis (metric + segment + direction).
  • Step 2: Design a follow-up experiment where that hypothesis is primary (or at least pre-specified).
  • Step 3: Run it on fresh data (new time window and/or new users), keeping the analysis plan fixed.
  • Step 4: Compare whether the effect replicates in direction and approximate magnitude.
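
Step 4 can be as simple as a sanity check on sign and size. The helper below is a rough sketch; the 50% tolerance on magnitude is an arbitrary illustration, not a standard rule.

```python
def replicates(original_lift: float, followup_lift: float,
               magnitude_tolerance: float = 0.5) -> bool:
    """Return True if the follow-up effect has the same sign as the original
    and its magnitude is within magnitude_tolerance (relative) of the original."""
    if original_lift == 0:
        return False
    same_direction = (original_lift > 0) == (followup_lift > 0)
    close_enough = abs(followup_lift - original_lift) <= magnitude_tolerance * abs(original_lift)
    return same_direction and close_enough

print(replicates(original_lift=0.048, followup_lift=0.031))   # True: same sign, within tolerance
print(replicates(original_lift=0.048, followup_lift=-0.010))  # False: direction flipped
```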

Approach D: Control multiplicity when you must act on many metrics

Sometimes you genuinely need to evaluate multiple outcomes (e.g., marketplace balance, multi-step funnels, safety metrics). In those cases:

  • Step 1: Define the “family” of metrics that will be used for decisions (not every dashboard metric).
  • Step 2: Choose an error-control method appropriate to the situation (conservative family-wise control vs FDR).
  • Step 3: Document which metrics are in-scope for adjusted inference and which are descriptive only.

6) Example: a dashboard with many metrics where one looks great—how to decide what to trust

Imagine an experiment for a new homepage layout. Your dashboard shows 18 metrics and 12 segments. At the end of the run, the primary metric is flat, but one metric looks amazing.

The dashboard snapshot (simplified)

Metric                          Observed change                Notes
Primary: purchase conversion    +0.1% (not compelling)         Decision metric
Guardrail: refund rate          +0.0%                          No concern
Secondary: add-to-cart rate     +0.3%                          Small
Secondary: email signups        +4.8% (looks “significant”)    Big winner on the dashboard
Secondary: time on site         +1.2%                          Ambiguous value
Other: help-center visits       +6.0%                          Could be bad

The temptation: declare success because “email signups are up 4.8%.” Here is a disciplined way to evaluate it.

Step-by-step decision process

Step 1: Start with the pre-committed decision rule

Ask: did the primary metric meet the pre-committed success criteria? If not, the default decision should be “do not ship as a win,” even if something else looks good.

Step 2: Check whether the “great” metric was allowed to drive decisions

If email signups were not pre-specified as a decision metric (or part of a controlled multi-metric plan), treat it as supportive evidence only. Otherwise you are effectively changing the rules after seeing the results.

Step 3: Count how many chances you gave yourself to find a winner

You looked at 18 metrics. Even if nothing truly changed, it is plausible that one metric would look unusually strong. This doesn’t prove the signup lift is false, but it lowers your confidence that it’s real without additional support.
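
By the earlier approximation, 18 metrics each checked at α = 0.05 give roughly a 1 − 0.95^18 ≈ 60% chance that at least one looks “significant” even when nothing truly changed.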

Step 4: Look for mechanism consistency

Ask whether related metrics moved in a coherent way:

  • If email signups rose due to better value communication, did you also see improvements in click-through to the signup module?
  • Did downstream engagement from those signups improve (e.g., confirmation rate, first email open), or is it just more low-intent signups?
  • Did help-center visits increase because the new layout confused users (which might also inflate time on site)?

A single isolated spike with no supporting pattern is more suspicious than a cluster of aligned movements.

Step 5: Evaluate segments carefully (avoid post-hoc stories)

Suppose the dashboard also shows: “Email signups +12% on iOS, flat elsewhere.” Before telling a story:

  • Was iOS a pre-specified segment you planned to act on?
  • How many segments did you scan to find that one?
  • Is the iOS sample size large enough to be stable?
  • Is the segment definition independent of the treatment (not based on behavior affected by the variant)?

If the iOS result was discovered after scanning many segments, treat it as exploratory.

Step 6: Decide what action is justified right now

Given the primary metric is flat, you typically have three reasonable options:

  • Do not ship; log the signup lift as exploratory: keep the result as a hypothesis for a targeted follow-up.
  • Ship only if signups are a pre-declared business priority and controlled: requires that signups were part of the success contract (or you used a multi-metric control plan).
  • Iterate and retest: if the layout seems promising but the primary metric didn’t move, adjust the design and run a new experiment with clearer hypotheses.

Step 7: Validate the “great” metric with a follow-up test

Turn the dashboard surprise into a clean experiment:

  • New primary metric: email signup conversion (if that’s the real goal).
  • Guardrails: purchase conversion, help-center visits, unsubscribe rate, complaint rate (as relevant).
  • Pre-specified segments: e.g., iOS vs Android only if you will act differently by platform.
  • Success criteria: set a minimum practical lift and a fixed decision rule.

If the signup lift replicates under a pre-committed plan, it becomes much more trustworthy than a one-off dashboard highlight.
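
When you design that follow-up, it helps to size it up front so the readout is interpretable. The sketch below uses the standard two-proportion sample-size approximation; the baseline rate, minimum detectable lift, significance level, and power are placeholder values to adapt to your context.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline: float, min_lift_rel: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm needed to detect a relative lift in a
    conversion rate with a two-sided test on two proportions."""
    p1 = baseline
    p2 = baseline * (1 + min_lift_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2) * variance / (p2 - p1) ** 2
    return ceil(n)

# Placeholder: 3% baseline signup rate, detect a 5% relative lift with 80% power.
print(sample_size_per_arm(baseline=0.03, min_lift_rel=0.05))
```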

Now answer the exercise about the content:

In an A/B test, the primary metric is flat but a secondary metric shows a large “significant” lift after scanning many metrics. What is the most appropriate interpretation and next step?

Answer: Scanning many metrics increases the chance of false positives. If the secondary metric wasn’t pre-committed as decision-relevant (with error control), it should not override the primary decision rule; validate the finding in a follow-up test with a fixed plan.
