Metrics for A/B Testing: Choosing Outcomes That Match Product and Marketing Goals

Chapter 2

Estimated reading time: 10 minutes


1) Primary vs Secondary Metrics (and Guardrails)

In an A/B test, metrics are not just measurements—they are the decision language of the experiment. A good metric is measurable (you can reliably compute it), interpretable (a change has a clear meaning), and decision-relevant (it maps to a real product or marketing choice).

Primary metric

The primary metric is the single outcome you will use to decide whether to ship the variant. It should be the metric closest to the business objective and the least ambiguous. Whenever possible, commit to exactly one primary metric per test so results cannot be cherry-picked after the fact.

  • Product example: “Activation rate” for a new onboarding flow.
  • Marketing example: “Trial start rate” for a landing page experiment.

Secondary (diagnostic) metrics

Secondary metrics help you understand why the primary metric moved (or didn’t). They are not the main decision criterion, but they provide interpretation and debugging.

  • Example: If trial start rate increases, secondary metrics might include “CTA click-through rate,” “form completion rate,” and “payment step drop-off.”

Guardrail metrics

Guardrails are metrics you monitor to prevent harmful tradeoffs. They define “do no harm” boundaries. Even if a variant improves the primary metric, it can be rejected when it violates a guardrail.

  • Product guardrails: crash rate, latency, customer support contacts per user, refund rate.
  • Marketing guardrails: unsubscribe rate, spam complaint rate, lead quality score, chargeback rate.

Step-by-step: selecting a metric set

  1. Write the decision you want to make (e.g., “Ship new pricing page layout?”).
  2. Choose one primary metric that best represents success for that decision.
  3. Add 3–8 secondary metrics that represent the funnel steps or mechanisms.
  4. Add 1–5 guardrails that capture user harm, quality, or operational risk.
  5. Predefine thresholds for guardrails (e.g., “refund rate must not increase by >0.2pp”).
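
To make step 5 concrete, here is a minimal Python sketch of a guardrail check. The metric names and threshold values are hypothetical examples, not recommendations.

# Sketch: predefine guardrail thresholds, then check them after the test.
# Metric names and thresholds below are hypothetical.
GUARDRAIL_THRESHOLDS = {
    "refund_rate": 0.002,        # absolute increase must not exceed 0.2pp
    "unsubscribe_rate": 0.001,   # absolute increase must not exceed 0.1pp
}

def guardrail_violations(control_metrics, variant_metrics):
    """Return guardrails whose absolute increase exceeds the predefined threshold."""
    violations = []
    for name, max_increase in GUARDRAIL_THRESHOLDS.items():
        delta = variant_metrics[name] - control_metrics[name]
        if delta > max_increase:
            violations.append((name, delta))
    return violations

# A variant can be rejected on these grounds even if its primary metric improved.
control = {"refund_rate": 0.010, "unsubscribe_rate": 0.004}
variant = {"refund_rate": 0.013, "unsubscribe_rate": 0.004}
print(guardrail_violations(control, variant))  # [("refund_rate", 0.003)]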

2) Metric Definitions and Event Instrumentation Rules

Most A/B test failures are not statistical—they are measurement failures. A metric must be defined so precisely that two analysts independently implementing it would get the same result.


Define the unit: unique users vs events

Decide whether the metric is computed per user, per session, or per event. In experiments, user-level metrics are often preferred because randomization is typically at the user level.

  • User-level: “Did the user convert at least once?” (0/1 per user)
  • Event-level: “How many purchases occurred?” (can be multiple per user)

Rule of thumb: If users can trigger many events, event-level metrics can be dominated by a small number of heavy users. Consider user-level aggregation (e.g., purchases per user) to keep interpretation aligned with the randomized unit.
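
A minimal pandas sketch of the difference, assuming an event log with user_id and event columns (column names are illustrative):

import pandas as pd

# Illustrative event log: one row per tracked event.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "event":   ["purchase", "purchase", "purchase", "purchase", "page_view"],
})
exposed_users = pd.Series([1, 2, 3, 4])  # all randomized/exposed users

purchases = events[events["event"] == "purchase"]

# Event-level: total purchase events per exposed user (user 1 contributes 3 of the 4).
purchases_per_exposed_user = len(purchases) / len(exposed_users)

# User-level: did each exposed user purchase at least once? (0/1 per user)
converted = exposed_users.isin(purchases["user_id"])
conversion_rate = converted.mean()

print(purchases_per_exposed_user)  # 1.0
print(conversion_rate)             # 0.5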

Attribution windows

An attribution window defines how long after exposure you count outcomes. This is essential in marketing (delayed conversions) and in product (time to adopt a feature).

  • Example: “Count purchases within 7 days after first exposure to the variant.”
  • Specify: start time (first exposure vs each exposure), end time (fixed days vs end of test), and timezone.
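
A sketch of a 7-day window keyed to first exposure, assuming exposure and purchase tables with UTC timestamps (the schema is illustrative):

import pandas as pd

exposures = pd.DataFrame({
    "user_id": [1, 1, 2],
    "exposure_ts": pd.to_datetime(
        ["2024-03-01 10:00", "2024-03-05 09:00", "2024-03-02 12:00"], utc=True),
})
purchases = pd.DataFrame({
    "user_id": [1, 2],
    "purchase_ts": pd.to_datetime(["2024-03-06 08:00", "2024-03-12 00:00"], utc=True),
})

# Window start: first exposure per user (not each exposure).
first_exposure = exposures.groupby("user_id", as_index=False)["exposure_ts"].min()

# Keep only outcomes inside [first_exposure, first_exposure + 7 days].
joined = purchases.merge(first_exposure, on="user_id")
in_window = joined[
    (joined["purchase_ts"] >= joined["exposure_ts"])
    & (joined["purchase_ts"] <= joined["exposure_ts"] + pd.Timedelta(days=7))
]
print(in_window["user_id"].tolist())  # [1]  (user 2 purchased outside the 7-day window)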

Deduplication rules

Deduplication prevents counting the same real-world action multiple times due to retries, refreshes, or tracking glitches.

  • Example: “Deduplicate ‘purchase_completed’ events by order_id; keep the earliest timestamp.”
  • Example: “Deduplicate ‘lead_submitted’ by (user_id, form_id) within 24 hours.”
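
A short sketch of the first rule in pandas, with illustrative column names:

import pandas as pd

# Raw purchase events, including a retry that produced a duplicate order_id.
purchases = pd.DataFrame({
    "order_id": ["A1", "A1", "B2"],
    "user_id":  [1, 1, 2],
    "ts": pd.to_datetime(["2024-03-01 10:00:05", "2024-03-01 10:00:09", "2024-03-02 11:00"]),
})

# Deduplicate by order_id, keeping the earliest timestamp.
deduped = purchases.sort_values("ts").drop_duplicates(subset="order_id", keep="first")
print(len(deduped))  # 2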

Bot and internal traffic filtering

Marketing and landing page tests are especially vulnerable to non-human traffic. Define exclusion rules up front.

  • Exclude known bot user agents and data center IP ranges.
  • Exclude internal users (employees, QA accounts) via allowlist/denylist.
  • Optionally require a minimum engagement signal (e.g., page visible for >3 seconds) for certain analyses—if predefined.
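
A minimal sketch of predefined exclusion rules; the lists and field names are illustrative assumptions:

# Hypothetical exclusion lists, defined before the test starts.
BOT_UA_SUBSTRINGS = ["bot", "crawler", "spider", "headless"]
INTERNAL_USER_IDS = {"qa_account_1", "employee_42"}

def is_excluded(visit):
    """Return True if the visit should be excluded from the analysis."""
    ua = (visit.get("user_agent") or "").lower()
    if any(token in ua for token in BOT_UA_SUBSTRINGS):
        return True                      # known bot user agent
    if visit.get("user_id") in INTERNAL_USER_IDS:
        return True                      # internal/QA traffic
    return False

visits = [
    {"user_id": "u1", "user_agent": "Mozilla/5.0"},
    {"user_id": "u2", "user_agent": "Googlebot/2.1"},
    {"user_id": "employee_42", "user_agent": "Mozilla/5.0"},
]
kept = [v for v in visits if not is_excluded(v)]
print([v["user_id"] for v in kept])  # ['u1']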

Instrumentation checklist (practical)

  • Event naming: stable, descriptive (e.g., trial_started, checkout_completed).
  • Required properties: user_id, timestamp, variant_id, plus domain keys like order_id, plan_type.
  • Exposure logging: log the first time a user is eligible and sees the variant (avoid “silent” assignment).
  • Idempotency keys: include unique IDs to deduplicate server-side.
  • Data validation: compare event counts to backend sources (e.g., payments provider) to detect tracking loss.
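
For example, a trial_started event might carry properties like these (the field names are an assumed schema, not a required one):

# Illustrative event payload following the checklist above.
trial_started_event = {
    "event_name": "trial_started",
    "user_id": "u_123",
    "timestamp": "2024-03-01T10:00:00Z",          # UTC, ISO 8601
    "variant_id": "pricing_page_v2",
    "plan_type": "pro_annual",
    "idempotency_key": "u_123:trial:2024-03-01",  # lets the server deduplicate retries
}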

Metric specification template

Metric name: Trial Start Rate (7-day)  
Type: Rate (binary per user)  
Unit of analysis: user_id  
Numerator: # users who trigger trial_started within 7 days of first exposure  
Denominator: # exposed eligible users  
Inclusion: new visitors, geo=US/CA, not currently subscribed  
Exclusion: bots, employees, users with missing user_id  
Attribution window: [first_exposure_ts, first_exposure_ts + 7 days]  
Deduplication: trial_started deduped by (user_id) keep earliest  
Source of truth: app event stream + subscription DB reconciliation  
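One way this spec might translate into code, sketched in Python with pandas; the table and column names are assumptions, and the inputs are assumed to already reflect the inclusion/exclusion rules:

import pandas as pd

def trial_start_rate_7d(exposures: pd.DataFrame, trials: pd.DataFrame) -> float:
    """Trial Start Rate (7-day): share of exposed eligible users who start a trial
    within 7 days of first exposure. Inputs are assumed pre-filtered for
    eligibility, bots, and employees."""
    # Window start: first exposure per user.
    first_exposure = exposures.groupby("user_id", as_index=False)["exposure_ts"].min()

    # One trial per user: keep the earliest trial_started.
    first_trial = trials.sort_values("trial_ts").drop_duplicates("user_id")

    # Denominator = all exposed eligible users; numerator = those with an
    # in-window trial (users without a trial compare as False).
    merged = first_exposure.merge(first_trial, on="user_id", how="left")
    in_window = (
        (merged["trial_ts"] >= merged["exposure_ts"])
        & (merged["trial_ts"] <= merged["exposure_ts"] + pd.Timedelta(days=7))
    )
    return in_window.mean()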

3) Metric Types: Rate, Continuous, and Time-to-Event

Rate metrics (conversion rate)

Rate metrics measure the fraction of units that achieve an outcome.

  • Example: Conversion rate = users who purchase / users exposed
  • Strengths: easy to interpret; robust for binary outcomes.
  • Watch-outs: define the denominator carefully (eligible users only), and ensure one conversion per user if that matches the decision.

Practical definition:

purchase_conversion_rate =  
  count_distinct(user_id where purchase_completed within window)  
  / count_distinct(user_id exposed and eligible)

Continuous metrics (revenue per user, average order value)

Continuous metrics measure numeric quantities per unit (often per user). Common examples are revenue per user (RPU), average order value (AOV), or time spent.

  • Example: RPU = total revenue within 7 days / exposed users
  • Strengths: captures magnitude, not just occurrence.
  • Watch-outs: heavy tails (a few users generate huge revenue) can make the metric noisy; define currency, refunds, taxes, and timing.

Practical definition:

rpu_7d =  
  sum(net_revenue within 7 days of first exposure)  
  / count_distinct(exposed eligible users)
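
A sketch of the same definition in pandas, assuming an orders table already deduplicated by order_id with a net_revenue column (column names are illustrative):

import pandas as pd

def rpu_7d(exposures: pd.DataFrame, orders: pd.DataFrame) -> float:
    """Net revenue per exposed user within 7 days of first exposure."""
    first_exposure = exposures.groupby("user_id", as_index=False)["exposure_ts"].min()

    joined = orders.merge(first_exposure, on="user_id")
    in_window = joined[
        (joined["order_ts"] >= joined["exposure_ts"])
        & (joined["order_ts"] <= joined["exposure_ts"] + pd.Timedelta(days=7))
    ]
    # Users with no orders contribute 0 to the numerator but stay in the denominator.
    # If heavy tails are a concern, a pre-registered cap on per-user revenue can reduce noise.
    return in_window["net_revenue"].sum() / first_exposure["user_id"].nunique()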

Time-to-event metrics (time to activation, time to first purchase)

Time-to-event metrics measure how long it takes for an outcome to occur. They are useful when speed matters (e.g., faster activation) and when not all users will experience the event during the observation window.

  • Example: “Time from signup to first key action”
  • Key concept: some users are censored (they haven’t converted yet by the end of the window). Your metric definition must specify how to handle them.

Two common approaches:

  • Fixed-window completion rate: “Activated within 3 days” (turns time-to-event into a rate metric).
  • Median time among converters: “Median time to first purchase among users who purchased within 14 days” (be explicit about conditioning).

4) Sensitivity vs Stability: Which Metrics Move and Why

Metric choice is a tradeoff between sensitivity (detects changes quickly) and stability (low noise, resistant to random fluctuations).

Why some metrics are sensitive

  • Closer to the UI change: CTA click-through often moves more than purchase rate.
  • Higher frequency: metrics that occur often (clicks) provide more data than rare outcomes (annual plan purchases).
  • Shorter feedback loop: “Add to cart” responds immediately; “retention at 30 days” takes time.

Why some metrics are stable

  • Aggregated per user: binary conversion per user is often more stable than raw event counts.
  • Less influenced by outliers: conversion rate is less affected by one huge purchase than revenue.
  • Clear eligibility: metrics with well-defined denominators avoid swings from traffic mix changes.

Practical guidance: pairing metrics

  • Use a value metric as primary when feasible (e.g., RPU), and add funnel rates as secondary to interpret movement.
  • If the value metric is too noisy, choose a leading indicator primary metric that is strongly linked to value, and keep the value metric as a key secondary metric.
Goal                       | More sensitive metric (leading) | More stable/value metric (lagging)
Increase purchases         | Add-to-cart rate                | Purchase conversion rate, RPU
Improve onboarding         | Step completion rate            | Activation rate, 7-day retention
Improve email performance  | Click-through rate              | Downstream purchase rate; unsubscribe rate (guardrail)
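
To see why this tradeoff matters for test duration, the rough sketch below compares how many users a frequent leading metric and a rarer lagging metric would need to detect the same relative lift, using the common n ≈ 16 · variance / δ² approximation (two-sided α = 0.05, 80% power). The baseline rates are made up for illustration.

from math import ceil

def n_per_arm(variance: float, delta: float) -> int:
    """Approximate users per arm via the n ≈ 16 * variance / delta^2 rule of thumb."""
    return ceil(16 * variance / delta**2)

# Leading metric: add-to-cart rate, baseline 20%, detect a 5% relative lift (1pp absolute).
p_cart = 0.20
n_cart = n_per_arm(p_cart * (1 - p_cart), 0.05 * p_cart)

# Lagging metric: purchase conversion rate, baseline 3%, same 5% relative lift.
p_buy = 0.03
n_buy = n_per_arm(p_buy * (1 - p_buy), 0.05 * p_buy)

print(n_cart, n_buy)  # the rarer, lagging metric needs far more users per arm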

5) Common Metric Mistakes (and How to Avoid Them)

Mistake: proxy metrics that don’t map to value

Proxy metrics are tempting because they move easily (pageviews, clicks, time on site). But if they don’t reliably connect to user value or business outcomes, they can lead you to ship harmful changes.

  • Example: Optimizing “time on page” can increase confusion rather than engagement.
  • Fix: Write the causal story: “If metric X increases, why should value Y increase?” If you can’t explain it, don’t make it primary.

Mistake: changing definitions mid-test

Redefining a metric after seeing results undermines interpretability and can bias decisions.

  • Example: Switching from “purchase within 7 days” to “purchase within 14 days” because the 7-day result is not significant.
  • Fix: Freeze metric definitions before launch. If you must add an exploratory metric, label it explicitly as exploratory and avoid using it as the ship criterion.

Mistake: double-counting

Double-counting happens when the same user action is recorded multiple times or when numerator/denominator units don’t match.

  • Example: Numerator counts purchases (events) while denominator counts users, but you interpret it as a conversion rate.
  • Example: Counting multiple “trial_started” events per user due to retries.
  • Fix: Align units (user/user or event/event), and implement deduplication keys (order_id, subscription_id).

Mistake: denominator drift (eligibility not enforced)

If the denominator includes users who could never convert (already subscribed, unsupported geo, logged-out users for an in-app action), the metric becomes less sensitive and less meaningful.

  • Fix: Define eligibility criteria and enforce them in the denominator query.

Mistake: mixing exposure and outcome timing

Counting outcomes that occurred before exposure (or long after) can dilute or invert effects.

  • Fix: Use first exposure timestamps and explicit attribution windows; exclude pre-exposure outcomes.

6) Exercises: Convert Business Objectives into Metric Specifications

These exercises are designed to practice turning a goal into a metric that is computable and decision-ready. For each, write: metric name, type, numerator, denominator, inclusion/exclusion, window, and deduplication.

Exercise 1: Landing page redesign to increase trial starts

Business objective: Increase the number of qualified trial starts from paid search traffic without increasing low-quality signups.

  • Primary metric (spec): Trial Start Rate (7-day, per user)
    • Numerator: distinct users with trial_started within 7 days of first exposure
    • Denominator: distinct exposed eligible users from paid search campaigns
    • Inclusion: utm_source in {google, bing}; new users (no prior account); geo=target markets
    • Exclusion: bots; employees; missing user_id; known fraudulent IP ranges
    • Deduplication: one trial per user; keep earliest trial_started
  • Guardrail metric (spec): Lead/Trial Quality Rate
    • Numerator: distinct trial users who reach “activated” milestone within 3 days (or pass a qualification rule)
    • Denominator: distinct users who started a trial

Exercise 2: Checkout change to increase revenue without harming refunds

Business objective: Increase net revenue per visitor by improving checkout UX, but do not increase refunds.

  • Primary metric (spec): Net Revenue per User (7-day)
    • Numerator: sum(gross_revenue - refunds - chargebacks) within 7 days of first exposure
    • Denominator: distinct exposed eligible users who reached checkout entry
    • Inclusion: users with checkout_started event; supported payment methods
    • Deduplication: revenue deduped by order_id
  • Guardrail metric (spec): Refund Rate (14-day)
    • Numerator: distinct orders refunded within 14 days of purchase
    • Denominator: distinct orders completed

Exercise 3: Onboarding experiment to improve activation speed

Business objective: Help new users reach the “Aha moment” faster without reducing overall activation.

  • Primary metric option A (time-to-event): Median Time to Activation (within 7 days)
    • Numerator/Denominator framing: compute time difference per user; summarize by median among users who activate within 7 days
    • Inclusion: new signups; first-time users; exclude reactivated accounts
    • Censoring rule: users not activated by day 7 are excluded from median calculation (must be stated)
  • Secondary metric (rate): Activation Rate (7-day)
    • Numerator: distinct users who activate within 7 days
    • Denominator: distinct exposed eligible new signups
  • Guardrail: Support Contacts per New User (7-day)
    • Numerator: number of support tickets from new users within 7 days
    • Denominator: exposed eligible new users

Exercise 4: Email subject line test to increase downstream purchases

Business objective: Improve campaign revenue, not just email engagement, while keeping unsubscribe rates flat.

  • Primary metric (spec): Purchase Conversion Rate from Email (3-day)
    • Numerator: distinct recipients who purchase within 3 days of email send
    • Denominator: distinct delivered recipients (exclude bounces)
    • Attribution: last-touch = email send; window = 72 hours; specify whether other campaigns reset the clock
  • Secondary metrics: open rate, click-through rate (diagnostics)
  • Guardrail: Unsubscribe Rate
    • Numerator: distinct recipients who unsubscribe within 7 days of send
    • Denominator: distinct delivered recipients

Exercise 5: Write a complete metric spec from scratch

Prompt: You are testing a new “recommended plan” module on the pricing page. The goal is to increase annual plan adoption without decreasing total paid conversions.

Your task: Draft (a) one primary metric, (b) two secondary metrics, and (c) two guardrails. For each metric, specify:

  • Unit of analysis (user/session/event)
  • Numerator and denominator (or aggregation rule for continuous metrics)
  • Eligibility/inclusion criteria (who is in the denominator)
  • Attribution window (start/end)
  • Deduplication keys
  • Exclusions (bots, employees, existing subscribers)

Now answer the exercise about the content:

In an A/B test, when should a variant be rejected even if it improves the primary metric?


Answer: Guardrail metrics set “do no harm” boundaries. If a variant crosses predefined guardrail thresholds (e.g., a higher refund rate), it can be rejected even when the primary metric improves.

Next chapter

Randomization and Experiment Integrity in A/B Testing
