An A/B test (also called a controlled experiment) is a decision tool: you deliberately create two or more variants of a product or marketing experience, randomly assign comparable units (like users or sessions) to those variants, and measure pre-defined outcome metrics to decide whether to ship, iterate, stop, or roll out a change. In business terms, it answers: “If we change X, what happens to Y, compared to doing nothing?”
1) Core elements of a controlled experiment
Variants (what changes)
Variants are the different versions you compare. The baseline is usually called Control (A), and the change is Treatment (B). You can also run A/B/n tests (A vs B vs C) when comparing multiple treatments, but each additional variant increases complexity and the risk of confusing results if not planned carefully.
- Control (A): current experience (status quo).
- Treatment (B): proposed change (new design, copy, pricing display, onboarding flow, etc.).
- Optional additional treatments (C, D…): alternative ideas, ideally grounded in a clear hypothesis.
Units (who/what gets assigned)
The unit of randomization is the entity that gets assigned to a variant. Common choices:
- User-level: each user always sees the same variant across visits (good for experiences spanning multiple sessions).
- Session-level: each session is randomized (useful for short-lived experiences, but can cause users to see multiple variants over time).
- Account/team-level: for B2B products where multiple users share an account and could influence each other.
Choosing the wrong unit can contaminate results. For example, if you randomize by session but users return, they may see both A and B, which can dilute or distort the measured effect.
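To make user-level assignment concrete, here is a minimal sketch of deterministic bucketing, assuming a stable user identifier is available; the function name, salt format, and weights are illustrative, not a specific library's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("A", "B"), weights=(0.5, 0.5)) -> str:
    """Deterministically map a user to a variant.

    Hashing user_id together with an experiment-specific salt means the same
    user always lands in the same bucket across sessions, and different
    experiments are randomized independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # pseudo-uniform value in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

# The same user gets the same variant on every call (stable across sessions).
assert assign_variant("user-123", "landing_page_redesign") == \
       assign_variant("user-123", "landing_page_redesign")
```

Randomizing at the session or account level works the same way; only the identifier fed into the hash changes.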
Exposure (what it means to “be in” the test)
Exposure defines when a unit is considered to have received the variant. In practice, you need an explicit rule such as:
- “A user is exposed when the landing page fully loads.”
- “A session is exposed when the checkout page is rendered.”
- “An account is exposed when any member sees the new pricing table.”
This matters because measurement should be anchored to a consistent exposure event. If you count outcomes for people who were never actually exposed, you weaken the test’s ability to detect real effects.
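As an illustration, a minimal exposure-logging helper might look like the sketch below; the event name and fields are assumptions for this example, not a prescribed schema.

```python
from datetime import datetime, timezone

def log_exposure(log_event, user_id: str, experiment: str, variant: str) -> None:
    """Record the moment a unit is considered exposed to its variant.

    Call this at the agreed trigger point (e.g., when the landing page
    renders), not merely when the assignment is computed, so that outcomes
    can be anchored to units that actually saw the experience.
    """
    log_event({
        "event": "exposure",
        "experiment": experiment,
        "user_id": user_id,
        "variant": variant,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Example with a trivial in-memory "logger":
events = []
log_exposure(events.append, user_id="user-123",
             experiment="landing_page_redesign", variant="B")
```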
Outcomes (what you measure)
Outcome metrics are the pre-defined measures used to judge success. In business, these map to decisions: ship/roll out if the treatment improves the primary outcome without unacceptable harm to key guardrails; iterate if results are directionally promising but uncertain; stop if it harms outcomes or fails to justify costs.
- Primary metric: the main success measure tied to the decision (e.g., sign-up conversion rate).
- Secondary metrics: supporting measures that help interpret behavior (e.g., click-through to sign-up form).
- Guardrail metrics: “do no harm” checks (e.g., page load time, refund rate, unsubscribe rate, complaint rate).
Outcomes should be defined precisely, including the denominator and time window. Example: “Sign-up conversion = number of users who complete sign-up within 24 hours of landing page exposure / number of exposed users.”
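A sketch of that definition in code, assuming exposures and sign-ups are available as (user_id, timestamp) pairs, makes the denominator and the 24-hour window explicit:

```python
from datetime import datetime, timedelta

def signup_conversion_rate(exposures, signups, window=timedelta(hours=24)) -> float:
    """Sign-up conversion = exposed users who sign up within the window / exposed users.

    `exposures` and `signups` are lists of (user_id, datetime) tuples.
    """
    # Denominator: every exposed user, keyed by their first exposure time.
    first_exposure = {}
    for user_id, ts in exposures:
        if user_id not in first_exposure or ts < first_exposure[user_id]:
            first_exposure[user_id] = ts

    # Numerator: exposed users whose sign-up falls inside the window.
    converted = set()
    for user_id, ts in signups:
        exposed_at = first_exposure.get(user_id)
        if exposed_at is not None and exposed_at <= ts <= exposed_at + window:
            converted.add(user_id)

    return len(converted) / len(first_exposure) if first_exposure else 0.0

exposures = [("u1", datetime(2024, 5, 1, 9)), ("u2", datetime(2024, 5, 1, 10))]
signups = [("u1", datetime(2024, 5, 1, 12))]       # within 24 h, so it counts
print(signup_conversion_rate(exposures, signups))  # 0.5
```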
2) Causal reasoning in practice: why randomization isolates the effect
In business, you want to know whether the change caused the improvement, not merely whether improvement happened around the same time. Randomization is the mechanism that makes causal interpretation credible.
What randomization does
When assignment is random, the groups are expected to be comparable on both:
- Observed factors: device type, traffic source, geography, returning vs new users.
- Unobserved factors: motivation, urgency, brand affinity, willingness to pay.
Because these factors are (on average) balanced between A and B, the main systematic difference between groups is the variant itself. That means differences in outcomes can be attributed to the change rather than to confounding influences.
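A quick simulation illustrates the point: even a trait the system never records (a made-up "motivation" score here) ends up nearly identical across groups once assignment is random.

```python
import random
random.seed(7)

# Simulate 100,000 users with an unobserved "motivation" trait.
users = [{"motivation": random.gauss(0, 1)} for _ in range(100_000)]

# Random assignment ignores motivation entirely...
for u in users:
    u["variant"] = random.choice(["A", "B"])

# ...yet both groups end up with nearly identical average motivation.
for variant in ("A", "B"):
    group = [u["motivation"] for u in users if u["variant"] == variant]
    print(variant, len(group), round(sum(group) / len(group), 4))
```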
Why “same time, same audience” is not enough
Even if you compare two experiences during the same week, you can still be misled if people self-select into variants (for example, if the new experience is shown only to mobile users, or only to paid traffic). Randomization prevents this self-selection by design.
Practical checklist for preserving causality
- Random assignment is enforced by the system (not by user choice, not by marketing targeting rules).
- Stable assignment when needed (e.g., user-level bucketing so returning users see the same variant).
- Consistent exposure definition (log an exposure event and tie outcomes to it).
- Avoid interference where one unit’s variant affects another unit’s outcome (common in social features, marketplaces, and team accounts).
3) Translating business questions into experiment questions
Business questions are often broad (“Should we redesign checkout?”). Experiment questions must be specific, measurable, and tied to a decision.
A translation template
Use this structure to convert a business question into an experiment-ready question:
- Change (X): what exactly will be different?
- Population: who is eligible to be randomized?
- Primary outcome (Y): what metric reflects success?
- Decision rule: what result would make you ship/iterate/stop?
- Guardrails: what must not get worse beyond an acceptable threshold?
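One lightweight way to enforce the template is to write it down as a structured record before implementation starts; the class and field names below simply mirror the checklist above and are illustrative, not a required format.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentPlan:
    """Experiment-ready translation of a business question."""
    change: str            # X: what exactly will be different
    population: str        # who is eligible to be randomized
    primary_outcome: str   # Y: the metric that reflects success
    decision_rule: str     # what result triggers ship / iterate / stop
    guardrails: list[str] = field(default_factory=list)  # what must not get worse

plan = ExperimentPlan(
    change="New pricing table on the pricing page",
    population="New visitors from paid search",
    primary_outcome="Trial start rate",
    decision_rule="Roll out to paid traffic if trial starts improve and refunds are flat; otherwise revert",
    guardrails=["Refund rate", "Page load time"],
)
```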
Examples of translations
| Business question | Experiment question | Primary metric | Decision supported |
|---|---|---|---|
| “Does the new checkout reduce abandonment?” | “Among users who reach checkout, does the new checkout flow reduce checkout abandonment vs current flow?” | Checkout completion rate (or abandonment rate) | Ship new checkout, iterate, or stop |
| “Should we change our pricing page?” | “For new visitors from paid search, does the new pricing table increase trial starts without increasing refunds?” | Trial start rate | Roll out to paid traffic or revert |
| “Is the new email subject line better?” | “For weekly newsletter recipients, does subject line B increase opens and downstream clicks?” | Click-through rate (guardrail: unsubscribe) | Adopt subject line strategy |
Step-by-step: turning a vague idea into a test plan
1. Write the hypothesis in plain language. Example: “If we simplify the landing page, more visitors will start a sign-up because the value proposition is clearer.”
2. Specify the exact variant difference. Example: “Replace the hero section with a shorter headline, add 3 bullet benefits, and move the sign-up form above the fold.”
3. Choose the unit and exposure event. Example: unit = user; exposure = landing page render event.
4. Define the primary metric and window. Example: “Sign-up completion within 24 hours of exposure.”
5. Add guardrails. Example: “Page load time must not increase; support contact rate must not increase.”
6. Pre-commit to decisions. Example: “If sign-up conversion improves and guardrails are stable, roll out; if neutral, iterate on messaging; if worse, stop and revert.”
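Pre-committing to decisions can be as literal as writing the decision logic before any data arrives. A minimal sketch, assuming results are summarized as a relative lift on the primary metric plus a single pass/fail flag for the guardrails (the threshold here is a placeholder; statistical criteria and sample sizing are covered in later chapters):

```python
def decide(primary_lift: float, guardrails_ok: bool, min_lift: float = 0.0) -> str:
    """Map pre-agreed conditions to a ship / iterate / stop decision.

    primary_lift: relative change in the primary metric (e.g., +0.03 = +3%).
    guardrails_ok: True if no guardrail worsened beyond its agreed limit.
    """
    if not guardrails_ok or primary_lift < 0:
        return "stop: revert the change"
    if primary_lift > min_lift:
        return "ship: roll out the new landing page"
    return "iterate: results are neutral, refine the messaging"

print(decide(primary_lift=0.04, guardrails_ok=True))   # ship
print(decide(primary_lift=0.00, guardrails_ok=True))   # iterate
print(decide(primary_lift=0.04, guardrails_ok=False))  # stop
```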
4) Correlation signals vs experimental evidence
Teams often see a pattern in analytics and assume it implies causation. A/B testing is the discipline of separating signals (correlations) from evidence (causal effects under controlled assignment).
Correlation signals (useful, but not decisive)
Examples of correlation signals:
- “Users who watch the demo video convert more.”
- “Conversion rate increased after we changed the landing page.”
- “Customers from channel X have higher retention.”
These are valuable for generating hypotheses, but they do not prove that the observed factor caused the outcome. For instance, users who watch the demo may already be more motivated; the video might be a marker of intent, not the driver of conversion.
Experimental evidence (decision-grade)
In an A/B test, the question becomes: “If we make more people see the demo video (by changing the experience), does conversion increase compared to not making that change?” Random assignment is what turns a plausible story into a measurable causal claim.
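A toy simulation shows why the two comparisons can disagree: when motivated users both watch the demo and convert more, the observational gap far exceeds the video's true effect, while a randomized comparison recovers it. All numbers are invented for illustration.

```python
import random
random.seed(1)

def simulate_user(forced_to_watch=None):
    motivated = random.random() < 0.3                    # latent trait, unobserved
    watched = forced_to_watch if forced_to_watch is not None else (
        random.random() < (0.8 if motivated else 0.2))   # motivated users self-select into watching
    base = 0.25 if motivated else 0.05                   # motivation drives conversion...
    p_convert = base + (0.03 if watched else 0.0)        # ...the video adds only a small true lift
    return watched, random.random() < p_convert

def conversion_rate(group):
    return sum(converted for _, converted in group) / len(group)

# Observational comparison: watchers vs non-watchers (confounded by motivation).
users = [simulate_user() for _ in range(200_000)]
watchers = [u for u in users if u[0]]
non_watchers = [u for u in users if not u[0]]
print("observational gap:", round(conversion_rate(watchers) - conversion_rate(non_watchers), 3))

# Randomized comparison: assignment decides who watches, independent of motivation.
treated = [simulate_user(forced_to_watch=True) for _ in range(200_000)]
control = [simulate_user(forced_to_watch=False) for _ in range(200_000)]
print("randomized lift:", round(conversion_rate(treated) - conversion_rate(control), 3))
```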
Common pitfalls when mistaking correlation for causation
- Seasonality and timing: conversion rises because of holidays, paydays, or campaigns, not because of the product change.
- Traffic mix shifts: more high-intent traffic arrives (e.g., brand search), inflating conversion independent of the change.
- Regression to the mean: an unusually bad week is followed by a normal week, making any change look like an improvement.
- Selective rollout: launching to a subset (e.g., only iOS) creates systematic differences that can mimic effects.
Use observational data to prioritize what to test; use experiments to decide.
5) End-to-end anchor example: landing page change to increase sign-ups
We will use a single running example to anchor concepts: a SaaS company wants to increase sign-ups by improving the landing page.
The business decision
The team must decide whether to roll out a redesigned landing page to all traffic. The decision options are:
- Ship/Roll out: if sign-ups improve meaningfully and guardrails are acceptable.
- Iterate: if results suggest potential but are uncertain or reveal trade-offs.
- Stop/Revert: if sign-ups drop or guardrails degrade.
Define the controlled experiment
Variants
- A (Control): current landing page with long hero text and sign-up form below the fold.
- B (Treatment): shorter headline + three benefit bullets + sign-up form above the fold + a single testimonial.
Unit
- Randomize at the user level so the same person sees the same page on repeat visits.
Exposure
- Exposure event: `landing_page_view`, logged when the page is rendered and the variant is determined.
Outcomes
- Primary metric: sign-up conversion rate within 24 hours of exposure.
- Secondary metrics: click-through to sign-up form, time to start sign-up, bounce rate.
- Guardrails: page load time, error rate, support chat initiation rate.
Step-by-step execution (practical workflow)
1. Write the experiment question. “Among eligible new visitors, does the redesigned landing page (B) increase completed sign-ups within 24 hours compared to the current page (A)?”
2. Define eligibility. Example: include new visitors on desktop and mobile web; exclude internal employees and bots; exclude users already logged in.
3. Implement random assignment. Assign each eligible user to A or B with a fixed split (e.g., 50/50). Store the assignment so it persists across sessions.
4. Instrument exposure and outcomes. Log `landing_page_view` with `variant`, `user_id`, and `timestamp`. Log `signup_completed` with the same identifiers, and join outcomes back to the exposure.
5. Pre-define decision thresholds. Example: “Roll out if the primary metric improves and guardrails do not worsen beyond agreed limits.” (The exact statistical thresholds and sample sizing are handled in later chapters; here the key is to pre-commit before looking at results.)
6. Run the test and monitor integrity. Confirm the traffic split is as expected, exposure logging is stable, and no major bugs affect only one variant.
7. Interpret results as a causal comparison. Compare outcomes between A and B for exposed users. If B shows better sign-up conversion and guardrails are stable, the evidence supports rolling out the change. (A compact code sketch of this workflow follows the list below.)
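Putting the steps together, here is a compact sketch under simplifying assumptions: assignment uses hash-based bucketing, exposures and sign-ups are in-memory lists of tuples, and “monitor integrity” is reduced to checking the realized traffic split. A real pipeline would implement each piece in its logging and analysis stack.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta

def assign_variant(user_id: str, experiment: str = "landing_page_redesign") -> str:
    """Stable 50/50 split: the same user always maps to the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest[:8], 16) % 2 == 0 else "B"

def analyze(exposures, signups, window=timedelta(hours=24)):
    """exposures: list of (user_id, variant, timestamp); signups: list of (user_id, timestamp)."""
    # Keep each user's first exposure and its variant.
    first_exposure = {}
    for user_id, variant, ts in exposures:
        if user_id not in first_exposure or ts < first_exposure[user_id][1]:
            first_exposure[user_id] = (variant, ts)

    # Integrity check: the realized split should be close to the planned 50/50.
    split = Counter(variant for variant, _ in first_exposure.values())
    print("traffic split:", dict(split))

    # Attribute each sign-up to the user's variant if it falls inside the window.
    converted = {"A": set(), "B": set()}
    for user_id, ts in signups:
        record = first_exposure.get(user_id)
        if record and record[1] <= ts <= record[1] + window:
            converted[record[0]].add(user_id)

    # Causal comparison: conversion rate per variant among exposed users.
    for variant in ("A", "B"):
        exposed = [u for u, (v, _) in first_exposure.items() if v == variant]
        rate = len(converted[variant]) / len(exposed) if exposed else 0.0
        print(variant, "conversion:", round(rate, 4))

# Tiny illustrative event log (user ids and timestamps are made up).
t0 = datetime(2024, 5, 1, 9, 0)
exposures = [(f"u{i}", assign_variant(f"u{i}"), t0) for i in range(8)]
signups = [("u1", t0 + timedelta(hours=2)),
           ("u4", t0 + timedelta(hours=30))]  # u4 falls outside the 24 h window
analyze(exposures, signups)
```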
How this example maps to product and marketing decisions
| Team | Typical question | In this example | Decision |
|---|---|---|---|
| Marketing | “Does this message increase acquisition?” | New headline and benefit bullets | Adopt messaging and scale campaigns |
| Product | “Does this UX change improve activation?” | Form above the fold, simplified layout | Ship UX change or iterate |
| Growth | “Where is the biggest leverage?” | Primary + secondary metrics show where users drop | Prioritize next experiments |
Throughout the course, this landing page experiment will be used to illustrate how to define metrics precisely, reason about causality, and connect experimental results to concrete business actions.