1) Randomization unit choices and their consequences
Randomization is the mechanism that makes groups comparable. In practice, you must decide what entity gets assigned to variant A or B. The unit you randomize determines whether users can “leak” treatment to each other (interference) and whether the same person can end up in multiple variants (contamination).
Common units
- User-level (recommended when possible): assign by stable user_id. Minimizes contamination across sessions and devices (if identity is unified). Best for product changes that affect ongoing behavior.
- Account-level: assign by account_id (team, household, company). Useful when multiple users share an environment and can influence each other (e.g., collaboration tools). Reduces within-account interference but can reduce effective sample size if accounts are large.
- Session-level: assign by session_id. Higher risk of contamination because the same user can see both variants across sessions. Sometimes used for very short-lived experiences (e.g., a one-time landing page visit) but can bias results if users return.
- Device-level: assign by device_id/cookie. Can be appropriate for anonymous traffic, but breaks when users switch devices or clear cookies; can also create “ghost users” (multiple identifiers for one person).
Interference vs contamination: how the unit changes the risk
Contamination happens when the same real-world person is exposed to both variants. Interference happens when one unit’s outcome is affected by another unit’s assignment (e.g., network effects). Choosing a larger unit (account instead of user) can reduce interference within that unit, but may increase variance because you have fewer independent units.
| Scenario | Risk if randomize by user | Better unit |
|---|---|---|
| Family shares one tablet; purchase outcome is household-level | High contamination (shared device) and interference | Household/account if available; otherwise device with caution |
| Team collaboration feature (mentions, shared docs) | Interference within team; outcomes correlated | Account/team |
| Anonymous landing page with no login | No user_id available | Device/cookie; consider short time window |
| Pricing shown in email and on site | Cross-channel spillover | Randomize at user/contact level across channels |
Practical steps to choose the unit
- Write the causal story: “Variant changes X, which affects outcome Y for entity Z.” Z is your candidate unit.
- List ways a person can appear multiple times: multiple devices, multiple accounts, cookie resets, shared devices.
- List ways units influence each other: invites, referrals, shared inventory, social feeds, team workflows.
- Pick the smallest unit that avoids major interference: smaller units give you more independent observations (and therefore more power), but only if independence is actually plausible.
- Implement stable assignment: hash the chosen ID to a bucket so assignment is deterministic and consistent over time (a minimal sketch follows).
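A minimal Python sketch of deterministic bucketing, assuming a hypothetical assign_variant helper salted with an experiment ID (names are illustrative, not any specific library's API):

```python
import hashlib

def assign_variant(unit_id: str, experiment_id: str, weights: dict) -> str:
    """Deterministically map a unit to a variant: same inputs, same output."""
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    point = (int(digest, 16) % 10_000) / 10_000   # ~uniform position in [0, 1)
    cumulative = 0.0
    for variant, weight in weights.items():       # weights must sum to 1.0
        cumulative += weight
        if point < cumulative:
            return variant
    return variant                                # guard against float rounding

# Example: a 50/50 split; the same user_id always maps to the same variant.
print(assign_variant("user_42", "checkout_redesign_v1",
                     {"control": 0.5, "treatment": 0.5}))
```

Salting the hash with the experiment ID keeps bucket positions independent across concurrent experiments, so a user assigned to treatment in one test is not systematically more likely to land in treatment in the next.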
2) Allocation ratios (50/50, ramp-ups) and uneven splits
The allocation ratio is the fraction of traffic assigned to each variant. A 50/50 split is statistically efficient for comparing two variants because it maximizes power for a fixed total sample size. But operational constraints often justify uneven splits.
When 50/50 is best
- Two-variant comparisons where both variants are safe to expose.
- You want the fastest time to detect a given effect size.
- Costs and risks are symmetric across variants.
When uneven splits are appropriate
- Risk management / ramp-up: start with 1/99, then 5/95, 10/90, 25/75, and finally 50/50 as monitoring confirms no severe regressions.
- Capacity constraints: a new backend service can only handle a fraction of traffic.
- Learning about a new variant while protecting revenue: keep most traffic on control if the treatment might hurt key outcomes.
- Multiple treatments: with A/B/C/D, you might allocate more traffic to control because control is reused in every pairwise comparison.
Step-by-step: a safe ramp-up plan
- Define guardrails: error rate, latency, crash rate, unsubscribe rate, refund rate—metrics that indicate harm quickly.
- Choose ramp stages: e.g., 1% → 5% → 10% → 25% → 50%.
- Set hold times: keep each stage long enough to observe operational issues (often hours to a day), not necessarily long enough for full statistical inference.
- Automate rollback: if guardrails breach thresholds, revert allocation immediately (a minimal check is sketched after this list).
- Log allocation changes: store timestamps and ratios; later analysis needs to know exposure over time.
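A minimal sketch of the guardrail check and allocation log, assuming hypothetical metric names and thresholds (GUARDRAIL_LIMITS, should_rollback, and log_allocation_change are illustrative, not part of any specific platform):

```python
from datetime import datetime, timezone

# Illustrative guardrail thresholds; real values come from your own baselines.
GUARDRAIL_LIMITS = {"error_rate": 0.02, "p95_latency_ms": 1200, "crash_rate": 0.005}

def should_rollback(current_metrics: dict) -> list:
    """Return the list of guardrails that breached their limits."""
    return [name for name, limit in GUARDRAIL_LIMITS.items()
            if current_metrics.get(name, 0) > limit]

def log_allocation_change(treatment_share: float, reason: str) -> dict:
    """Record every allocation change so later analysis knows exposure over time."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "treatment_share": treatment_share,
        "reason": reason,
    }

# Example: at the 10% stage, a latency breach triggers an immediate rollback to 0%.
breaches = should_rollback({"error_rate": 0.01, "p95_latency_ms": 1500})
if breaches:
    print(log_allocation_change(0.0, f"rollback: {breaches}"))
```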
Note: Uneven splits reduce statistical efficiency. If you allocate 10/90, you’ll generally need more total traffic to achieve the same sensitivity as 50/50.
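To make that cost concrete, here is a small calculation, assuming equal outcome variance in both arms, of how much extra total traffic an uneven split needs to match the sensitivity of 50/50 (traffic_multiplier is an illustrative helper name):

```python
def traffic_multiplier(p: float) -> float:
    """Total-traffic multiplier needed to match the sensitivity of a 50/50 split.

    The variance of the difference in means scales with 1/n_A + 1/n_B.
    With shares p and 1-p of total N, that is 1 / (N * p * (1 - p)),
    versus 1 / (N * 0.25) for a 50/50 split.
    """
    return 0.25 / (p * (1 - p))

for p in (0.5, 0.25, 0.10, 0.01):
    print(f"{int(p*100)}/{int((1-p)*100)} split -> ~{traffic_multiplier(p):.1f}x total traffic")
# 50/50 -> 1.0x, 25/75 -> 1.3x, 10/90 -> 2.8x, 1/99 -> 25.3x
```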
3) Independence assumptions and what breaks them
Many A/B test analyses assume that each randomized unit’s outcome is independent of others. Randomization helps balance confounders, but it does not automatically guarantee independence if units interact or are misidentified.
Common independence breakers
- Shared devices and identifiers: multiple people use one device; one person uses multiple devices. This can create correlated outcomes and cross-variant exposure.
- Network effects: one user’s experience depends on other users (marketplaces, social feeds, messaging). Treatment can change the environment for control users.
- Cross-channel spillover: a user is randomized on-site but receives emails, push notifications, or ads that are not aligned with the same assignment.
- Inventory and capacity coupling: treatment increases demand and affects availability or prices for everyone.
- Geographic or time-based shocks: if assignment correlates with region/time due to implementation, outcomes become correlated within clusters.
Practical mitigations
- Use a stable, unified identity (user_id or account_id) and ensure all channels read the same assignment.
- Cluster randomize when interference is expected (e.g., randomize by account/team/region), and analyze at that cluster level (see the sketch after this list).
- Limit interaction during the experiment (e.g., exclude referral flows or treat them carefully) if feasible.
- Measure exposure pathways: log where the user saw the treatment (web, app, email) to detect spillover.
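A minimal sketch of cluster-level analysis on hypothetical event records keyed by account_id; in practice you would use clustered standard errors or a mixed model, and this only shows aggregating outcomes to the randomized unit before comparing:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exposure/outcome records; in practice these come from event logs.
events = [
    {"account_id": "acct_1", "user_id": "u1", "variant": "treatment", "converted": 1},
    {"account_id": "acct_1", "user_id": "u2", "variant": "treatment", "converted": 0},
    {"account_id": "acct_2", "user_id": "u3", "variant": "control",   "converted": 1},
    {"account_id": "acct_2", "user_id": "u4", "variant": "control",   "converted": 1},
]

# Aggregate outcomes to the randomization unit (the account), then compare
# cluster-level means so the analysis matches the unit that was randomized.
by_cluster = defaultdict(list)
for e in events:
    by_cluster[(e["variant"], e["account_id"])].append(e["converted"])

cluster_means = defaultdict(list)
for (variant, _), outcomes in by_cluster.items():
    cluster_means[variant].append(mean(outcomes))

for variant, means in cluster_means.items():
    print(variant, "cluster-level conversion:", mean(means))
```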
4) Sample Ratio Mismatch (SRM): what it is and how to detect it
Sample Ratio Mismatch (SRM) occurs when the observed split of assigned or observed users differs from the intended allocation ratio beyond what random chance would plausibly produce. SRM is a red flag for broken randomization, logging gaps, or filtering that affects variants differently.
Why SRM matters
- It signals selection bias: if some users are more likely to be counted in one variant, the groups may no longer be comparable.
- It can invalidate results: even if the outcome difference looks large, the sample may be systematically skewed.
- It helps catch implementation bugs early: misconfigured bucketing, caching, bot filtering, or event logging issues.
What causes SRM (common patterns)
- Assignment not truly random: hashing the wrong ID, using a time-based modulo, or non-uniform bucket mapping.
- Eligibility filters applied after assignment in a variant-dependent way (e.g., treatment page loads slower, causing more drop-offs before logging).
- Instrumentation differences: one variant fails to fire the “exposed” event due to JS errors or blocked resources.
- Redirects and caching: CDN caches one variant more, or redirects drop parameters needed for assignment.
- Bot or fraud filters that remove traffic unevenly across variants.
Simple SRM check (step-by-step)
- Decide what you’re counting: ideally count randomized units at the moment of assignment (e.g., “assigned” event), not only those who successfully rendered a page.
- Compute expected counts: for total N and target split p (e.g., 0.5), the expected counts are N·p and N·(1-p).
- Compare observed to expected: look at absolute and relative differences.
- Run a quick chi-square test for goodness-of-fit to quantify whether deviation is too large for chance.
For two variants A and B with expected proportions p and 1-p, the chi-square statistic is:
chi2 = (obsA - expA)^2 / expA + (obsB - expB)^2 / expB

With 1 degree of freedom, a very small p-value indicates SRM. Many teams use a strict threshold (e.g., p < 0.001) because SRM is often a sign of a real bug, not a subtle statistical fluctuation.
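A minimal version of this check in Python using scipy.stats.chisquare (the srm_check wrapper and the example counts are illustrative):

```python
from scipy.stats import chisquare

def srm_check(obs_a: int, obs_b: int, target_share_a: float = 0.5, alpha: float = 0.001):
    """Goodness-of-fit test of observed assignment counts against the target split."""
    total = obs_a + obs_b
    expected = [total * target_share_a, total * (1 - target_share_a)]
    stat, p_value = chisquare([obs_a, obs_b], f_exp=expected)
    return {"chi2": stat, "p_value": p_value, "srm": p_value < alpha}

# Example: a 50/50 test that observed a 49.3% / 50.7% split on ~100k units.
print(srm_check(49_300, 50_700))   # p < 0.001 -> flag for investigation
```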
Quick sanity thresholds (rule-of-thumb)
- Large experiments: even a 0.5–1% deviation can be statistically significant; treat it seriously.
- Small experiments: larger deviations can occur by chance; still investigate if the deviation is persistent over time or concentrated in certain segments.
- Segment SRM: check SRM by platform, browser, country, app version, and entry page. Bugs often appear only in one slice.
5) Practical integrity checklist
Experiment integrity is mostly operational. The goal is to ensure that assignment is consistent, exposure is measured correctly, and analysis populations match the causal question.
Assignment consistency
- Deterministic bucketing: assignment = hash(unit_id) → bucket → variant. Same unit_id must always map to the same variant.
- Sticky across time: avoid re-randomizing on each visit unless session-level is intentionally chosen.
- Sticky across surfaces: web/app/email should read the same assignment source of truth.
- Versioning: if you change the experiment logic mid-flight, store an experiment version and avoid mixing incompatible assignments.
Logging and event quality
- Log assignment separately from exposure: “assigned” (at randomization) and “exposed” (saw the variant) are different events.
- Include identifiers: unit_id, experiment_id, variant, timestamp, platform, app version, and a request/session id for debugging (an example event shape is sketched after this list).
- Monitor missingness: compare event volumes across variants for key instrumentation events (page_view, purchase, error).
- Deduplicate: ensure repeated events don’t inflate counts differently across variants.
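A sketch of what the two events might carry, with hypothetical field names; the key point is that "assigned" and "exposed" share the same identifiers so they can be joined and compared for missingness:

```python
from datetime import datetime, timezone
import uuid

# Hypothetical field names; adapt to your own event schema.
assigned_event = {
    "event": "experiment_assigned",
    "unit_id": "user_42",
    "experiment_id": "checkout_redesign_v1",
    "variant": "treatment",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "platform": "web",
    "app_version": "3.14.0",
    "request_id": str(uuid.uuid4()),
}

# The exposure event reuses the same identifiers so the two streams can be joined.
exposed_event = {**assigned_event, "event": "experiment_exposed"}
```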
Exposure vs intent-to-treat (ITT)
Intent-to-treat analyzes everyone as assigned, regardless of whether they fully experienced the treatment. ITT preserves the benefits of randomization and answers: “What is the effect of offering the change?” Exposure-based analysis answers: “What is the effect among those who actually saw it?” Exposure-based estimates are often biased because exposure can be affected by the treatment (e.g., slower load reduces exposure and selects a different subset).
- Default to ITT for primary decision-making.
- Use exposure analysis as a diagnostic (e.g., to understand adoption), but treat it as potentially biased unless you have a strong design to support it (see the comparison sketch below).
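A small comparison on hypothetical per-user records, showing that ITT divides by everyone assigned while the exposure-based estimate conditions on a post-assignment event (field names are illustrative):

```python
# Hypothetical per-user records: everyone was assigned, only some were exposed.
users = [
    {"variant": "treatment", "exposed": True,  "converted": True},
    {"variant": "treatment", "exposed": False, "converted": False},
    {"variant": "control",   "exposed": True,  "converted": False},
    {"variant": "control",   "exposed": True,  "converted": True},
]

def conversion_rate(rows):
    return sum(r["converted"] for r in rows) / len(rows) if rows else float("nan")

for variant in ("control", "treatment"):
    assigned = [u for u in users if u["variant"] == variant]   # ITT population
    exposed = [u for u in assigned if u["exposed"]]            # diagnostic only
    print(variant,
          "ITT:", round(conversion_rate(assigned), 2),
          "exposed-only:", round(conversion_rate(exposed), 2))
```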
Avoiding “ghost users”
“Ghost users” are duplicate or unstable identifiers that inflate sample size and break independence (one person appears as multiple units). Common sources: cookie resets, privacy modes, app reinstalls, device ID resets, and tracking prevention.
- Prefer authenticated IDs when available.
- Link identities (e.g., map device_id → user_id after login) and decide how to handle pre-login activity.
- Stability checks: measure how often unit_id changes for the same person/session; monitor spikes after releases.
- Exclude known bots consistently across variants, and log the filtering step.
6) Mini case studies: diagnosing biased results from broken randomization
Case 1: Session-level randomization causes contamination
Setup: A checkout redesign is randomized by session_id because anonymous users are common. Returning users often visit multiple times before purchasing.
Symptom: Treatment appears to increase conversion, but customer support reports confusion: some users see two different checkout flows on different visits.
Diagnosis:
- Same user experiences both variants across sessions (contamination).
- Users who return multiple times are systematically different (higher intent), and their mixed exposure makes attribution unclear.
Fix (step-by-step):
- Switch to user-level randomization for logged-in users; device-level for anonymous.
- Make assignment sticky for a defined window (e.g., 30 days) for anonymous device IDs.
- Analyze separately: logged-in cohort (clean) vs anonymous cohort (noisier).
Case 2: SRM from variant-dependent logging
Setup: A landing page test is 50/50. Exposure is logged via a client-side event after the page fully loads.
Symptom: Observed split is 47/53 with strong SRM. Treatment also shows worse conversion.
Diagnosis:
- Treatment page loads slower; more users bounce before the exposure event fires.
- Counting only “exposed” users removes more treatment users, creating SRM and selection bias.
Fix (step-by-step):
- Log an “assigned” event server-side at request time.
- Keep “exposed” as a secondary diagnostic event.
- Use ITT analysis based on assignment, not exposure.
- Investigate performance regression separately (it may be the real issue).
Case 3: Cross-channel spillover breaks independence
Setup: Website banner is randomized on-site, but email campaigns are sent to all users with the new messaging (treatment-like copy).
Symptom: No difference between A and B on-site, but overall sales increase during the test window. Segment analysis shows users who opened emails behave similarly in both variants.
Diagnosis:
- Email exposure changes behavior for both control and treatment users, diluting the on-site contrast.
- Outcomes are no longer independent of assignment because an external channel applies a similar treatment to everyone.
Fix (step-by-step):
- Randomize at the user/contact level and apply the same assignment in email and on-site.
- Log channel exposures (email open/click, site visit) with the experiment variant.
- If alignment isn’t possible, pause overlapping campaigns or treat the test as a multi-channel experiment.
Case 4: Shared devices create hidden interference
Setup: A streaming app tests a new recommendation layout. Randomization uses device_id on smart TVs.
Symptom: Treatment increases watch time but decreases profile-level satisfaction survey scores. Results vary wildly by household size proxies.
Diagnosis:
- Multiple household members share one device_id; their preferences conflict.
- One member’s interactions shape recommendations for others (interference within device).
Fix (step-by-step):
- Randomize by profile_id if available; otherwise by household/account.
- Analyze at the chosen unit (profile/household) to match the interference structure.
- Add instrumentation to detect multi-user devices (multiple profiles, frequent switching) and treat as a distinct segment.
Case 5: Allocation bug during ramp-up
Setup: Experiment ramps from 10/90 to 50/50. Assignment is computed in two services (edge and origin) with slightly different hashing logic.
Symptom: SRM appears only after ramp-up. Some users flip variants between requests. Metrics show unstable swings day-to-day.
Diagnosis:
- Inconsistent assignment across services causes non-sticky exposure and variant flipping.
- During ramp-up, the mismatch becomes more visible as more traffic is affected.
Fix (step-by-step):
- Centralize assignment in one service or shared library.
- Store assignment in a durable place (server-side profile, signed cookie) and reuse it.
- Backfill logs with an “assignment_source” field to identify which system assigned each user.
- Re-run the experiment after verifying stickiness and SRM checks at each ramp stage.