What You Are Building in This Capstone
This capstone has two deliverables that must fit together: (1) a Bayesian experiment plan that is specific enough to execute without improvisation, and (2) a one-page decision memo that a busy stakeholder can approve, reject, or revise. The key skill is translation: turning a business decision into a measurable outcome, turning uncertainty into explicit decision criteria, and turning a modeling approach into an operational plan (instrumentation, sampling, monitoring, and governance). You are not proving a theorem or “winning” an A/B test; you are designing a decision process that remains coherent when reality is messy (missing data, seasonality, heterogeneous users, and shifting constraints).
Capstone Scenario Options (Pick One)
Choose one scenario so your plan and memo remain concrete. You can substitute your own, but keep the same structure.
- Product growth: Change onboarding flow to increase activation within 7 days.
- Pricing: Introduce an annual plan discount and measure revenue per visitor and churn.
- Operations: Add a new call-center script and measure resolution time and customer satisfaction.
- Marketing: New ad creative and landing page; measure qualified leads and downstream conversion.
- Education: New lesson format; measure completion and assessment scores.
In the rest of this chapter, examples will reference onboarding activation, but the steps apply to any scenario.
Step 1: Write the Decision in One Sentence (Not the Experiment)
Decision statement: “Should we ship the new onboarding flow to 100% of new users this quarter?” This is intentionally binary and time-bounded. If your real world has more options, list them explicitly (e.g., ship, iterate, roll out to 50%, or abandon). The experiment exists to reduce uncertainty about the consequences of these options.
Decision options checklist
- List the feasible actions you can actually take after the experiment.
- Include a “do nothing / keep current” option as baseline.
- Note any irreversible actions (e.g., pricing changes) that raise the bar for evidence.
Step 2: Define the Outcome as a Decision-Relevant Quantity
Pick one primary outcome that maps directly to value. For onboarding, a common primary outcome is 7-day activation rate. But a decision rarely depends on a single metric; define a small set of guardrails that prevent harmful wins (e.g., activation up but support tickets spike).
Outcome definition template
- Primary: Activation within 7 days (binary per user).
- Guardrail 1: 7-day retention (binary).
- Guardrail 2: Support contacts per activated user (count/rate).
- Guardrail 3: Time-to-first-value (continuous, possibly skewed).
For each metric, specify: unit of analysis (user, account, session), time window, inclusion/exclusion rules, and how you handle multiple devices or accounts. Ambiguity here is the fastest way to invalidate an experiment.
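One low-effort way to force this precision is to encode each metric definition as a structured record that engineering and analytics review together. The sketch below is illustrative Python; the field names and values (MetricSpec, window_days, the exclusion text) are placeholders for your own definitions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class MetricSpec:
    name: str          # label used in the plan, dashboard, and memo
    unit: str          # unit of analysis: "user", "account", or "session"
    window_days: int   # measurement window after exposure
    inclusions: str    # who counts toward the metric
    exclusions: str    # bots, internal traffic, duplicate accounts, etc.

# Hypothetical primary metric for the onboarding scenario.
PRIMARY = MetricSpec(
    name="activation_7d",
    unit="user",
    window_days=7,
    inclusions="new users exposed to onboarding screen 1",
    exclusions="internal traffic, known bots; duplicate devices merged by user ID",
)
```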
Step 3: Specify the Estimand and the Comparison
Write down exactly what effect you want to estimate. Example: “Difference in activation probability between new onboarding (B) and current onboarding (A) for new users in the US during the experiment window.” If you expect heterogeneity, define subgroup estimands up front (e.g., mobile vs desktop, paid vs organic). Avoid adding subgroups after seeing results unless you label them exploratory.
Estimand checklist
- Population: who is eligible?
- Treatment: what exactly changes?
- Outcome: how measured and when?
- Summary: difference, ratio, or lift?
- Interference: can one user’s treatment affect another’s outcome (network effects)?
Step 4: Map the Data Generating Process and Threats
Before modeling, list the ways the data can lie to you. This is not “statistics”; it is operational reality. For onboarding, common threats include bot traffic, delayed event logging, users who churn before 7 days (censoring), and changes in acquisition mix during the test.
Threat-to-mitigation table (example)
- Event delay: Add a 48-hour buffer before final reads; monitor event pipeline health.
- Sample ratio mismatch: Automated daily check (see the sketch after this table); pause if allocation deviates beyond tolerance.
- Novelty effect: Track effect by day since exposure; pre-plan a minimum exposure duration.
- Instrumentation change mid-test: Freeze tracking schema; version events; log client build.
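The sample ratio mismatch check above is straightforward to automate as a chi-square goodness-of-fit test of observed arm counts against the planned allocation. A minimal sketch, assuming scipy is available; the counts, split, and alert threshold are placeholders.

```python
from scipy.stats import chisquare

def srm_check(n_treatment: int, n_control: int,
              planned_split=(0.5, 0.5), alpha=0.001):
    """Flag a sample ratio mismatch: observed arm counts vs. planned allocation."""
    total = n_treatment + n_control
    expected = [total * planned_split[0], total * planned_split[1]]
    _, p_value = chisquare([n_treatment, n_control], f_exp=expected)
    return {"p_value": p_value, "srm_detected": p_value < alpha}

# Hypothetical daily read: pause the ramp if srm_detected is True.
print(srm_check(25_310, 24_690))
```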
Step 5: Choose the Model Family That Matches the Metrics
You have already learned multiple Bayesian modeling patterns; in this capstone you focus on selection and justification, not re-deriving basics. Choose a model per metric and explain why it matches the outcome type and expected quirks.
Example model choices
- Activation (binary): Logistic regression with treatment indicator; optionally include covariates (platform, acquisition channel) to improve precision.
- Support contacts (count/rate): Negative binomial regression with exposure offset (e.g., per activated user or per user-days).
- Time-to-first-value (skewed continuous): Log-normal or gamma regression; consider censoring if not all users reach value within window.
- Multiple segments: Hierarchical treatment effects by segment to stabilize estimates and avoid overreacting to small subgroups.
Write down the minimal set of covariates you will include and why. Covariates should be pre-treatment and reliably measured. If you include many covariates, specify how you will regularize to avoid unstable estimates.
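As one concrete illustration for the activation metric, the logistic model with a treatment indicator and a platform covariate could be specified as below. This assumes PyMC; the priors, variable names, and sampler settings are placeholders to adapt, not the required specification.

```python
import pymc as pm

def fit_activation_model(activated, treated, platform_idx, n_platforms):
    """Bayesian logistic regression: treatment effect plus a platform covariate."""
    with pm.Model():
        intercept = pm.Normal("intercept", mu=0.0, sigma=1.5)
        treatment_effect = pm.Normal("treatment_effect", mu=0.0, sigma=0.5)
        platform_effect = pm.Normal("platform_effect", mu=0.0, sigma=0.5,
                                    shape=n_platforms)  # weakly informative priors
        logit_p = intercept + treatment_effect * treated + platform_effect[platform_idx]
        pm.Bernoulli("activated", logit_p=logit_p, observed=activated)
        return pm.sample(1000, tune=1000, chains=4, target_accept=0.9)
```

Swapping in hierarchical segment effects (the last bullet above) mostly means giving the segment effect an estimated group-level standard deviation instead of a fixed one.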
Step 6: Define Decision Criteria in Business Terms
Your experiment plan must state what “good enough” means. Use thresholds that connect to value and risk. For onboarding, you might require a minimum expected gain in activation and also require that the probability of harming retention is small.
Decision criteria example (ship vs iterate)
- Ship if: Expected incremental activated users per 10,000 new users is at least 120 and probability that 7-day retention decreases by more than 0.3 percentage points is below 10%.
- Iterate if: Expected gain is positive but below 120, or uncertainty is too large to rule out meaningful harm.
- Do not ship if: Expected gain is negative or probability of meaningful harm exceeds threshold.
Also define a practical equivalence region (a “no meaningful difference” band) for each metric. This prevents endless testing over tiny effects that do not matter operationally.
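Once posteriors exist, these criteria reduce to a few lines of arithmetic on posterior draws. A minimal sketch with numpy; lift_draws and retention_delta_draws are assumed to be draws (on the probability scale, B minus A) from your fitted models, and the thresholds mirror the example above.

```python
import numpy as np

def decide(lift_draws, retention_delta_draws,
           min_gain_per_10k=120, harm_pp=0.003, max_harm_prob=0.10):
    """Apply the ship / iterate / do-not-ship rule to posterior draws."""
    expected_gain_per_10k = 10_000 * np.mean(lift_draws)
    prob_retention_harm = np.mean(np.asarray(retention_delta_draws) < -harm_pp)
    if expected_gain_per_10k >= min_gain_per_10k and prob_retention_harm < max_harm_prob:
        return "ship"
    if expected_gain_per_10k <= 0 or prob_retention_harm >= max_harm_prob:
        return "do not ship"
    return "iterate"
```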
Step 7: Plan the Experiment Mechanics (Randomization, Ramp, and Duration)
Write the operational plan as if an engineer and an analyst will execute it without further meetings.
Mechanics checklist
- Randomization unit: user ID (sticky assignment across sessions/devices if possible).
- Allocation: start 10/90 (B/A) for 24 hours, then move to 50/50 if guardrails are stable.
- Eligibility: new users only; exclude internal traffic and known bots.
- Experiment window: start date/time, end date/time, and time zone.
- Exposure definition: user is “exposed” when they see onboarding screen 1; log exposure event.
- Analysis window: outcomes measured for 7 days post-exposure; final read occurs after last user completes window plus buffer.
Include ramp rules: what triggers a pause (e.g., crash rate, severe guardrail breach), who can approve resuming, and where the monitoring dashboard lives.
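Sticky assignment is commonly implemented by hashing a stable user ID into a bucket, so the same user lands in the same arm on every session and device. A minimal sketch; the salt and the ramp shares are placeholders.

```python
import hashlib

def assign_arm(user_id: str, treatment_share: float = 0.5,
               salt: str = "onboarding-v2") -> str:
    """Deterministically map a user ID to arm A or B; same input, same arm."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "B" if bucket < treatment_share else "A"

# During the initial 10/90 ramp: assign_arm(user_id, treatment_share=0.10)
```

Changing the salt creates a fresh randomization; keep it fixed for the life of the experiment.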
Step 8: Determine Information Targets (Not Just a Fixed Sample Size)
Instead of only stating “we need N users,” specify an information target: the amount of data needed to make the posterior tight enough around the decision boundary. Practically, you can approximate this by simulating expected uncertainty under plausible effect sizes and traffic volumes, then choosing a minimum duration and a maximum duration.
Practical approach to information targets
- Set a minimum runtime to cover weekly cycles (often 1–2 weeks) and reduce day-of-week confounding.
- Set a maximum runtime to avoid opportunity cost (e.g., 4–6 weeks).
- Define a precision goal such as: “90% credible interval width for activation lift is below 0.4 percentage points” or “probability of crossing ship threshold is above 95%.”
Write down what you will do if you hit maximum runtime without clarity: ship to a small percentage, run a follow-up focused on a key segment, or abandon due to low upside.
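Checking the precision goal at each scheduled read is also a one-liner on posterior draws. A minimal sketch with numpy; the 0.4 percentage-point target mirrors the example above.

```python
import numpy as np

def precision_reached(lift_draws, target_width_pp=0.4, interval=0.90):
    """True when the central credible interval for activation lift is narrow enough."""
    tail = (1 - interval) / 2
    lower, upper = np.quantile(lift_draws, [tail, 1 - tail])
    width_pp = (upper - lower) * 100  # probability-scale lift -> percentage points
    return width_pp <= target_width_pp, width_pp
```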
Step 9: Pre-Register the Analysis Plan (Lightweight but Real)
Pre-registration here means a time-stamped document in your repo or internal wiki that prevents silent goalpost moves. Keep it short and operational.
Pre-registration contents
- Primary and guardrail metrics with exact definitions.
- Model specification per metric (including covariates and segment structure).
- Decision criteria and thresholds.
- Data exclusions (bots, internal users, logging failures) and how they are detected.
- Stopping and ramp rules (including who decides).
- Planned exploratory analyses (clearly labeled).
Step 10: Build a Simulation-First Workflow to Validate the Plan
Before running the real experiment, simulate data that resembles your expected traffic and baseline rates, then run your planned analysis on the simulated data. The goal is not to predict the true effect; it is to verify that your pipeline, model, and decision criteria behave sensibly under known conditions (no effect, small positive effect, harmful effect).
Simulation checklist
- Generate synthetic users with covariates (platform, channel) and outcomes.
- Inject realistic issues: missing events, delayed logging, segment imbalance.
- Confirm that when the true effect is zero, you rarely “ship.”
- Confirm that when the true effect is meaningfully positive, you often “ship” within your max runtime.
- Confirm that harmful effects trigger “do not ship” or guardrail pauses.
```
# Pseudocode sketch: simulation-first validation (language-agnostic) for activation metric
set baseline_activation = 0.22
set true_lift = 0.005        # +0.5 percentage points
set n_users = 50000

for each user i in 1..n_users:
    platform ~ Categorical({mobile, desktop})
    assign treatment ~ Bernoulli(0.5)
    p = baseline_activation + treatment * true_lift + platform_effect(platform)
    activated ~ Bernoulli(p)

fit planned Bayesian logistic model
compute decision quantities:
    expected_incremental_activations_per_10k
    prob_retention_harm_gt_threshold (if modeled)
apply decision rule and record outcome

repeat many times to estimate how often each decision occurs
```

This step often reveals missing definitions (e.g., what is the exposure event?) and brittle logic (e.g., decision rule too aggressive under noise).
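To make the loop concrete, here is a runnable Python stand-in under deliberately simplifying assumptions: it drops the platform covariate and the retention guardrail, and it uses a conjugate Beta-Binomial posterior per arm instead of the full logistic model so that hundreds of repetitions run in seconds. Treat it as scaffolding to adapt, not as the planned analysis.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_once(true_lift, n_users=50_000, baseline=0.22,
                  min_gain_per_10k=120, n_draws=10_000):
    """One simulated experiment analyzed with a Beta-Binomial stand-in model."""
    treated = rng.binomial(1, 0.5, n_users)
    activated = rng.binomial(1, baseline + treated * true_lift)

    # Posterior draws for each arm's activation rate under a flat Beta(1, 1) prior.
    draws = {}
    for arm, mask in (("A", treated == 0), ("B", treated == 1)):
        successes, n = activated[mask].sum(), mask.sum()
        draws[arm] = rng.beta(1 + successes, 1 + n - successes, n_draws)

    expected_gain_per_10k = 10_000 * (draws["B"] - draws["A"]).mean()
    return "ship" if expected_gain_per_10k >= min_gain_per_10k else "hold"

# How often does each decision occur under no effect, a real lift, and harm?
for true_lift in (0.0, 0.005, -0.005):
    decisions = [simulate_once(true_lift) for _ in range(200)]
    print(true_lift, {d: decisions.count(d) for d in sorted(set(decisions))})
```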
Step 11: Draft the One-Page Decision Memo (While the Experiment Runs)
The memo is not a report; it is a decision artifact. Write it early with placeholders so you are forced to clarify what matters before results arrive. Keep it to one page by design: if something does not fit, it is either not essential or needs a better chart.
One-page memo structure (recommended)
- Header: Decision, owner, date, experiment ID, links to dashboard and pre-registration.
- Context (2–3 sentences): Why this decision matters now; what constraint or opportunity triggered it.
- Recommendation (1 sentence): Ship / iterate / do not ship / roll out partially.
- Key numbers (3–6 bullets): Decision-relevant posterior quantities (not a wall of metrics).
- Risks and mitigations (3 bullets): What could go wrong if we act on this; what we will monitor post-decision.
- Appendix mini-chart: One compact plot or table: effect distribution, expected value vs loss, or segment shrinkage summary.
Use language that a stakeholder can act on: “If we ship, we expect about X more activated users per 10,000, with Y% chance the gain is below our minimum threshold.” Avoid technical detours; keep modeling details in a link.
Step 12: Decide What to Show (and What Not to Show)
A one-page memo forces prioritization. Choose visuals that answer the decision question directly. Good options include: a single distribution plot of incremental value, a table of expected value and downside risk, or a small segment chart with partial pooling estimates. Avoid multi-panel dashboards and avoid listing every metric “just in case.”
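If you choose the single distribution plot, a compact matplotlib sketch is enough; lift_draws is assumed to come from your fitted activation model, and the threshold matches the decision criteria.

```python
import matplotlib.pyplot as plt
import numpy as np

def incremental_value_chart(lift_draws, min_gain_per_10k=120, path="memo_chart.png"):
    """One compact chart: posterior of incremental activations per 10,000 users."""
    gain_per_10k = 10_000 * np.asarray(lift_draws)
    plt.figure(figsize=(5, 3))
    plt.hist(gain_per_10k, bins=50, color="steelblue")
    plt.axvline(min_gain_per_10k, color="firebrick", linestyle="--",
                label=f"ship threshold (+{min_gain_per_10k} per 10k)")
    plt.xlabel("Incremental activated users per 10,000")
    plt.yticks([])  # the shape and threshold matter, not the raw counts
    plt.legend(frameon=False)
    plt.tight_layout()
    plt.savefig(path, dpi=200)
```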
Example “Key numbers” bullets for onboarding
- Expected impact: +135 activated users per 10,000 new users (posterior mean).
- Chance of meeting minimum upside: 78% probability of at least +120 per 10,000.
- Downside risk: 6% probability activation decreases; 9% probability retention drops by more than 0.3 pp.
- Segment note: Mobile shows higher upside; desktop is near neutral with wide uncertainty.
- Operational note: No guardrail breaches; crash rate unchanged within monitoring tolerance.
These bullets are decision-ready: they map to thresholds and acknowledge uncertainty without drowning the reader.
Step 13: Add an Execution Plan for After the Decision
Stakeholders often ask, “If we ship, what happens next?” Put the answer in the memo as a short operational plan. This reduces fear of uncertainty because it shows you have a monitoring and rollback strategy.
Post-decision plan examples
- If ship: Ramp from 50% to 100% over 3 days; monitor activation, retention, and support contacts daily; rollback if retention harm probability exceeds threshold using updated data.
- If partial rollout: Ship to mobile only; run follow-up for desktop with targeted UX changes.
- If do not ship: Document learnings; queue next iteration; keep instrumentation improvements.
Step 14: Capstone Rubric (Self-Check Before Submission)
Use this rubric to verify your experiment plan and memo are complete and coherent.
Experiment plan rubric
- Decision statement is explicit and time-bounded.
- Primary metric and guardrails are defined with units, windows, and rules.
- Estimand is written clearly (population, treatment, outcome, summary).
- Threats to validity are listed with mitigations.
- Model choices match metric types and include covariate/segment plan.
- Decision criteria include thresholds and practical equivalence bands.
- Mechanics include randomization, ramp, monitoring, and governance.
- Information targets include min/max runtime and what happens if unclear.
- Pre-registration is specified and stored in a known location.
- Simulation-first validation is planned (or already executed) with pass/fail checks.
One-page memo rubric
- Recommendation is one sentence and matches the decision criteria.
- Key numbers are decision-relevant and limited in count.
- Risks are concrete, with mitigations and monitoring plan.
- One compact visual supports the recommendation.
- Links exist to the full analysis, dashboard, and pre-registration.
Capstone Templates You Can Copy
Experiment plan skeleton
Decision: [Ship / iterate / abandon / partial rollout] for [feature] by [date]
Primary metric: [definition, unit, window]
Guardrails: [definitions]
Population & eligibility: [rules]
Randomization: [unit, sticky assignment, allocation]
Ramp plan: [10/90 -> 50/50 -> 100% with triggers]
Data quality checks: [SRM, logging, bots, exclusions]
Model plan: [per metric, covariates, segments/hierarchy]
Decision criteria: [thresholds, equivalence bands, risk limits]
Information targets: [min duration, max duration, precision goal]
Pre-registration location: [link/path]
Simulation validation: [scenarios, pass/fail expectations]
Owners: [engineering, analytics, product]
Governance: [who can pause/stop/ship]
One-page decision memo skeleton
Decision memo: [Experiment name / ID]
Decision owner: [name]
Date: [date]
Decision: [what we are deciding]
Recommendation: [Ship / iterate / do not ship / partial rollout]
Why now: [2-3 sentences]
Key numbers:
- Expected value: [...]
- Probability of meeting threshold: [...]
- Downside risk: [...]
- Guardrails: [...]
- Segment note (if relevant): [...]
Risks & mitigations:
- [...]
- [...]
- [...]
Next steps (post-decision):
- [...]
- [...]
Links:
Dashboard: [...]
Pre-registration: [...]
Full analysis: [...]