Compute base measures at the correct grain
Most KPI mistakes come from calculating on the wrong “grain” (the level of detail of the rows you are aggregating). Before writing any KPI query, decide the grain you want for the base measures (for example: per order, per user-day, per session, per product-day). Then aggregate from that grain upward (daily, weekly, monthly, by channel, etc.).
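The grain decision can be made concrete with a quick sketch. Assuming a hypothetical orders table with one row per order (order grain), the same base measure rolls up to daily or weekly reporting grains; note that DATE_TRUNC syntax varies by SQL dialect:

```sql
-- Sketch, assuming an orders table at the order grain (one row per order).
-- Daily roll-up:
SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY 1;

-- Weekly roll-up of the same base measure (DATE_TRUNC is dialect-dependent):
SELECT DATE_TRUNC('week', created_at) AS week_start, COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY 1;
```

The base measure ("completed orders") is defined once at the order grain; only the time bucket changes between the two queries.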
Base measures you will reuse
- Revenue: usually the sum of paid amounts (often from payments) or net item revenue (from order items). Define whether it is gross, net, includes tax/shipping, includes refunds, etc.
- Orders: usually count of distinct orders that meet a status rule (e.g., completed/paid).
- Active users: usually count of distinct users with at least one qualifying event in a time window (e.g., login, session, purchase).
COUNT vs COUNT(DISTINCT): what you are really counting
COUNT(*) counts rows. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique values (excluding NULL). The key question is: are your rows already at the entity grain you want to count?
If your dataset is at the order grain (one row per order), then COUNT(*) is “number of orders.” If your dataset is at the order-item grain (multiple rows per order), then COUNT(*) is “number of order items,” and you need COUNT(DISTINCT order_id) to count orders.
-- Example: order_items has multiple rows per order_id (one per item line).
-- Count orders vs. item lines per day.
SELECT
  CAST(created_at AS DATE) AS order_date,
  COUNT(*) AS item_lines,
  COUNT(DISTINCT order_id) AS orders
FROM order_items
GROUP BY 1
ORDER BY 1;

When DISTINCT is necessary (and when it hides join problems)
DISTINCT is necessary when your current row grain is more detailed than the entity you’re counting (e.g., counting users from an events table, counting orders from order_items). But COUNT(DISTINCT ...) can also hide duplication caused by joins. If a join accidentally multiplies rows, COUNT(DISTINCT order_id) might still look correct while SUM(revenue) becomes inflated.
A practical rule: if you need both counts and sums, don’t “fix” the query with DISTINCT at the end. Instead, compute base measures in separate subqueries at their natural grain, then join those aggregated results.
-- Risky pattern: DISTINCT makes order counts look fine, but revenue may be duplicated.
SELECT
  CAST(o.created_at AS DATE) AS order_date,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(p.amount) AS revenue
FROM orders o
JOIN payments p ON p.order_id = o.order_id
JOIN order_items oi ON oi.order_id = o.order_id  -- multiplies payment rows if an order has multiple items
WHERE o.status = 'completed'
GROUP BY 1;

-- Safer pattern: aggregate each fact at its own grain, then combine.
WITH orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS order_date, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
),
payments_by_day AS (
  SELECT CAST(paid_at AS DATE) AS pay_date, SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
)
SELECT
  o.order_date,
  o.orders,
  COALESCE(p.revenue, 0) AS revenue
FROM orders_by_day o
LEFT JOIN payments_by_day p ON p.pay_date = o.order_date
ORDER BY 1;

SUM, AVG, MIN, MAX: what they mean at different grains
SUM adds values across rows. It is only correct if each row represents a non-overlapping contribution to the total (for example, one payment row per captured payment, or one order_item row per item line). If joins duplicate rows, SUM inflates immediately.
AVG is the average of row values. It is not the same as “total divided by total” unless each row represents the same unit. For example, AVG(order_total) is average order value only if there is one row per order. If you compute AVG(price) on order_items, you are averaging item prices, not order values.
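To illustrate the difference (assuming hypothetical orders and order_items tables, with an order_total column on orders and price/quantity columns on order_items), these three queries answer different questions:

```sql
-- Average order value (order grain): one row per order, so AVG is safe here.
SELECT AVG(order_total) AS avg_order_value
FROM orders
WHERE status = 'completed';

-- Average item price (order-item grain): this averages item lines, NOT orders.
SELECT AVG(price) AS avg_item_price
FROM order_items;

-- Order value as an explicit "total divided by total", robust to the item grain:
SELECT SUM(price * quantity) * 1.0 / COUNT(DISTINCT order_id) AS aov_from_items
FROM order_items;
```

The second and third queries scan the same table but report different numbers whenever orders contain more than one item line.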
MIN/MAX are useful for QA and sanity checks (earliest/latest date, smallest/largest order, min/max daily revenue). They can also reveal unexpected outliers or data gaps.
-- QA checks: daily revenue plus the range of individual payment amounts.
SELECT
  CAST(paid_at AS DATE) AS pay_date,
  SUM(amount) AS revenue,
  MIN(amount) AS min_payment,
  MAX(amount) AS max_payment
FROM payments
WHERE status = 'captured'
GROUP BY 1
ORDER BY 1;

Validate totals before building KPIs
Before layering derived KPIs, validate that your base measures match known totals (finance dashboards, billing system exports, or a trusted report). Do this validation at multiple levels: overall totals, by day, and by a key dimension (channel, country, product category). If totals don’t match, fix the grain/join/filter logic first.
Step-by-step validation workflow
- Step 1: Validate row counts at the base table (e.g., number of captured payments, number of completed orders).
- Step 2: Validate totals without joins (e.g., total captured revenue from payments only).
- Step 3: Validate totals by time (daily/weekly sums should add up to the overall total).
- Step 4: Add one join at a time and re-check totals to detect where duplication appears.
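Steps 1 and 4 might look like the following sketch (table names, status values, and the join key are assumptions about the schema used throughout this section):

```sql
-- Step 1: row counts at the base tables, no joins.
SELECT COUNT(*) AS captured_payments FROM payments WHERE status = 'captured';
SELECT COUNT(*) AS completed_orders  FROM orders   WHERE status = 'completed';

-- Step 4: after adding a single join, re-check the total.
-- If this differs from the no-join revenue total, the join duplicates or drops rows.
SELECT SUM(p.amount) AS total_revenue_after_join
FROM payments p
JOIN orders o ON o.order_id = p.order_id
WHERE p.status = 'captured';
```

Repeating the step-4 check after each additional join pinpoints exactly which join introduces duplication.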
-- Step 2: total revenue without joins.
SELECT SUM(amount) AS total_revenue
FROM payments
WHERE status = 'captured';

-- Step 3: daily revenue; the sum of these rows should match total_revenue.
SELECT CAST(paid_at AS DATE) AS pay_date, SUM(amount) AS revenue
FROM payments
WHERE status = 'captured'
GROUP BY 1
ORDER BY 1;

Common KPI base measures: revenue, orders, active users
Compute each base measure at a clear grain, then aggregate to the reporting grain (often day). Below are common patterns.
-- Revenue per day (payments grain).
SELECT CAST(paid_at AS DATE) AS dt, SUM(amount) AS revenue
FROM payments
WHERE status = 'captured'
GROUP BY 1;

-- Orders per day (orders grain).
SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY 1;

-- Active users per day (events grain).
SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT user_id) AS active_users
FROM events
WHERE event_name IN ('session_start', 'login')
GROUP BY 1;

Notice how COUNT(DISTINCT user_id) is appropriate for active users because events are typically many-per-user-per-day. If you instead had a user_day_activity table with one row per user per day, you could use COUNT(*) for active users.
Layer derived KPIs with careful numerator/denominator logic
Derived KPIs are ratios or combinations of base measures. The safest approach is: (1) compute each base measure in its own subquery at the correct grain, (2) join those aggregated results on the shared keys (like date), (3) compute the KPI using explicit numerator/denominator logic, including handling divide-by-zero.
AOV (Average Order Value): total revenue / total orders
AOV is a classic example where AVG can be misleading if you’re not at the order grain. The robust definition is:
- Numerator: total revenue in the period
- Denominator: number of orders in the same period
WITH revenue_by_day AS (
  SELECT CAST(paid_at AS DATE) AS dt, SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
),
orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
)
SELECT
  o.dt,
  o.orders,
  COALESCE(r.revenue, 0) AS revenue,
  CASE WHEN o.orders = 0 THEN NULL
       ELSE COALESCE(r.revenue, 0) * 1.0 / o.orders
  END AS aov
FROM orders_by_day o
LEFT JOIN revenue_by_day r ON r.dt = o.dt
ORDER BY 1;

Why not AVG(order_total)? It can be fine if you have a reliable order_total column and exactly one row per order, but it becomes fragile if order totals are computed via joins to items, discounts, shipping lines, refunds, etc. The "sum divided by count" approach makes the numerator and denominator explicit and easier to validate.
Conversion rate: orders / sessions (or purchasers / visitors)
Conversion rate depends on your definition of “opportunity” (sessions, visitors, product views) and your definition of “conversion” (orders, paid orders, first purchase). The key is aligning the time window and grain for numerator and denominator.
- Numerator: orders (or purchasers) in the period
- Denominator: sessions (or visitors) in the same period
WITH sessions_by_day AS (
  SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT session_id) AS sessions
  FROM events
  WHERE event_name = 'session_start'
  GROUP BY 1
),
orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
)
SELECT
  s.dt,
  s.sessions,
  COALESCE(o.orders, 0) AS orders,
  CASE WHEN s.sessions = 0 THEN NULL
       ELSE COALESCE(o.orders, 0) * 1.0 / s.sessions
  END AS conversion_rate
FROM sessions_by_day s
LEFT JOIN orders_by_day o ON o.dt = s.dt
ORDER BY 1;

If you instead join events directly to orders at the row level, you can easily multiply orders by the number of events per session/user. Aggregating first prevents that.
ARPU / Revenue per active user: revenue / active users
This KPI is sensitive to definitions. “Active users” must be defined consistently (which events count, and whether bots/internal users are excluded). Then compute:
- Numerator: revenue in the period
- Denominator: distinct active users in the same period
WITH revenue_by_day AS (
  SELECT CAST(paid_at AS DATE) AS dt, SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
),
active_users_by_day AS (
  SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT user_id) AS active_users
  FROM events
  WHERE event_name IN ('session_start', 'login')
  GROUP BY 1
)
SELECT
  a.dt,
  a.active_users,
  COALESCE(r.revenue, 0) AS revenue,
  CASE WHEN a.active_users = 0 THEN NULL
       ELSE COALESCE(r.revenue, 0) * 1.0 / a.active_users
  END AS revenue_per_active_user
FROM active_users_by_day a
LEFT JOIN revenue_by_day r ON r.dt = a.dt
ORDER BY 1;

Using MIN/MAX to QA KPI components
When a KPI looks wrong, inspect the ranges of its components. For example, if conversion rate spikes to 200%, either sessions are undercounted, orders are duplicated, or the time alignment is inconsistent.
-- QA: find days with extreme values to investigate.
WITH sessions_by_day AS (
  SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT session_id) AS sessions
  FROM events
  WHERE event_name = 'session_start'
  GROUP BY 1
),
orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
),
kpi AS (
  SELECT
    s.dt,
    s.sessions,
    COALESCE(o.orders, 0) AS orders,
    CASE WHEN s.sessions = 0 THEN NULL
         ELSE COALESCE(o.orders, 0) * 1.0 / s.sessions
    END AS conversion_rate
  FROM sessions_by_day s
  LEFT JOIN orders_by_day o ON o.dt = s.dt
)
SELECT
  MIN(conversion_rate) AS min_cr,
  MAX(conversion_rate) AS max_cr,
  MIN(sessions) AS min_sessions,
  MAX(sessions) AS max_sessions,
  MIN(orders) AS min_orders,
  MAX(orders) AS max_orders
FROM kpi;

Practical patterns to keep KPIs correct
Pattern 1: “Aggregate then join” to avoid duplicated sums
If you need multiple facts (payments, orders, sessions, users), aggregate each to the reporting grain first, then join those small result sets. This prevents row multiplication from inflating SUM and makes QA easier.
Pattern 2: Be explicit about KPI definitions and filters
Two queries can both be “correct SQL” but represent different business definitions. Make filters explicit in the base measure CTEs (e.g., paid vs authorized payments; completed vs created orders; which events define activity). Keep numerator and denominator definitions close to the KPI calculation so reviewers can validate logic quickly.
Pattern 3: Use DISTINCT intentionally, not as a patch
- Use COUNT(DISTINCT ...) when the table grain is more detailed than the entity you need to count.
- Avoid using DISTINCT to compensate for join duplication when you are also summing money; fix the join or aggregate before joining.
- If you must use DISTINCT, validate both counts and sums against a no-join baseline to ensure you didn't mask a problem.
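A no-join baseline check for that last point might look like this sketch (table names, status values, and the join key are assumptions consistent with the examples in this section):

```sql
-- Baseline from the base tables, no joins:
SELECT COUNT(*)   AS orders_baseline  FROM orders   WHERE status = 'completed';
SELECT SUM(amount) AS revenue_baseline FROM payments WHERE status = 'captured';

-- Joined query under test: both numbers should match the baselines above.
-- If orders match but revenue does not, the join is duplicating payment rows.
SELECT
  COUNT(DISTINCT o.order_id) AS orders_joined,
  SUM(p.amount)              AS revenue_joined
FROM orders o
JOIN payments p ON p.order_id = o.order_id
WHERE o.status = 'completed'
  AND p.status = 'captured';
```

Checking the count and the sum together is the point: DISTINCT can make the count look right while the sum silently drifts.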