SQL for Business Intelligence: Aggregations and KPI Calculations with COUNT, SUM, AVG

Chapter 6

Estimated reading time: 10 minutes

+ Exercise

Compute base measures at the correct grain

Most KPI mistakes come from calculating on the wrong “grain” (the level of detail of the rows you are aggregating). Before writing any KPI query, decide the grain you want for the base measures (for example: per order, per user-day, per session, per product-day). Then aggregate from that grain upward (daily, weekly, monthly, by channel, etc.).

Base measures you will reuse

  • Revenue: usually the sum of paid amounts (often from payments) or net item revenue (from order items). Define whether it is gross, net, includes tax/shipping, includes refunds, etc.
  • Orders: usually count of distinct orders that meet a status rule (e.g., completed/paid).
  • Active users: usually count of distinct users with at least one qualifying event in a time window (e.g., login, session, purchase).

COUNT vs COUNT(DISTINCT): what you are really counting

COUNT(*) counts rows. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique values (excluding NULL). The key question is: are your rows already at the entity grain you want to count?
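
A tiny self-contained example (Postgres-style VALUES list, hypothetical inline data) makes the three behaviors concrete:

```sql
-- Hypothetical inline data: two rows for user 1, one row with a NULL user_id.
SELECT
  COUNT(*)                AS total_rows,      -- 3: counts every row
  COUNT(user_id)          AS non_null_users,  -- 2: NULL values are skipped
  COUNT(DISTINCT user_id) AS distinct_users   -- 1: only user 1; NULL is excluded
FROM (VALUES (1), (1), (NULL)) AS t(user_id);
```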

If your dataset is at the order grain (one row per order), then COUNT(*) is “number of orders.” If your dataset is at the order-item grain (multiple rows per order), then COUNT(*) is “number of order items,” and you need COUNT(DISTINCT order_id) to count orders.

-- Example: order_items has multiple rows per order_id (one per item line).
-- Count orders vs items per day.
SELECT
  CAST(created_at AS DATE) AS order_date,
  COUNT(*) AS item_lines,
  COUNT(DISTINCT order_id) AS orders
FROM order_items
GROUP BY 1
ORDER BY 1;

When DISTINCT is necessary (and when it hides join problems)

DISTINCT is necessary when your current row grain is more detailed than the entity you’re counting (e.g., counting users from an events table, counting orders from order_items). But COUNT(DISTINCT ...) can also hide duplication caused by joins. If a join accidentally multiplies rows, COUNT(DISTINCT order_id) might still look correct while SUM(revenue) becomes inflated.

A practical rule: if you need both counts and sums, don’t “fix” the query with DISTINCT at the end. Instead, compute base measures in separate subqueries at their natural grain, then join those aggregated results.


-- Risky pattern: DISTINCT makes order counts look fine, but revenue may be duplicated.
SELECT
  CAST(o.created_at AS DATE) AS order_date,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(p.amount) AS revenue
FROM orders o
JOIN payments p ON p.order_id = o.order_id
JOIN order_items oi ON oi.order_id = o.order_id  -- multiplies payment rows if multiple items
WHERE o.status = 'completed'
GROUP BY 1;
-- Safer pattern: aggregate each fact at its own grain, then combine.
WITH orders_by_day AS (
  SELECT
    CAST(created_at AS DATE) AS order_date,
    COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
), payments_by_day AS (
  SELECT
    CAST(paid_at AS DATE) AS pay_date,
    SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
)
SELECT
  o.order_date,
  o.orders,
  COALESCE(p.revenue, 0) AS revenue
FROM orders_by_day o
LEFT JOIN payments_by_day p
  ON p.pay_date = o.order_date
ORDER BY 1;

SUM, AVG, MIN, MAX: what they mean at different grains

SUM adds values across rows. It is only correct if each row represents a non-overlapping contribution to the total (for example, one payment row per captured payment, or one order_item row per item line). If joins duplicate rows, SUM inflates immediately.

AVG is the average of row values. It is not the same as a ratio of totals (for example, total revenue divided by total orders) unless each row represents the unit you care about. For example, AVG(order_total) is average order value only if there is one row per order. If you compute AVG(price) on order_items, you are averaging item prices, not order values.
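
A minimal sketch of the difference, using a hypothetical inline item list (order 1 has two item lines, order 2 has one):

```sql
-- AVG over item rows vs. average order value computed explicitly.
SELECT
  AVG(price)                                  AS avg_item_price,  -- (10+20+30)/3 = 20.0
  SUM(price) * 1.0 / COUNT(DISTINCT order_id) AS avg_order_value  -- 60/2 = 30.0
FROM (VALUES (1, 10), (1, 20), (2, 30)) AS oi(order_id, price);
```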

MIN/MAX are useful for QA and sanity checks (earliest/latest date, smallest/largest order, min/max daily revenue). They can also reveal unexpected outliers or data gaps.

-- QA checks: daily revenue plus the smallest and largest payment per day.
SELECT
  CAST(paid_at AS DATE) AS pay_date,
  SUM(amount) AS revenue,
  MIN(amount) AS min_payment,
  MAX(amount) AS max_payment
FROM payments
WHERE status = 'captured'
GROUP BY 1
ORDER BY 1;

Validate totals before building KPIs

Before layering derived KPIs, validate that your base measures match known totals (finance dashboards, billing system exports, or a trusted report). Do this validation at multiple levels: overall totals, by day, and by a key dimension (channel, country, product category). If totals don’t match, fix the grain/join/filter logic first.

Step-by-step validation workflow

  • Step 1: Validate row counts at the base table (e.g., number of captured payments, number of completed orders).
  • Step 2: Validate totals without joins (e.g., total captured revenue from payments only).
  • Step 3: Validate totals by time (daily/weekly sums should add up to the overall total).
  • Step 4: Add one join at a time and re-check totals to detect where duplication appears.
-- Step 2: total revenue without joins.
SELECT SUM(amount) AS total_revenue
FROM payments
WHERE status = 'captured';
-- Step 3: daily revenue; the sum of these rows should match total_revenue.
SELECT
  CAST(paid_at AS DATE) AS pay_date,
  SUM(amount) AS revenue
FROM payments
WHERE status = 'captured'
GROUP BY 1
ORDER BY 1;
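
Step 4 can be sketched like this, reusing the tables above (a sketch, assuming payments join orders one-to-one on order_id):

```sql
-- Step 4: re-check the total after adding a single join.
SELECT SUM(p.amount) AS revenue_after_join
FROM payments p
JOIN orders o ON o.order_id = p.order_id
WHERE p.status = 'captured';
-- If this exceeds the Step 2 total, the new join is multiplying payment rows.
```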

Common KPI base measures: revenue, orders, active users

Compute each base measure at a clear grain, then aggregate to the reporting grain (often day). Below are common patterns.

-- Revenue per day (payments grain).
SELECT
  CAST(paid_at AS DATE) AS dt,
  SUM(amount) AS revenue
FROM payments
WHERE status = 'captured'
GROUP BY 1;

-- Orders per day (orders grain).
SELECT
  CAST(created_at AS DATE) AS dt,
  COUNT(*) AS orders
FROM orders
WHERE status = 'completed'
GROUP BY 1;

-- Active users per day (events grain).
SELECT
  CAST(event_time AS DATE) AS dt,
  COUNT(DISTINCT user_id) AS active_users
FROM events
WHERE event_name IN ('session_start', 'login')
GROUP BY 1;

Notice how COUNT(DISTINCT user_id) is appropriate for active users because events are typically many-per-user-per-day. If you instead had a user_day_activity table with one row per user per day, you could use COUNT(*) for active users.
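
For contrast, a sketch against that hypothetical user_day_activity table (one row per user per day, so the grain already matches the entity being counted):

```sql
-- Grain is already user-day, so COUNT(*) equals the number of active users.
SELECT activity_date AS dt, COUNT(*) AS active_users
FROM user_day_activity
GROUP BY 1;
```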

Layer derived KPIs with careful numerator/denominator logic

Derived KPIs are ratios or combinations of base measures. The safest approach is: (1) compute each base measure in its own subquery at the correct grain, (2) join those aggregated results on the shared keys (like date), (3) compute the KPI using explicit numerator/denominator logic, including handling divide-by-zero.

AOV (Average Order Value): total revenue / total orders

AOV is a classic example where AVG can be misleading if you’re not at the order grain. The robust definition is:

  • Numerator: total revenue in the period
  • Denominator: number of orders in the same period
WITH revenue_by_day AS (
  SELECT CAST(paid_at AS DATE) AS dt, SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
), orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
)
SELECT
  o.dt,
  o.orders,
  COALESCE(r.revenue, 0) AS revenue,
  CASE WHEN o.orders = 0 THEN NULL
       ELSE COALESCE(r.revenue, 0) * 1.0 / o.orders END AS aov
FROM orders_by_day o
LEFT JOIN revenue_by_day r ON r.dt = o.dt
ORDER BY 1;

Why not AVG(order_total)? It can be fine if you have a reliable order_total column and exactly one row per order, but it becomes fragile if order totals are computed via joins to items, discounts, shipping lines, refunds, etc. The “sum divided by count” approach makes the numerator and denominator explicit and easier to validate.

Conversion rate: orders / sessions (or purchasers / visitors)

Conversion rate depends on your definition of “opportunity” (sessions, visitors, product views) and your definition of “conversion” (orders, paid orders, first purchase). The key is aligning the time window and grain for numerator and denominator.

  • Numerator: orders (or purchasers) in the period
  • Denominator: sessions (or visitors) in the same period
WITH sessions_by_day AS (
  SELECT
    CAST(event_time AS DATE) AS dt,
    COUNT(DISTINCT session_id) AS sessions
  FROM events
  WHERE event_name = 'session_start'
  GROUP BY 1
), orders_by_day AS (
  SELECT
    CAST(created_at AS DATE) AS dt,
    COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
)
SELECT
  s.dt,
  s.sessions,
  COALESCE(o.orders, 0) AS orders,
  CASE WHEN s.sessions = 0 THEN NULL
       ELSE COALESCE(o.orders, 0) * 1.0 / s.sessions END AS conversion_rate
FROM sessions_by_day s
LEFT JOIN orders_by_day o ON o.dt = s.dt
ORDER BY 1;

If you instead join events directly to orders at the row level, you can easily multiply orders by the number of events per session/user. Aggregating first prevents that.

ARPU / Revenue per active user: revenue / active users

This KPI is sensitive to definitions. “Active users” must be defined consistently (which events count, and whether bots/internal users are excluded). Then compute:

  • Numerator: revenue in the period
  • Denominator: distinct active users in the same period
WITH revenue_by_day AS (
  SELECT CAST(paid_at AS DATE) AS dt, SUM(amount) AS revenue
  FROM payments
  WHERE status = 'captured'
  GROUP BY 1
), active_users_by_day AS (
  SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT user_id) AS active_users
  FROM events
  WHERE event_name IN ('session_start', 'login')
  GROUP BY 1
)
SELECT
  a.dt,
  a.active_users,
  COALESCE(r.revenue, 0) AS revenue,
  CASE WHEN a.active_users = 0 THEN NULL
       ELSE COALESCE(r.revenue, 0) * 1.0 / a.active_users END AS revenue_per_active_user
FROM active_users_by_day a
LEFT JOIN revenue_by_day r ON r.dt = a.dt
ORDER BY 1;

Using MIN/MAX to QA KPI components

When a KPI looks wrong, inspect the ranges of its components. For example, if conversion rate spikes to 200%, either sessions are undercounted, orders are duplicated, or the time alignment is inconsistent.

-- QA: find days with extreme values to investigate.
WITH sessions_by_day AS (
  SELECT CAST(event_time AS DATE) AS dt, COUNT(DISTINCT session_id) AS sessions
  FROM events
  WHERE event_name = 'session_start'
  GROUP BY 1
), orders_by_day AS (
  SELECT CAST(created_at AS DATE) AS dt, COUNT(*) AS orders
  FROM orders
  WHERE status = 'completed'
  GROUP BY 1
), kpi AS (
  SELECT
    s.dt,
    s.sessions,
    COALESCE(o.orders, 0) AS orders,
    CASE WHEN s.sessions = 0 THEN NULL
         ELSE COALESCE(o.orders, 0) * 1.0 / s.sessions END AS conversion_rate
  FROM sessions_by_day s
  LEFT JOIN orders_by_day o ON o.dt = s.dt
)
SELECT
  MIN(conversion_rate) AS min_cr,
  MAX(conversion_rate) AS max_cr,
  MIN(sessions) AS min_sessions,
  MAX(sessions) AS max_sessions,
  MIN(orders) AS min_orders,
  MAX(orders) AS max_orders
FROM kpi;

Practical patterns to keep KPIs correct

Pattern 1: “Aggregate then join” to avoid duplicated sums

If you need multiple facts (payments, orders, sessions, users), aggregate each to the reporting grain first, then join those small result sets. This prevents row multiplication from inflating SUM and makes QA easier.

Pattern 2: Be explicit about KPI definitions and filters

Two queries can both be “correct SQL” but represent different business definitions. Make filters explicit in the base measure CTEs (e.g., paid vs authorized payments; completed vs created orders; which events define activity). Keep numerator and denominator definitions close to the KPI calculation so reviewers can validate logic quickly.

Pattern 3: Use DISTINCT intentionally, not as a patch

  • Use COUNT(DISTINCT ...) when the table grain is more detailed than the entity you need to count.
  • Avoid using DISTINCT to compensate for join duplication when also summing money; fix the join or aggregate before joining.
  • If you must use DISTINCT, validate both counts and sums against a no-join baseline to ensure you didn’t mask a problem.
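
The last bullet can be sketched as a side-by-side comparison against no-join baselines (a sketch, using the table and column names from earlier in this chapter):

```sql
-- Baselines without joins vs. the same measures after joining order_items.
SELECT
  (SELECT COUNT(*) FROM orders WHERE status = 'completed')     AS orders_baseline,
  (SELECT SUM(amount) FROM payments WHERE status = 'captured') AS revenue_baseline,
  (SELECT COUNT(DISTINCT o.order_id)
   FROM orders o
   JOIN order_items oi ON oi.order_id = o.order_id
   WHERE o.status = 'completed')                               AS orders_joined,   -- may still match...
  (SELECT SUM(p.amount)
   FROM payments p
   JOIN order_items oi ON oi.order_id = p.order_id
   WHERE p.status = 'captured')                                AS revenue_joined;  -- ...while this inflates
```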

Now answer the exercise about the content:

You need a daily report with both number of completed orders and captured revenue, but joining orders, payments, and order_items multiplies rows and inflates SUM(amount). What is the safest approach to keep both counts and sums correct?


Aggregating base measures at the correct grain before joining prevents row multiplication from inflating sums. It also keeps numerator and denominator logic explicit for reliable KPI calculations.

Next chapter

SQL for Business Intelligence: GROUP BY for Reporting Tables and Time Series
