
SQL Joins Demystified: Inner, Left, Right, Full, and Self Joins


Avoiding Duplicate Explosions: One-to-Many Traps, Pre-Aggregation, and De-Duping

Chapter 11

Estimated reading time: 15 minutes


Why “Duplicate Explosions” Happen in Joins

A “duplicate explosion” is what you see when a join multiplies rows far beyond what you expected. It usually happens when you join a table that has one row per entity (like customers or orders) to a table that has multiple related rows per entity (like order_items, payments, events, or tags). The join is logically correct, but the result set is at a more granular level than your report expects. The most common symptom is inflated totals: revenue, counts, or other aggregates become too large because each “parent” row is repeated once per matching “child” row.

Think in terms of grain (level of detail). If orders is at “one row per order” grain and order_items is at “one row per item” grain, then joining them produces “one row per item” grain. That is not wrong; it’s just different. Problems arise when you then compute order-level metrics (like total order count, total revenue per order, unique customer count) while accidentally operating on the item-level result, effectively counting the same order multiple times.
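The grain shift described above can be demonstrated numerically. The following is a minimal sketch using Python's built-in sqlite3 module with made-up sample data; the table and column names (orders, order_items, total_amount) mirror the chapter's examples but the rows are hypothetical.

```python
import sqlite3

# In-memory demo of how a join shifts results to the many-side grain.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_amount REAL);
    CREATE TABLE order_items (order_id INTEGER, quantity INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 1), (1, 2), (1, 1), (2, 3), (2, 1);
""")

orders_rows = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
joined_rows = con.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
""").fetchone()[0]

print(orders_rows)  # 2 -- one row per order
print(joined_rows)  # 5 -- one row per item: the join is at item grain

# Summing an order-level column after the join inflates it:
inflated = con.execute("""
    SELECT SUM(o.total_amount)
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
""").fetchone()[0]
print(inflated)  # 400.0 instead of the true 150.0 (order 1 counted 3x, order 2 counted 2x)
```

Order 1 (3 items) contributes 100.0 three times and order 2 (2 items) contributes 50.0 twice, which is exactly the "inflated totals" symptom described above.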

Common one-to-many traps

  • Joining facts to facts: orders joined to payments (multiple payments per order) and also to shipments (multiple shipments per order) can create a many-to-many multiplication (payments × shipments per order).
  • Joining to “detail” tables for filtering: You join order_items just to filter orders that contain a product category, but you forget that each matching item creates another row.
  • Joining to “latest status” tables incorrectly: You join order_status_history without restricting to one row per order (e.g., latest status), duplicating orders across status changes.
  • Joining to dimension tables with duplicates: A dimension that “should” be unique (like customers) actually contains multiple rows per key due to data quality issues, causing unexpected multiplication.

Recognizing the Grain Before You Join

Before writing the join, state the grain of each input and the grain you want in the output. A practical habit is to write a comment (even if you remove it later): “Output should be one row per X.” Then check whether every join preserves that grain.

Quick diagnostic questions

  • Is the join key unique on the left side? On the right side?
  • Could the right side have multiple matches per left key?
  • Am I selecting columns from the “many” side? If yes, my output grain likely shifts to the many-side grain.
  • Am I aggregating after the join? If yes, am I aggregating at the intended grain?

When you suspect duplication, validate it with a targeted query that counts matches per key.

-- How many items per order? (detect one-to-many magnitude quickly)
SELECT
  oi.order_id,
  COUNT(*) AS item_rows
FROM order_items oi
GROUP BY oi.order_id
HAVING COUNT(*) > 1
ORDER BY item_rows DESC;

If your report expects one row per order, this result tells you that joining orders to order_items will repeat orders.


Inflated Aggregates: The Classic Symptom

Suppose you want total revenue by customer from orders, but you join items and then sum order totals. If orders.total_amount is stored at the order level, summing it after joining to items multiplies it by the number of items.

-- WRONG if orders.total_amount is an order-level value
SELECT
  o.customer_id,
  SUM(o.total_amount) AS revenue
FROM orders o
JOIN order_items oi
  ON oi.order_id = o.order_id
GROUP BY o.customer_id;

Each order row is repeated once per item, so SUM(o.total_amount) is inflated. The fix depends on your goal: either aggregate items to the order level first, or compute revenue from item prices (which matches the item-level grain).

Fix option A: Pre-aggregate the many-side to the parent grain

If you need item-derived metrics (like item subtotal) but want one row per order, aggregate order_items to one row per order before joining.

WITH item_totals AS (
  SELECT
    oi.order_id,
    SUM(oi.quantity * oi.unit_price) AS item_subtotal,
    COUNT(*) AS item_count
  FROM order_items oi
  GROUP BY oi.order_id
)
SELECT
  o.order_id,
  o.customer_id,
  o.total_amount,
  it.item_subtotal,
  it.item_count
FROM orders o
LEFT JOIN item_totals it
  ON it.order_id = o.order_id;

Now the join is one-to-one at the order grain: one row per order in orders, one row per order in item_totals.

Fix option B: Aggregate at the final grain using distinct keys carefully

Sometimes you only need counts of orders after joining for filtering. You can count distinct order IDs instead of counting rows.

-- If joining items only to filter orders, count distinct orders
SELECT
  o.customer_id,
  COUNT(DISTINCT o.order_id) AS order_count
FROM orders o
JOIN order_items oi
  ON oi.order_id = o.order_id
WHERE oi.product_category = 'Books'
GROUP BY o.customer_id;

Important: COUNT(DISTINCT ...) can be correct, but it can also become expensive on large datasets. Also, it does not fix inflated sums of order-level amounts; SUM(DISTINCT o.total_amount) is usually not safe because different orders can share the same total amount, causing undercounting.
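The undercounting risk of SUM(DISTINCT ...) is easy to reproduce. Here is a minimal sketch with Python's sqlite3 module and hypothetical data in which two different orders happen to share the same total_amount:

```python
import sqlite3

# Demo: SUM(DISTINCT ...) silently undercounts when two different orders
# share the same amount.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_amount REAL);
    CREATE TABLE order_items (order_id INTEGER, product TEXT);
    INSERT INTO orders VALUES (1, 50.0), (2, 50.0);  -- same total on purpose
    INSERT INTO order_items VALUES (1, 'a'), (1, 'b'), (2, 'c'), (2, 'd');
""")

q = """
    SELECT {expr}
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
"""
inflated = con.execute(q.format(expr="SUM(o.total_amount)")).fetchone()[0]
undercounted = con.execute(q.format(expr="SUM(DISTINCT o.total_amount)")).fetchone()[0]

print(inflated)      # 200.0 -- each order repeated once per item
print(undercounted)  # 50.0  -- the two identical 50.0 totals collapse to one
# The true revenue is 100.0; neither expression recovers it after the join.
```

Because the true total is 100.0, the plain SUM overcounts and SUM(DISTINCT) undercounts; only pre-aggregation (or summing at the item grain) gives the right answer.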

Step-by-Step Pattern: Pre-Aggregation to Prevent Multiplication

Pre-aggregation is the most reliable way to avoid duplicate explosions when you need parent-level outputs. Use this step-by-step approach.

Step 1: Define the output grain

Example: “One row per customer per month.” Write it down, then ensure every join produces at most one matching row per customer-month.

Step 2: Identify many-side tables and decide what you need from them

For each many-side table, decide whether you need:

  • A single value (e.g., latest status)
  • Aggregates (e.g., total payments, count of items)
  • A boolean existence flag (e.g., “has a refund”)

Each of these can be reduced to one row per parent key before joining.

Step 3: Aggregate or select one row per key in a CTE/subquery

Example: total payments per order (multiple payments per order).

WITH payments_by_order AS (
  SELECT
    p.order_id,
    SUM(p.amount) AS paid_amount,
    MAX(p.paid_at) AS last_paid_at
  FROM payments p
  WHERE p.status = 'captured'
  GROUP BY p.order_id
)
SELECT
  o.order_id,
  o.total_amount,
  pbo.paid_amount,
  pbo.last_paid_at
FROM orders o
LEFT JOIN payments_by_order pbo
  ON pbo.order_id = o.order_id;

This prevents each payment row from duplicating the order row.

Step 4: Join the reduced result to the parent

After reduction, the join should be one-to-one (or one-to-zero/one). If it’s still one-to-many, you missed a grouping key or the data is not unique as expected.

Step 5: Validate uniqueness after the join

A simple validation is to count rows and compare to the parent table count for the same filter.

-- Validate: joined rows should equal orders rows (for the same WHERE filters)
WITH payments_by_order AS (
  SELECT p.order_id, SUM(p.amount) AS paid_amount
  FROM payments p
  WHERE p.status = 'captured'
  GROUP BY p.order_id
)
SELECT
  (SELECT COUNT(*) FROM orders) AS orders_rows,
  (SELECT COUNT(*)
   FROM orders o
   LEFT JOIN payments_by_order pbo ON pbo.order_id = o.order_id) AS joined_rows;

If joined_rows is larger than orders_rows, the “reduced” side is still not one row per order.

De-Duping Strategies (and When to Use Them)

Sometimes you cannot pre-aggregate because you need detail columns, but you still want to avoid duplicates for a particular metric. In those cases, you can de-dup at query time. The key is to be explicit about which row you keep and why.

Strategy 1: Choose one row per key with window functions

Common need: “latest record per entity” (latest status, latest address, most recent event). Use ROW_NUMBER() to pick exactly one row per key.

WITH latest_status AS (
  SELECT
    osh.order_id,
    osh.status,
    osh.changed_at,
    ROW_NUMBER() OVER (
      PARTITION BY osh.order_id
      ORDER BY osh.changed_at DESC
    ) AS rn
  FROM order_status_history osh
)
SELECT
  o.order_id,
  ls.status AS latest_status,
  ls.changed_at AS status_changed_at
FROM orders o
LEFT JOIN latest_status ls
  ON ls.order_id = o.order_id AND ls.rn = 1;

This turns a one-to-many history table into one row per order (the latest row).

Strategy 2: Use existence checks instead of joins when you only need a flag

If you only need to know whether a related row exists, joining can multiply rows unnecessarily. An EXISTS subquery avoids duplication because it doesn’t return multiple matches; it returns true/false.

SELECT
  o.order_id,
  o.customer_id,
  CASE WHEN EXISTS (
    SELECT 1
    FROM refunds r
    WHERE r.order_id = o.order_id
      AND r.status = 'approved'
  ) THEN 1 ELSE 0 END AS has_approved_refund
FROM orders o;

This is often simpler and safer than joining refunds and then trying to de-dup.

Strategy 3: De-dup with DISTINCT (use sparingly and knowingly)

SELECT DISTINCT can hide duplication, but it can also hide real data issues and can be expensive. It is safest when:

  • You are selecting only parent-level columns (e.g., order_id, customer_id)
  • You intentionally want a set of unique parents after joining for filtering
-- Getting unique orders that have at least one item in a categorySELECT DISTINCT  o.order_id,  o.customer_idFROM orders oJOIN order_items oi  ON oi.order_id = o.order_idWHERE oi.product_category = 'Books';

Avoid using DISTINCT as a blanket fix when you also select many-side columns; it will not remove duplicates if the many-side columns differ, and it may mask the fact that your query grain is wrong.

The Many-to-Many “Multiplier” Trap (Two One-to-Many Joins)

A particularly dangerous pattern is joining a parent to two different child tables that are each one-to-many. Even if each join individually makes sense, doing both can create a many-to-many explosion at the parent level.

Example: orders joined to payments (multiple per order) and shipments (multiple per order). The result can have one row per (payment × shipment) combination for the same order.

-- DANGEROUS: payments x shipments per order
SELECT
  o.order_id,
  p.payment_id,
  s.shipment_id
FROM orders o
LEFT JOIN payments p
  ON p.order_id = o.order_id
LEFT JOIN shipments s
  ON s.order_id = o.order_id;

If an order has 3 payments and 2 shipments, you get 6 rows for that order. Any order-level measure summed here will be inflated by a factor of 6.
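The 3 × 2 = 6 multiplication can be verified directly. This is a minimal sketch with Python's sqlite3 module; the single order, three payments, and two shipments are hypothetical sample data:

```python
import sqlite3

# Demo: two independent one-to-many joins multiply into many-to-many.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, total_amount REAL);
    CREATE TABLE payments (order_id INTEGER, payment_id INTEGER);
    CREATE TABLE shipments (order_id INTEGER, shipment_id INTEGER);
    INSERT INTO orders VALUES (1, 90.0);
    INSERT INTO payments VALUES (1, 10), (1, 11), (1, 12);  -- 3 payments
    INSERT INTO shipments VALUES (1, 20), (1, 21);          -- 2 shipments
""")

rows = con.execute("""
    SELECT COUNT(*)
    FROM orders o
    LEFT JOIN payments p ON p.order_id = o.order_id
    LEFT JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]
print(rows)  # 6 = 3 payments x 2 shipments for the single order

inflated = con.execute("""
    SELECT SUM(o.total_amount)
    FROM orders o
    LEFT JOIN payments p ON p.order_id = o.order_id
    LEFT JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]
print(inflated)  # 540.0 = 90.0 x 6, inflated by a factor of 6
```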

Fix: Reduce each child table to one row per order before joining

WITH payments_by_order AS (
  SELECT
    p.order_id,
    SUM(p.amount) AS paid_amount,
    MAX(p.paid_at) AS last_paid_at
  FROM payments p
  WHERE p.status = 'captured'
  GROUP BY p.order_id
),
shipments_by_order AS (
  SELECT
    s.order_id,
    COUNT(*) AS shipment_count,
    MAX(s.shipped_at) AS last_shipped_at
  FROM shipments s
  WHERE s.status IN ('shipped', 'delivered')
  GROUP BY s.order_id
)
SELECT
  o.order_id,
  o.total_amount,
  pbo.paid_amount,
  pbo.last_paid_at,
  sbo.shipment_count,
  sbo.last_shipped_at
FROM orders o
LEFT JOIN payments_by_order pbo
  ON pbo.order_id = o.order_id
LEFT JOIN shipments_by_order sbo
  ON sbo.order_id = o.order_id;

This preserves one row per order while still incorporating metrics from both child tables.

Filtering Without Multiplying: Semi-Joins and Pre-Filtered Keys

Another common trap: you join a detail table only to filter the parent, but the join multiplies rows. Two robust patterns avoid this.

Pattern A: EXISTS (semi-join)

SELECT
  o.order_id,
  o.customer_id,
  o.created_at
FROM orders o
WHERE EXISTS (
  SELECT 1
  FROM order_items oi
  WHERE oi.order_id = o.order_id
    AND oi.product_category = 'Books'
);

This returns each order once, regardless of how many matching items it has.
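The contrast between the join and the semi-join can be checked with a few rows. A minimal sketch using Python's sqlite3 module and hypothetical data, where one order has two matching 'Books' items:

```python
import sqlite3

# Demo: a filtering join repeats the parent per match; EXISTS does not.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER);
    CREATE TABLE order_items (order_id INTEGER, product_category TEXT);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO order_items VALUES
        (1, 'Books'), (1, 'Books'), (1, 'Toys'),  -- order 1: two Books rows
        (2, 'Toys');
""")

join_rows = con.execute("""
    SELECT o.order_id
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    WHERE oi.product_category = 'Books'
""").fetchall()
print(join_rows)  # [(1,), (1,)] -- order 1 duplicated once per matching item

exists_rows = con.execute("""
    SELECT o.order_id
    FROM orders o
    WHERE EXISTS (
        SELECT 1
        FROM order_items oi
        WHERE oi.order_id = o.order_id
          AND oi.product_category = 'Books'
    )
""").fetchall()
print(exists_rows)  # [(1,)] -- each qualifying order exactly once
```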

Pattern B: Join to a de-duplicated key list

If you prefer joins (or need to reuse the filtered set), build a distinct list of parent keys first.

WITH book_orders AS (
  SELECT DISTINCT oi.order_id
  FROM order_items oi
  WHERE oi.product_category = 'Books'
)
SELECT
  o.order_id,
  o.customer_id,
  o.created_at
FROM orders o
JOIN book_orders bo
  ON bo.order_id = o.order_id;

This keeps the join at one-to-one (orders to unique order_id list).

Debugging Duplicate Explosions: A Practical Checklist

1) Measure row counts after each join

Build your query incrementally and check counts at each stage. If you see a sudden jump, the last join introduced multiplication.

-- Start with a base set
WITH base AS (
  SELECT o.order_id
  FROM orders o
  WHERE o.created_at >= DATE '2025-01-01'
)
SELECT COUNT(*) AS base_rows FROM base;

-- Add a join and re-check
WITH base AS (
  SELECT o.order_id
  FROM orders o
  WHERE o.created_at >= DATE '2025-01-01'
)
SELECT COUNT(*) AS rows_after_join
FROM base b
JOIN order_items oi ON oi.order_id = b.order_id;

2) Compute “matches per key” on the joined side

When a join multiplies, quantify it.

SELECT
  oi.order_id,
  COUNT(*) AS matches
FROM order_items oi
GROUP BY oi.order_id
ORDER BY matches DESC;

3) Verify uniqueness assumptions explicitly

If you believe a table should have one row per key, test it.

-- Detect duplicate keys in a dimension table
SELECT
  c.customer_id,
  COUNT(*) AS rows_per_customer
FROM customers c
GROUP BY c.customer_id
HAVING COUNT(*) > 1;

If this returns rows, any join to customers will multiply results for those customers until the data is fixed or you choose a de-dup rule (e.g., latest record).

Choosing the Right Fix: A Decision Guide

  • You want parent-level output and child-derived metrics: pre-aggregate child to parent key(s) and join.
  • You want parent-level output and only need to know if child exists: use EXISTS or join to a distinct key list.
  • You want one representative child row (latest, highest priority): use window functions (ROW_NUMBER) and filter to one row per parent.
  • You truly want detail-level output: accept the multiplication, but ensure your aggregates match the detail grain (sum item amounts, not order totals).
  • You see unexpected multiplication even when you expect uniqueness: test for duplicate keys and fix data or de-dup deterministically.

Worked Example: Building a Safe Customer Revenue Report

Goal: one row per customer with (a) number of orders, (b) total item revenue, (c) total captured payments, and (d) whether the customer ever had an approved refund. The trap is that orders have many items, many payments, and potentially many refunds. If you join all detail tables directly, you risk many-to-many multiplication.

Step 1: Build order-level aggregates from each detail table

WITH item_rev_by_order AS (
  SELECT
    oi.order_id,
    SUM(oi.quantity * oi.unit_price) AS item_revenue
  FROM order_items oi
  GROUP BY oi.order_id
),
payments_by_order AS (
  SELECT
    p.order_id,
    SUM(p.amount) AS captured_amount
  FROM payments p
  WHERE p.status = 'captured'
  GROUP BY p.order_id
),
refund_flag_by_order AS (
  SELECT
    r.order_id,
    1 AS has_approved_refund
  FROM refunds r
  WHERE r.status = 'approved'
  GROUP BY r.order_id
)

Each CTE returns at most one row per order.

Step 2: Join aggregates to orders (still one row per order)

WITH item_rev_by_order AS (
  SELECT oi.order_id, SUM(oi.quantity * oi.unit_price) AS item_revenue
  FROM order_items oi
  GROUP BY oi.order_id
),
payments_by_order AS (
  SELECT p.order_id, SUM(p.amount) AS captured_amount
  FROM payments p
  WHERE p.status = 'captured'
  GROUP BY p.order_id
),
refund_flag_by_order AS (
  SELECT r.order_id, 1 AS has_approved_refund
  FROM refunds r
  WHERE r.status = 'approved'
  GROUP BY r.order_id
),
orders_enriched AS (
  SELECT
    o.order_id,
    o.customer_id,
    COALESCE(iro.item_revenue, 0) AS item_revenue,
    COALESCE(pbo.captured_amount, 0) AS captured_amount,
    COALESCE(rfo.has_approved_refund, 0) AS has_approved_refund
  FROM orders o
  LEFT JOIN item_rev_by_order iro ON iro.order_id = o.order_id
  LEFT JOIN payments_by_order pbo ON pbo.order_id = o.order_id
  LEFT JOIN refund_flag_by_order rfo ON rfo.order_id = o.order_id
)
SELECT
  customer_id,
  COUNT(*) AS orders_count,
  SUM(item_revenue) AS total_item_revenue,
  SUM(captured_amount) AS total_captured_amount,
  MAX(has_approved_refund) AS ever_approved_refund
FROM orders_enriched
GROUP BY customer_id;

Notice the final aggregation: since orders_enriched is one row per order, counting orders is safe, summing revenues is safe, and using MAX for a boolean flag is safe.

Practical Notes: Performance and Correctness Tradeoffs

Pre-aggregation often improves performance because it reduces row counts before joins, but it can also add compute for grouping. In practice, grouping a large detail table by a well-indexed foreign key is usually cheaper than joining the full detail table and then trying to recover correctness with DISTINCT or complex aggregates.

When using window functions for de-duplication, ensure your ordering is deterministic. If two rows tie on the ordering column (e.g., same timestamp), add a secondary tie-breaker (like an ID) to avoid non-deterministic “latest row” selection.

ROW_NUMBER() OVER (
  PARTITION BY osh.order_id
  ORDER BY osh.changed_at DESC, osh.status_history_id DESC
) AS rn

Finally, treat unexpected duplication as a signal: either the query grain is different from what you intended, or the data violates uniqueness assumptions. Both are fixable, but only if you detect them early and choose a deliberate strategy (pre-aggregate, existence checks, or deterministic de-duplication) rather than relying on accidental cancellation.

Now answer the exercise about the content:

You need one row per order, but you must include metrics from order_items and payments, both of which can have multiple rows per order. What is the safest approach to avoid duplicated rows and inflated aggregates?


Reducing each many-side table to one row per order before joining preserves the intended grain and prevents many-to-many multiplication. DISTINCT can hide issues, and SUM(DISTINCT ...) can undercount when different orders share the same total.

Next chapter

Progressive Scenarios: Translating Business Questions into Correct Join Patterns
