SQL for Business Intelligence: Joining Tables Without Duplicating or Losing Rows

Chapter 5

Estimated reading time: 8 minutes


Why JOINs Break BI Metrics

In BI, JOINs are used to combine business entities (for example, customers and orders) so you can compute metrics and slice them by attributes. Most incorrect dashboards come from JOINs that either (a) drop rows you expected to keep or (b) multiply rows and inflate metrics. The root cause is almost always a mismatch between the join type (INNER vs LEFT) and the “grain” (the level of detail) of each table.

Two failure modes to watch

  • Losing rows: using an INNER JOIN when you actually need to keep all rows from a base table (for example, customers with no orders).
  • Duplicating rows (metric inflation): joining a table at a more detailed grain than your base table (for example, joining order-level data to order-line-level data without aggregating first).

INNER JOIN vs LEFT JOIN in Business Scenarios

Scenario A: Orders with customer attributes (keep only orders that have a valid customer)

If your analysis is “order revenue by customer segment,” your base grain is typically orders. You want each order, enriched with customer attributes. If every order must have a customer_id, an INNER JOIN is appropriate because you only want orders that match a customer record.

SELECT
  o.order_id,
  o.customer_id,
  o.order_date,
  o.total_amount,
  c.segment,
  c.country
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;

BI implication: If there are orders with missing/invalid customer_id, they will be dropped. That might be correct (data quality issue) or it might hide revenue. Decide intentionally.
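A quick way to see what you would lose (a sketch using the same orders and customers columns as above) is to flip the join to a LEFT JOIN and look for unmatched customers:

-- Orders that an INNER JOIN would silently drop (no matching customer)
SELECT
  o.order_id,
  o.customer_id,
  o.total_amount
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

If this returns rows, quantify the revenue involved before deciding to exclude it.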

Scenario B: Customers without orders (keep all customers, even if they have zero orders)

If your analysis is “how many customers have never ordered?” your base grain is customers. You must keep every customer row, and optionally attach matching orders. That requires a LEFT JOIN from customers to orders.

SELECT
  c.customer_id,
  c.created_date,
  o.order_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id;

To find customers with no orders, look for NULLs from the right table.


SELECT
  c.customer_id
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

BI implication: LEFT JOIN preserves the customer list. But if a customer has multiple orders, the customer row will repeat once per order. That is not “wrong” by itself; it becomes wrong when you aggregate customer-level metrics without accounting for the repetition.
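One way to account for the repetition, sketched with the same tables, is to aggregate orders to customer grain before joining, so each customer still appears exactly once:

-- One row per customer, with an order count (zero when the customer has no orders)
WITH orders_per_customer AS (
  SELECT
    customer_id,
    COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  COALESCE(o.order_count, 0) AS order_count
FROM customers c
LEFT JOIN orders_per_customer o
  ON c.customer_id = o.customer_id;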

Grain: The Hidden Rule Behind Correct JOINs

Grain is the level of detail represented by one row in a table. JOINs are safe when the join keys align with the grain you intend to analyze. Problems happen when you join a “one row per X” table to a “many rows per X” table and then aggregate as if nothing changed.

Example: Metric inflation from joining at the wrong grain

Suppose:

  • orders has one row per order_id (order grain).
  • order_items has multiple rows per order_id (order-item grain).

If you join them and then sum an order-level amount, you will multiply the order total by the number of items in the order.

-- WRONG if o.total_amount is an order-level value: it will be repeated per item row
SELECT
  SUM(o.total_amount) AS revenue
FROM orders o
JOIN order_items oi
  ON o.order_id = oi.order_id;

Why it inflates: each order row is duplicated for every matching item row. If an order has 3 items, its total_amount appears 3 times.
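Note that for total order revenue alone, no join is needed; summing at the order grain already gives the correct figure:

-- Correct total revenue at the order grain, no item join involved
SELECT SUM(total_amount) AS revenue
FROM orders;

The fix patterns below matter when you need item-derived metrics alongside that revenue.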

Fix pattern 1: Aggregate the “many” side to the join key before joining

If you need item-derived metrics (like item count) alongside order-level fields, aggregate order_items to one row per order_id first, then join.

WITH items_per_order AS (
  SELECT
    order_id,
    COUNT(*) AS item_count,
    SUM(quantity) AS units
  FROM order_items
  GROUP BY order_id
)
SELECT
  o.order_id,
  o.total_amount,
  i.item_count,
  i.units
FROM orders o
LEFT JOIN items_per_order i
  ON o.order_id = i.order_id;

Now the join is one-to-one at order_id, so order-level metrics remain stable.

Fix pattern 2: If you need item-level analysis, change the base grain

If the question is “revenue by product,” the base grain should be order items (or a fact table at line level). In that case, you should sum line-level amounts, not order totals.

SELECT
  oi.product_id,
  SUM(oi.line_amount) AS revenue
FROM order_items oi
GROUP BY oi.product_id;

Then join dimensions (like products) at matching keys without changing the grain.

SELECT
  p.category,
  SUM(oi.line_amount) AS revenue
FROM order_items oi
JOIN products p
  ON oi.product_id = p.product_id
GROUP BY p.category;

Step-by-Step: A Safe JOIN Workflow for BI

Step 1: Declare your base table and intended grain

Before writing the JOIN, decide what one row in your result should represent (customer, order, order item, day, etc.). Your base table should already be at that grain, or you should aggregate it to that grain first.

Step 2: Join only on keys that match that grain

If your result grain is orders, your join key should typically include order_id when joining facts at order grain. Joining on a broader key (like customer_id) can create many-to-many effects if both sides have multiple rows per customer.

Step 3: Isolate join effects by selecting only keys first

Before selecting lots of columns and metrics, test the join with just the keys. This makes duplication visible.

-- Start with keys only
SELECT
  o.order_id,
  o.customer_id,
  c.customer_id AS matched_customer_id
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;

If you see unexpected NULL matches or repeated order_id values, stop and fix the join logic before adding measures.

Step 4: Test row counts before and after

Row counts are a fast way to detect row loss or duplication.

-- Base row count (orders grain)
SELECT COUNT(*) AS orders_rows
FROM orders;

-- After join: should match orders_rows when customers is unique by customer_id
SELECT COUNT(*) AS joined_rows
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;

If joined_rows is greater than orders_rows, you have duplication (likely a one-to-many or many-to-many join). If it is smaller with an INNER JOIN, you are dropping unmatched rows.
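When joined_rows is larger, a check like this (sketched on the same join) shows which base keys were multiplied:

-- Which order_id values appear more than once after the join?
SELECT
  o.order_id,
  COUNT(*) AS rows_after_join
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id
GROUP BY o.order_id
HAVING COUNT(*) > 1;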

Step 5: Inspect uniqueness on both sides of the join key

For a join to preserve your base grain, the join key must be unique on the table you are joining in (and unique on both sides for a true one-to-one join). Check whether keys repeat.

-- Are customers unique by customer_id?
SELECT
  customer_id,
  COUNT(*) AS cnt
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Are orders unique by order_id?
SELECT
  order_id,
  COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

If a supposed dimension table (like customers) has duplicates on its primary key, joining will multiply rows even when your fact table is clean.

Common BI JOIN Patterns (and What to Watch)

Pattern: Fact to dimension (safe when dimension key is unique)

Typical BI modeling joins a fact table (orders) to dimensions (customers, products). This is safe when the dimension has one row per key.

SELECT
  o.order_id,
  o.total_amount,
  c.segment
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;

Watch: if customers contains multiple rows per customer_id (for example, multiple addresses stored as separate rows), you must choose the correct record (or aggregate) before joining.
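One common fix is to deduplicate the dimension before joining. The sketch below assumes a hypothetical updated_at column for picking the most recent record per customer; adapt the ordering to whatever your table actually stores:

-- Keep one row per customer_id before joining
-- (updated_at is a hypothetical column used to choose the latest record)
WITH customers_dedup AS (
  SELECT
    customer_id,
    segment,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM customers
)
SELECT
  o.order_id,
  o.total_amount,
  c.segment
FROM orders o
LEFT JOIN customers_dedup c
  ON o.customer_id = c.customer_id
 AND c.rn = 1;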

Pattern: Fact to fact (high risk of many-to-many)

Joining two transactional tables often creates many-to-many duplication unless you join at a shared unique identifier and matching grain.

Example risk: joining orders to payments on customer_id can multiply rows because both tables can have many rows per customer.

-- Risky: many orders per customer and many payments per customer
SELECT
  o.order_id,
  p.payment_id
FROM orders o
JOIN payments p
  ON o.customer_id = p.customer_id;

Safer approach: join on a true shared key (like order_id) if payments are tied to orders, or aggregate one side to customer grain before joining if the analysis is customer-level.
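Both options are sketched below, assuming payments has order_id, customer_id, and amount columns; adjust the names to your schema:

-- Option 1: join on the true shared key, keeping the order grain
SELECT
  o.order_id,
  p.payment_id
FROM orders o
LEFT JOIN payments p
  ON o.order_id = p.order_id;

-- Option 2: aggregate payments to customer grain before a customer-level join
WITH payments_per_customer AS (
  SELECT
    customer_id,
    SUM(amount) AS total_paid
  FROM payments
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  pc.total_paid
FROM customers c
LEFT JOIN payments_per_customer pc
  ON c.customer_id = pc.customer_id;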

Checklist: JOIN Without Duplicating or Losing Rows

1) Confirm join keys

  • Write down the intended grain of the result (one row per what?).
  • Ensure the join keys correspond to that grain (for order grain, prefer order_id; for customer grain, prefer customer_id after aggregating facts to customer level).
  • Avoid joining on “convenient” keys that are not unique at the grain you need.

2) Inspect uniqueness on both sides

  • Check whether the key is unique in the dimension table (it should be for typical BI dimensions).
  • Check whether the key is unique in the base table at the chosen grain.
  • If duplicates exist, decide whether to deduplicate, filter to a single record, or aggregate before joining.

3) Test row counts before/after

  • Count rows in the base dataset at the intended grain.
  • Count rows after the join.
  • If rows increase unexpectedly, you likely created duplication; if rows decrease unexpectedly, you likely dropped unmatched rows (often due to INNER JOIN).

4) Isolate join effects by selecting only keys first

  • Start with a query that selects only join keys from both sides.
  • Scan for repeated base keys (duplication) and NULL matches (row loss with INNER JOIN, or missing dimension data with LEFT JOIN).
  • Only after the join behaves as expected, add measures and descriptive columns.

Now answer the exercise about the content:

You need to report total order revenue, but you also want item-derived metrics (like item count) from an order_items table that has multiple rows per order_id. Which approach best avoids inflating revenue?

Answer: aggregate the order_items side to one row per order_id before joining. Joining an order-grain table to an item-grain table duplicates each order row per item and can multiply order-level totals; aggregating the many side first makes the join effectively 1:1 and keeps revenue stable.

Next chapter

SQL for Business Intelligence: Aggregations and KPI Calculations with COUNT, SUM, AVG
