Think in Sets of Rows, Not in Tables
A join is easiest to understand when you stop thinking of it as “combining two tables” and start thinking of it as “producing a new set of rows.” Each input (often called the left and right side) is a set of rows. The join condition tells the database how to decide whether a row from the left side matches a row from the right side. The output is another set of rows whose shape (columns) comes from both inputs, but whose size (row count) depends entirely on how many matches exist.
This mental model helps you predict results before you run a query. When you can predict the output row count and which rows will appear, you can avoid common mistakes like accidental duplication, missing records, or turning an outer join into an inner join by accident.
The join pipeline mental model
Most joins can be reasoned about as a pipeline with three steps:
- Step 1: Choose the left input set (a table, a filtered table, or even a derived subquery).
- Step 2: Choose the right input set (same options).
- Step 3: Apply the join condition to decide which pairs of rows match. Depending on join type, unmatched rows may be kept or discarded.
Notice what is not in the list: “merge rows by key.” That’s a common description, but it hides the most important behavior: a join can create multiple output rows from a single input row whenever there are multiple matches.
Keys as a Matching Contract (Not Just a Column)
Keys are the most common way to express “these rows correspond.” But a key is not merely a column name; it’s a contract about uniqueness and meaning.
Continue in our app.
You can listen to the audiobook with the screen off, receive a free certificate for this course, and also have access to 5,000 other free online courses.
Or continue reading below...Download the app
Three key roles you should distinguish
- Primary key (PK): uniquely identifies a row within a table (e.g.,
customers.customer_id). - Foreign key (FK): points to a row in another table (e.g.,
orders.customer_idreferencescustomers.customer_id). - Join key: whatever columns you actually use in the join condition. Ideally it aligns with PK/FK relationships, but it can be something else (like email, SKU, or a composite of multiple columns).
The mental model: the join key defines the matching rule. If the key is unique on one side and not unique on the other, you should expect duplication. If it is not unique on either side, you should expect a multiplication effect.
Uniqueness determines whether you get 1-to-1, 1-to-many, or many-to-many output
Before writing the query, ask: “For each row on the left, how many rows on the right can match?”
- 1-to-1: each left row matches at most one right row (common when joining PK to PK, or PK to a unique key).
- 1-to-many: each left row can match multiple right rows (common when joining a PK to an FK).
- many-to-many: each left row can match multiple right rows and vice versa (often indicates you need a bridge table or you’re joining on a non-unique attribute).
This is the single most useful predictor of output row counts.
Matches Are Pairs (and Pairs Become Rows)
A join does not “attach” one row to another; it forms pairs of rows that satisfy the join condition. Each pair becomes one output row.
Imagine writing down every possible pair of left-row and right-row (conceptually). The join condition filters those pairs to “matching pairs.” The join type then decides what to do with left rows or right rows that never formed a matching pair.
Concrete example schema
Use a small, realistic dataset to build intuition:
customers(customer_id, name) -- customer_id is unique (PK)orders(order_id, customer_id, order_total) -- many orders per customerSample data:
customers: orders: (order_id, customer_id, order_total) 1, 10, 50.00 2, 10, 25.00 3, 20, 40.00 4, 99, 15.00 -- customer_id 99 doesn't exist in customers(customer_id, name) 10, 'Ava' 20, 'Ben' 30, 'Cara' -- Cara has no ordersNow consider the matching rule: customers.customer_id = orders.customer_id.
- Customer 10 matches two orders (order 1 and 2) → two output rows for Ava.
- Customer 20 matches one order → one output row for Ben.
- Customer 30 matches zero orders → depends on join type whether Cara appears.
- Order with customer_id 99 matches zero customers → depends on join direction and join type whether that order appears.
That’s the “pairs become rows” model in action.
Unmatched Records: What “Outer” Really Means
“Unmatched” means: a row on one side did not find any row on the other side that satisfies the join condition. Outer joins are about whether you keep those unmatched rows and how you represent missing columns (typically as NULL values in the output).
Left side preserved vs right side preserved
In a left join, the left input is preserved: every left row appears at least once in the output. If it has no match, it still appears, but the right-side columns are NULL.
In a right join, the right input is preserved: every right row appears at least once, and unmatched left-side columns become NULL.
In a full join, both sides are preserved: you keep unmatched rows from both sides.
The mental model: outer join = preserve one or both sides, filling missing columns with NULLs.
Step-by-Step: Predicting Join Output Before You Run It
When joins get complex, you can still reason about them reliably by following a repeatable checklist.
Step 1: Identify the grain (row meaning) of each input
Ask: “What does one row represent?” Examples:
customers: one row per customer.orders: one row per order.order_items: one row per item line in an order.
If you join tables with different grains, you should expect duplication of the higher-level entity. Joining customers to orders duplicates customers across their orders. Joining orders to order_items duplicates orders across their items.
Step 2: Identify the join keys and their uniqueness
Determine whether the join key is unique on either side. If you don’t know, you can test with a quick query (or at least reason from constraints):
-- Check duplicates of a supposed unique key (generic pattern) SELECT customer_id, COUNT(*) AS cnt FROM customers GROUP BY customer_id HAVING COUNT(*) > 1;If the “unique” side isn’t actually unique, your join may multiply rows unexpectedly.
Step 3: Estimate match cardinality
For each left row, how many right rows can match?
- If “0 or 1,” output rows will be at most the left row count (for inner join, possibly fewer; for left join, exactly equal to left row count).
- If “0 to many,” output rows can exceed the left row count.
Step 4: Decide what happens to unmatched rows
Pick the join type based on which side’s unmatched rows you need to keep. Then remember: unmatched rows show up with NULL columns from the missing side.
Step 5: Watch for filters that change the meaning
Filters can be applied before the join (in a subquery), during the join (in the ON clause), or after the join (in the WHERE clause). These placements can change whether unmatched rows survive.
The ON vs WHERE Mental Model: Don’t Accidentally Remove Unmatched Rows
A common source of confusion is placing conditions in WHERE that should be in ON, especially with outer joins.
Left join example: keeping customers with no orders
Suppose you want all customers, and only orders above $30 when they exist.
Condition in ON (preserves unmatched customers):
SELECT c.customer_id, c.name, o.order_id, o.order_total FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id AND o.order_total > 30;Mental model: you are defining what counts as a “match.” Orders under or equal to 30 are treated as “not matching,” so customers may become unmatched and still appear with NULL order columns.
Condition in WHERE (can remove unmatched customers):
SELECT c.customer_id, c.name, o.order_id, o.order_total FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id WHERE o.order_total > 30;Mental model: the left join first produces rows, including customers with NULL in o.order_total when they have no orders. Then WHERE o.order_total > 30 removes those rows because NULL does not satisfy the condition. The result behaves like an inner join for that condition.
Rule of thumb: conditions that define matching belong in ON; conditions that filter final results belong in WHERE. With outer joins, this distinction is crucial.
NULLs and Matches: Equality Doesn’t Work the Way You Think
NULL means “unknown” (or “missing”), and comparisons with NULL don’t behave like normal values. In most SQL dialects, NULL = NULL is not true; it is unknown. That affects join matching.
Example: joining on nullable columns
Imagine two tables with a nullable email column used as a join key. Rows where both emails are NULL will not match under =. If you intended them to match, you must use a null-safe comparison (dialect-specific) or coalesce to a sentinel value (carefully).
-- Generic pattern (be cautious with sentinel values) SELECT * FROM a JOIN b ON COALESCE(a.email, '') = COALESCE(b.email, '');Mental model: NULLs usually break matching unless you explicitly handle them. Prefer joining on non-nullable keys (PK/FK) whenever possible.
Row Multiplication: The “Fan-Out” Mental Model
When a left row matches multiple right rows, the left row’s columns are repeated in the output. This is not duplication in the sense of “bad data”; it is the correct representation of multiple matching pairs. But it can become a problem when you aggregate after the join.
Example: why sums can inflate
Suppose you join customers to orders and then to order_items. If you sum orders.order_total after joining to items, each order row may be repeated once per item, inflating the sum.
Mental model: each additional 1-to-many join can fan out rows further. Aggregations must be done at the correct grain.
Practical step-by-step to prevent inflated aggregates
- Step 1: Decide the grain you want to aggregate at (e.g., per order, per customer).
- Step 2: Aggregate at that grain before joining to a lower-grain table.
- Step 3: Join the aggregated result onward.
-- Aggregate order_items to order level first (pattern) WITH item_totals AS ( SELECT order_id, SUM(line_total) AS items_total FROM order_items GROUP BY order_id ) SELECT o.order_id, o.customer_id, it.items_total FROM orders o LEFT JOIN item_totals it ON it.order_id = o.order_id;Mental model: aggregate before you fan out.
Anti-Matches: Finding What Didn’t Match
Sometimes the most valuable insight is the set of unmatched records: customers with no orders, orders with no customer, products never sold, etc. This is where the “unmatched rows become NULLs” model becomes a tool.
Find left-side rows with no match
Use a left join and then filter for NULL on a right-side key that would be non-null if a match existed.
SELECT c.customer_id, c.name FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id WHERE o.order_id IS NULL;Mental model: after the left join, unmatched customers have NULL in all orders columns, including order_id. Filtering WHERE o.order_id IS NULL isolates the unmatched set.
Find right-side rows with no match
Flip the perspective (or use a right join if your dialect supports it consistently):
SELECT o.order_id, o.customer_id, o.order_total FROM orders o LEFT JOIN customers c ON c.customer_id = o.customer_id WHERE c.customer_id IS NULL;This identifies orders whose customer_id doesn’t exist in customers (or whose join key is missing).
Composite Keys: Matching on Multiple Columns
Not all relationships are expressed by a single column. Sometimes the correct match requires multiple columns (a composite key), such as (store_id, sku) or (country_code, national_id).
Mental model: the join key is the whole tuple
When joining on multiple columns, the match is defined by the combination. A row matches only if all key parts match (unless you explicitly use different logic).
SELECT p.store_id, p.sku, p.price, i.on_hand FROM prices p JOIN inventory i ON i.store_id = p.store_id AND i.sku = p.sku;Practical check: if you accidentally join on only one part of a composite key (e.g., sku alone), you can create many-to-many matches and a large fan-out.
Self-Consistency Checks: Validate Your Mental Model with Small Queries
Even with a strong mental model, it’s good practice to validate assumptions about uniqueness and match rates. You can do this with targeted queries that are cheap to run and easy to interpret.
Check how many right matches each left row has
SELECT c.customer_id, COUNT(o.order_id) AS order_count FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.customer_id;This reveals whether you truly have 0/1 matches or 0/many, and it surfaces unexpected spikes (e.g., a customer with 10,000 orders due to a data issue).
Check for join key quality issues
If you join on natural keys (like email), check for duplicates and nulls:
SELECT email, COUNT(*) AS cnt FROM customers GROUP BY email HAVING email IS NULL OR COUNT(*) > 1;Mental model: bad join keys create bad matches. Duplicates create fan-out; nulls create non-matches; inconsistent formatting creates missed matches.
Choosing the “Left” Side Is a Design Decision
SQL syntax forces you to pick a left and right input, but your mental model should treat this as a deliberate choice: which set of rows do you want to preserve? In reporting queries, the preserved side often corresponds to the “entity list” you want to see even when there is no related data (customers, products, dates). In data quality queries, the preserved side is the one you’re auditing for missing relationships.
Practical step-by-step: start from the entity you must not lose
- Step 1: Write a query that returns exactly the entity rows you need (often from one table or a dimension).
- Step 2: Add joins one at a time, predicting fan-out and unmatched behavior each time.
- Step 3: After each join, run a quick row count check (or group by the entity key) to confirm you didn’t unintentionally multiply or drop entities.
-- Pattern: check whether you preserved one row per customer SELECT c.customer_id, COUNT(*) AS rows_after_joins FROM customers c LEFT JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.customer_id;If you expected one row per customer but see multiple rows, that’s not necessarily wrong; it means your current query grain is now “customer-order” rather than “customer.” You can then decide whether to aggregate, select distinct, or restructure.