All courses > >

SQL for Business Intelligence: Subqueries for Reusable Filters and Pre-Aggregation

Capítulo 9

Estimated reading time: 9 minutes

Why subqueries matter in BI queries

In BI work, you often need the same business rule in multiple places: “active customers,” “customers who bought recently,” “orders that contain a certain product category,” or “accounts with overdue invoices.” Subqueries let you package that logic so you can reuse it without copying long filter lists or repeating complex joins. A subquery is simply a query nested inside another query. You can use it to return a single value (scalar subquery), a set of values for filtering (IN), a yes/no existence check (EXISTS), or a pre-aggregated result set you join to (derived table).

A stepwise workflow: build the inner query first

Subqueries are easiest to get right when you treat them as two separate queries and validate the inner one before embedding it.

Step 1: Write the inner query alone. Make sure it returns what you think it returns.
Step 2: Confirm the grain. Ask: “What does one row represent?” (one customer, one order, one day?)
Step 3: Check uniqueness. If the outer query expects one row per key (e.g., one row per customer), ensure the inner query returns at most one row per customer (or aggregate/deduplicate until it does).
Step 4: Integrate. Use the inner query as a scalar value, a filter set, an EXISTS condition, or a derived table to join.
Step 5: QA with small samples. Spot-check counts and a few IDs to confirm the filter/join behaves as intended.

Scalar subqueries: returning a single value

A scalar subquery returns exactly one value (one row, one column). In BI, this is useful for comparing rows to a benchmark, threshold, or “latest date” computed from the data.

Example: compare each order date to the latest order date

Step 1: inner query (validate it returns one value)

SELECT MAX(order_date) AS max_order_date FROM orders;

Step 2: use it in the outer query

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

SELECT o.order_id, o.customer_id, o.order_date, CASE WHEN o.order_date = (SELECT MAX(order_date) FROM orders) THEN 1 ELSE 0 END AS is_latest_order_date FROM orders o;

Grain check: the inner query returns one value for the whole table, so it is safe to use as a scalar expression.

Common pitfall: scalar subquery returns multiple rows

If your inner query can return more than one row, many databases will error (or behave unexpectedly). If you intend one value, force it with an aggregate or a filter that guarantees one row.

-- Risky if multiple currencies exist: returns multiple rows (one per currency) SELECT currency, MAX(order_date) FROM orders GROUP BY currency;

If you need a per-customer scalar value, that is no longer a scalar subquery for the whole query; it becomes a correlated subquery (evaluated per outer row) or a derived table joined by customer_id.

IN subqueries: reusable filter lists

Use IN when you want to filter rows based on membership in a set returned by another query. This is common for “customers who meet condition X” filters.

Pattern: customers who purchased in the last 30 days

Step 1: inner query returns the set of qualifying customer IDs

SELECT DISTINCT o.customer_id FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day';

Step 2: outer query uses that set to filter customers

SELECT c.customer_id, c.customer_name, c.segment FROM customers c WHERE c.customer_id IN ( SELECT DISTINCT o.customer_id FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day' );

Grain check: the inner query is “one row per customer_id” because of DISTINCT. Even if you forget DISTINCT, IN still works with duplicates, but DISTINCT makes the intent clear and can reduce work.

When IN is a good fit

When the inner query naturally returns a list of keys (customer_id, product_id, account_id).
When you want to keep the outer query focused on the entity you are reporting on (e.g., customers) and treat the inner query as a reusable “qualifying set.”

EXISTS subqueries: filtering by “at least one matching row”

EXISTS answers a yes/no question: “Does at least one related row exist that matches these conditions?” It is often a strong choice for BI filters because it avoids returning a list and it naturally expresses “at least one.” It also avoids issues with duplicates because it stops at the first match.

Pattern: customers who purchased in the last 30 days (EXISTS)

Step 1: inner query (validate logic with a standalone version)

SELECT o.customer_id, o.order_id, o.order_date FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day';

Step 2: convert to EXISTS by correlating on customer_id

SELECT c.customer_id, c.customer_name, c.segment FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date >= CURRENT_DATE - INTERVAL '30 day' );

Grain check: the inner query is evaluated per customer row. It can return many orders for a customer, but EXISTS only needs one match, so duplicates do not change the result.

IN vs EXISTS: practical guidance

Use IN when you already have (or want) a distinct list of IDs and it reads naturally.
Use EXISTS when the condition is “at least one related row” and you want to avoid thinking about duplicates in the inner query.
QA tip: if IN and EXISTS are both valid, run both and compare counts for a small date range; mismatches often reveal subtle join/filter differences.

Derived tables: pre-aggregate before joining to prevent duplication

A derived table is a subquery in the FROM clause. In BI, the most important use is pre-aggregation: compute metrics at the right grain first, then join those results to a dimension table. This prevents accidental multiplication of rows when joining detailed tables.

Why pre-aggregation prevents duplication

Suppose you want a customer report with total revenue in the last 30 days. If you join customers to orders and order_items directly, you can easily multiply order-level amounts by the number of items per order (or multiply customer rows by the number of orders). The safe pattern is: aggregate to one row per customer first, then join.

Example: revenue per customer in the last 30 days (pre-aggregate then join)

Step 1: inner query alone (validate grain and uniqueness)

SELECT o.customer_id, SUM(o.order_total) AS revenue_30d FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day' GROUP BY o.customer_id;

Grain check: one row per customer_id. Uniqueness is guaranteed by GROUP BY customer_id.

Step 2: join the derived table to customers

SELECT c.customer_id, c.customer_name, c.segment, r.revenue_30d FROM customers c LEFT JOIN ( SELECT o.customer_id, SUM(o.order_total) AS revenue_30d FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day' GROUP BY o.customer_id ) r ON r.customer_id = c.customer_id;

Why LEFT JOIN: keeps customers with no orders in the last 30 days (their revenue_30d will be NULL). Choose INNER JOIN if you only want customers who purchased.

Example: last purchase date per customer (pre-aggregate then join)

Step 1: inner query

SELECT o.customer_id, MAX(o.order_date) AS last_order_date FROM orders o GROUP BY o.customer_id;

Step 2: join to customers

SELECT c.customer_id, c.customer_name, l.last_order_date FROM customers c LEFT JOIN ( SELECT o.customer_id, MAX(o.order_date) AS last_order_date FROM orders o GROUP BY o.customer_id ) l ON l.customer_id = c.customer_id;

Grain check: one row per customer_id, safe to join to customers (also one row per customer_id).

Putting it together: reusable filters + pre-aggregation

You can combine patterns: use a derived table to pre-aggregate, and use EXISTS to filter to a cohort, without duplicating logic.

Example: show only customers who purchased in the last 30 days, with their 30-day revenue

Step 1: inner aggregation query (one row per customer)

SELECT o.customer_id, SUM(o.order_total) AS revenue_30d FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day' GROUP BY o.customer_id;

Step 2: outer query joins the metric and applies an EXISTS cohort filter

SELECT c.customer_id, c.customer_name, r.revenue_30d FROM customers c JOIN ( SELECT o.customer_id, SUM(o.order_total) AS revenue_30d FROM orders o WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 day' GROUP BY o.customer_id ) r ON r.customer_id = c.customer_id WHERE EXISTS ( SELECT 1 FROM orders o2 WHERE o2.customer_id = c.customer_id AND o2.order_date >= CURRENT_DATE - INTERVAL '30 day' );

Note: in this specific case, the EXISTS is redundant because the JOIN to r already enforces “has revenue_30d,” but the pattern is useful when your metric table and your cohort definition are not identical (e.g., cohort is “bought category A,” metric is “total revenue”).

QA checklist for subqueries in BI

Does the inner query run on its own? Save it as a snippet you can reuse.
What is the grain of the inner result? Write it down: “one row per customer,” “one row per customer per month,” etc.
Is the join key unique in the derived table? If not, you may multiply rows in the outer query.
Does the filter mean “in a set” (IN) or “at least one match” (EXISTS)? Choose the one that matches the business question.
Do counts change unexpectedly after integrating? Compare row counts before/after adding the subquery and spot-check a few IDs.

Now answer the exercise about the content:

In a BI query, what is the safest way to add a 30-day revenue metric to a customer report without accidentally multiplying rows?

You are right! Congratulations, now go to the next page

You missed! Try again.

Pre-aggregating to one row per customer creates the correct grain and a unique join key, preventing row multiplication when joining detailed tables. Then you can safely join the derived table to customers.