Why “query patterns” matter
Once you know how to retrieve, filter, join, and summarize data, the next step is learning repeatable patterns that solve common real-world questions reliably. A “query pattern” is a recognizable structure you can adapt: find the latest record per group, detect missing rows, compare periods, pick top-N per category, or safely handle optional relationships. This chapter focuses on practical patterns and the mistakes that frequently produce wrong results or slow queries.
Pattern: Find the “latest” row per entity (greatest-n-per-group)
A very common requirement is “for each customer, show their most recent order” or “for each device, show the latest status.” The pitfall is trying to use an aggregate like MAX(date) and also select non-aggregated columns from the same table without a reliable method to tie them together. You might accidentally pair the maximum date with the wrong row’s other columns.
Approach A: Window function with ROW_NUMBER()
If your database supports window functions (most modern ones do), this is usually the cleanest approach. The idea is: rank rows within each group, then keep rank 1.
SELECT customer_id, order_id, order_date, total_amount
FROM (
SELECT o.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC, order_id DESC
) AS rn
FROM orders o
) ranked
WHERE rn = 1;
Step-by-step:
- Partition rows by the entity you want one row per (here: customer_id).
- Order within each partition so the “latest” row comes first (order_date DESC). Add a tie-breaker (order_id DESC) so the result is deterministic when dates match.
- ROW_NUMBER() assigns 1 to the first row in each partition.
- Filter to rn = 1 to keep only the latest row per customer.
Common pitfalls:
- No tie-breaker: If two orders share the same order_date, the database may return either one. Add a stable tie-breaker column.
- Filtering too early: If you filter orders before ranking (for example, only “shipped” orders), you are changing what “latest” means. That might be correct, but be explicit about it.
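To see the pattern end to end, here is a small runnable sketch using Python's built-in sqlite3 module. The table and rows are invented for illustration, and window functions require a bundled SQLite of 3.25 or newer (any recent Python qualifies):

```python
import sqlite3

# Invented sample data; customer 100 has two orders on the same date,
# so the order_id tie-breaker decides which one is "latest".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     order_date TEXT, total_amount REAL);
INSERT INTO orders VALUES
  (1, 100, '2026-01-10', 50.0),
  (2, 100, '2026-01-12', 75.0),
  (3, 100, '2026-01-12', 20.0),  -- same date as order 2: tie-breaker decides
  (4, 200, '2026-01-11', 30.0);
""")

rows = conn.execute("""
  SELECT customer_id, order_id, order_date, total_amount
  FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (
             PARTITION BY customer_id
             ORDER BY order_date DESC, order_id DESC
           ) AS rn
    FROM orders o
  ) ranked
  WHERE rn = 1
  ORDER BY customer_id
""").fetchall()
print(rows)  # [(100, 3, '2026-01-12', 20.0), (200, 4, '2026-01-11', 30.0)]
```

Note how the tie on 2026-01-12 is resolved deterministically by order_id rather than left to chance.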
Approach B: Correlated subquery (portable but can be slower)
When window functions are not available, you can use a correlated subquery to keep rows whose date equals the maximum date for that customer.
SELECT o.customer_id, o.order_id, o.order_date, o.total_amount
FROM orders o
WHERE o.order_date = (
SELECT MAX(o2.order_date)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);
Pitfall: If multiple orders share the same maximum date, you will get multiple rows per customer. If you truly need one row, you must add a tie-breaker (often by using a second condition on MAX(order_id) among those dates) or switch to a window function.
Pattern: Pick “top N per group” (e.g., top 3 products per category)
Another frequent question is “top 3 items in each category by sales.” A global LIMIT returns top N overall, not per group. The correct pattern is to rank within each group.
SELECT category_id, product_id, sales_amount
FROM (
SELECT s.category_id, s.product_id, s.sales_amount,
ROW_NUMBER() OVER (
PARTITION BY s.category_id
ORDER BY s.sales_amount DESC, s.product_id
) AS rn
FROM product_sales s
) x
WHERE rn <= 3
ORDER BY category_id, rn;
Common pitfalls:
- Using LIMIT without partitioning: LIMIT 3 gives only three rows total.
- Choosing ROW_NUMBER vs RANK vs DENSE_RANK: ROW_NUMBER gives exactly N rows per group. RANK/DENSE_RANK can return more than N rows when there are ties. Decide based on business rules.
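The difference between the ranking functions is easy to demonstrate. The sketch below uses Python's sqlite3 with invented data containing a three-way tie at 300: RANK() returns four rows for a "top 3" while ROW_NUMBER() returns exactly three.

```python
import sqlite3

# Invented data with a three-way tie; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_sales (category_id INTEGER, product_id INTEGER,
                            sales_amount REAL);
INSERT INTO product_sales VALUES
  (1, 10, 500.0), (1, 11, 300.0), (1, 12, 300.0), (1, 13, 300.0), (1, 14, 100.0);
""")

def top_n(ranker, n=3):
    # ranker is a constant chosen in this script (never user input),
    # e.g. "ROW_NUMBER() OVER (...)".
    return conn.execute(f"""
      SELECT product_id FROM (
        SELECT product_id, {ranker} AS rn FROM product_sales
      ) t WHERE rn <= ?""", (n,)).fetchall()

rows_rn   = top_n("ROW_NUMBER() OVER (ORDER BY sales_amount DESC, product_id)")
rows_rank = top_n("RANK() OVER (ORDER BY sales_amount DESC)")
print(len(rows_rn), len(rows_rank))  # 3 4
```

ROW_NUMBER's tie-breaker forces exactly three rows; RANK assigns 2 to all three tied products, so four rows satisfy rn <= 3.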
Pattern: Anti-join (find missing related rows)
“Show customers who have never placed an order” or “products with no inventory record” are classic “missing relationship” questions. The pitfall is using NOT IN with NULLs, which can produce unexpected empty results.
Preferred approach: LEFT JOIN ... WHERE right_side IS NULL
SELECT c.customer_id, c.customer_name
FROM customers c
LEFT JOIN orders o
ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;
Step-by-step:
- LEFT JOIN keeps all customers.
- Customers with no matching orders get NULLs for order columns.
- Filter to rows where the joined key is NULL to keep only “no match” cases.
Alternative: NOT EXISTS (often robust and clear)
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Common pitfall: NOT IN with NULLs
If the subquery returns even one NULL, NOT IN returns no rows at all: every comparison against NULL evaluates to "unknown," so no row can pass the filter. If you must use NOT IN, ensure the subquery cannot return NULL, but in practice NOT EXISTS or an anti-join is safer.
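The trap is easy to reproduce. In this sketch (Python's sqlite3, throwaway tables), one order row has a NULL customer_id; the NOT IN query returns nothing, while NOT EXISTS finds the customer with no orders:

```python
import sqlite3

# Invented data: order 11 has a NULL customer_id, poisoning NOT IN.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# NOT IN: the NULL makes every comparison "unknown", so no row passes.
not_in = conn.execute("""
  SELECT customer_name FROM customers
  WHERE customer_id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

# NOT EXISTS correctly finds Bob, who has no orders.
not_exists = conn.execute("""
  SELECT customer_name FROM customers c
  WHERE NOT EXISTS (SELECT 1 FROM orders o
                    WHERE o.customer_id = c.customer_id)
""").fetchall()

print(not_in, not_exists)  # [] [('Bob',)]
```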
Pattern: Semi-join (filter to rows that have a match, without duplicates)
Sometimes you want “customers who have at least one order,” but you do not want one row per order. Joining customers to orders and selecting customers can create duplicates. A semi-join pattern avoids that.
Use EXISTS to avoid duplicates
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Common pitfall: Using an inner join and then forgetting to de-duplicate with DISTINCT. DISTINCT can hide problems and add cost; EXISTS expresses the intent directly.
Pattern: Handle optional relationships without losing rows
In real schemas, some relationships are optional: an order may or may not have a coupon, a user may or may not have a profile row, a ticket may or may not be assigned. The pitfall is using an inner join when you actually need all base rows.
Use LEFT JOIN and keep filters in the ON clause when appropriate
Suppose you want all orders, and coupon details only when the coupon is active. If you put the “active” filter in WHERE, you turn the LEFT JOIN into an inner join and lose orders without active coupons.
SELECT o.order_id, o.order_date, c.coupon_code
FROM orders o
LEFT JOIN coupons c
ON c.coupon_id = o.coupon_id
AND c.is_active = 1;
Common pitfall:
- Filtering the joined table in WHERE: writing WHERE c.is_active = 1 removes the rows where c is NULL, which defeats the purpose of LEFT JOIN.
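The difference between filtering in ON and filtering in WHERE is easiest to see side by side. A sketch with Python's sqlite3 and invented coupon data:

```python
import sqlite3

# Invented data: coupon 200 is inactive, order 3 has no coupon at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, coupon_id INTEGER);
CREATE TABLE coupons (coupon_id INTEGER, coupon_code TEXT, is_active INTEGER);
INSERT INTO orders VALUES (1, 100), (2, 200), (3, NULL);
INSERT INTO coupons VALUES (100, 'SAVE10', 1), (200, 'OLD5', 0);
""")

# Filter in ON: all three orders survive; non-matches show as NULL.
in_on = conn.execute("""
  SELECT o.order_id, c.coupon_code
  FROM orders o
  LEFT JOIN coupons c ON c.coupon_id = o.coupon_id AND c.is_active = 1
  ORDER BY o.order_id
""").fetchall()

# Filter in WHERE: NULL rows fail the test, so only order 1 survives.
in_where = conn.execute("""
  SELECT o.order_id, c.coupon_code
  FROM orders o
  LEFT JOIN coupons c ON c.coupon_id = o.coupon_id
  WHERE c.is_active = 1
""").fetchall()

print(in_on)     # [(1, 'SAVE10'), (2, None), (3, None)]
print(in_where)  # [(1, 'SAVE10')]
```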
Pattern: Compare a row to the “previous” row (running comparisons)
Typical tasks: detect changes in status over time, compute difference from previous day, or identify gaps in sequences. Window functions such as LAG are designed for this.
Use LAG to compare to the prior record
SELECT device_id, event_time, status,
LAG(status) OVER (
PARTITION BY device_id
ORDER BY event_time
) AS prev_status
FROM device_events;
To keep only rows where the status changed:
SELECT *
FROM (
SELECT device_id, event_time, status,
LAG(status) OVER (
PARTITION BY device_id
ORDER BY event_time
) AS prev_status
FROM device_events
) x
WHERE prev_status IS NULL OR status <> prev_status;
Common pitfalls:
- Wrong ordering: LAG depends on ORDER BY. If you order by a non-unique timestamp without a tie-breaker, results can be unstable.
- NULL comparisons: status <> prev_status is not true when one side is NULL. Handle NULL explicitly if needed.
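Here is the change-detection query end to end, using Python's sqlite3 with an invented event log (window functions require SQLite 3.25+):

```python
import sqlite3

# Invented event log: device 1 goes up -> down -> up over five events.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE device_events (device_id INTEGER, event_time TEXT, status TEXT);
INSERT INTO device_events VALUES
  (1, '2026-01-01T10:00', 'up'),
  (1, '2026-01-01T11:00', 'up'),
  (1, '2026-01-01T12:00', 'down'),
  (1, '2026-01-01T13:00', 'down'),
  (1, '2026-01-01T14:00', 'up');
""")

changes = conn.execute("""
  SELECT event_time, status FROM (
    SELECT device_id, event_time, status,
           LAG(status) OVER (
             PARTITION BY device_id ORDER BY event_time
           ) AS prev_status
    FROM device_events
  ) x
  WHERE prev_status IS NULL OR status <> prev_status
  ORDER BY event_time
""").fetchall()

# Only the first event and the two transitions survive.
print(changes)
```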
Pattern: Running totals and cumulative metrics
Running totals are useful for dashboards: cumulative revenue by day, running count of signups, or account balance over time. The pitfall is accidentally computing a total over the entire table when you meant “up to this row.”
Use SUM() OVER with an ordered window
SELECT order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_revenue
FROM daily_sales
ORDER BY order_date;
Common pitfalls:
- Missing ORDER BY in the window: Without it, you get the same total on every row.
- Wrong frame: Some databases default to a frame that can behave differently with duplicates in the ORDER BY column. If you want strict “row-by-row” accumulation, specify the frame explicitly as shown.
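A runnable version of the running-total query, using Python's sqlite3 with invented daily figures:

```python
import sqlite3

# Invented daily revenue; SUM() OVER needs SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (order_date TEXT, daily_revenue REAL);
INSERT INTO daily_sales VALUES
  ('2026-01-01', 100.0), ('2026-01-02', 50.0), ('2026-01-03', 25.0);
""")

rows = conn.execute("""
  SELECT order_date, daily_revenue,
         SUM(daily_revenue) OVER (
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS running_revenue
  FROM daily_sales
  ORDER BY order_date
""").fetchall()

# Each row carries the total of everything up to and including itself.
print(rows)  # [('2026-01-01', 100.0, 100.0), ('2026-01-02', 50.0, 150.0), ('2026-01-03', 25.0, 175.0)]
```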
Pattern: Pagination that doesn’t skip or duplicate rows
When displaying results in pages (page 1, page 2, etc.), a naive approach uses OFFSET/LIMIT. It works for small datasets, but it can become slow for large offsets and can produce duplicates or missing rows if the underlying data changes between requests or if ordering is not deterministic.
Keyset pagination (seek method)
Instead of “skip N rows,” you say “give me the next rows after the last one I saw,” using a stable ordering key.
First page:
SELECT order_id, order_date, total_amount
FROM orders
ORDER BY order_date DESC, order_id DESC
FETCH FIRST 20 ROWS ONLY;
Next page (using the last row from the previous page):
SELECT order_id, order_date, total_amount
FROM orders
WHERE (order_date < :last_order_date)
OR (order_date = :last_order_date AND order_id < :last_order_id)
ORDER BY order_date DESC, order_id DESC
FETCH FIRST 20 ROWS ONLY;
Common pitfalls:
- Non-deterministic ordering: Ordering only by order_date can shuffle rows with the same date between pages. Add a unique tie-breaker like order_id.
- Using OFFSET for deep pages: OFFSET 100000 can be expensive because the database still has to walk past many rows.
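The two pages can be sketched with Python's sqlite3. Note that SQLite writes LIMIT n where the examples above use FETCH FIRST n ROWS ONLY; the table and rows are invented:

```python
import sqlite3

# Five invented orders, with ties on order_date to exercise the tie-breaker.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT);
INSERT INTO orders VALUES
  (1, '2026-01-02'), (2, '2026-01-03'), (3, '2026-01-03'),
  (4, '2026-01-01'), (5, '2026-01-02');
""")

page1 = conn.execute("""
  SELECT order_id, order_date FROM orders
  ORDER BY order_date DESC, order_id DESC
  LIMIT 2""").fetchall()

last_id, last_date = page1[-1]  # remember the last row we showed

page2 = conn.execute("""
  SELECT order_id, order_date FROM orders
  WHERE order_date < ? OR (order_date = ? AND order_id < ?)
  ORDER BY order_date DESC, order_id DESC
  LIMIT 2""", (last_date, last_date, last_id)).fetchall()

print(page1)  # [(3, '2026-01-03'), (2, '2026-01-03')]
print(page2)  # [(5, '2026-01-02'), (1, '2026-01-02')]
```

No row is skipped or repeated across the two pages, even though two orders share each date.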
Pattern: Safe searching with wildcards and escaping
Searching text with LIKE is common, but there are pitfalls: leading wildcards can prevent index usage, and user input can accidentally contain wildcard characters that change meaning.
Be explicit about the match type
- Prefix match: name LIKE 'Ann%' often performs better than '%Ann%'.
- Contains match: name LIKE '%Ann%' is flexible but can be slower.
Escape wildcard characters when input is literal
If a user searches for 50% off, the % is a wildcard unless escaped. Many databases support an ESCAPE clause.
SELECT product_id, product_name
FROM products
WHERE product_name LIKE :pattern ESCAPE '\';
Your application builds :pattern by escaping %, _, and the escape character itself, and then adding wildcards only where they are actually intended.
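A sketch of that escaping step with Python's sqlite3; escape_like is a hypothetical helper your application might provide:

```python
import sqlite3

def escape_like(text, esc="\\"):
    # Hypothetical helper: escape the escape character itself first,
    # then the LIKE wildcards % and _.
    for ch in (esc, "%", "_"):
        text = text.replace(ch, esc + ch)
    return text

# Invented products: only one name begins with the literal "50% off".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, product_name TEXT);
INSERT INTO products VALUES (1, '50% off banner'), (2, '50 shades of blue');
""")

# Prefix search for the literal text "50% off".
pattern = escape_like("50% off") + "%"
rows = conn.execute(
    "SELECT product_id FROM products WHERE product_name LIKE ? ESCAPE '\\'",
    (pattern,)).fetchall()
print(rows)  # [(1,)] -- the unescaped % would also have matched product 2
```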
Common pitfall: Implicit type conversions and “mysterious” mismatches
Queries can silently convert types: comparing a numeric column to a string literal, comparing dates to strings, or joining columns of different types. This can lead to wrong results or prevent indexes from being used.
Examples of risky comparisons
-- Risky: comparing integer to string
WHERE customer_id = '00123'
-- Risky: comparing date/timestamp to string
WHERE order_date >= '2026-01-01'
Better: use properly typed parameters (from your application) or explicit casts in SQL when needed. The exact cast syntax varies by database, but the principle is consistent: make types match intentionally.
Common pitfall: NULL logic surprises
NULL means “unknown,” and it changes how comparisons work. Beginners often expect NULL to behave like an empty string or zero, but it does not.
Typical surprises
- col = NULL is never true. Use col IS NULL.
- col <> 'X' does not return rows where col is NULL, because the result is unknown, not true.
Practical pattern: treat NULLs explicitly for comparisons
If you want to treat NULL as a specific value for comparison, use a coalescing function (often COALESCE).
SELECT user_id
FROM users
WHERE COALESCE(status, 'unknown') <> 'disabled';
Be careful: Coalescing can change meaning. Make sure “unknown” truly belongs in the “not disabled” set for your business logic.
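Both behaviors are easy to verify with Python's sqlite3 and a throwaway users table:

```python
import sqlite3

# Invented users: user 3 has a NULL status.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, status TEXT);
INSERT INTO users VALUES (1, 'active'), (2, 'disabled'), (3, NULL);
""")

# Three-valued logic: NULL <> 'disabled' is unknown, so user 3 is filtered out.
plain = conn.execute(
    "SELECT user_id FROM users WHERE status <> 'disabled'").fetchall()

# COALESCE maps NULL to 'unknown', so user 3 survives the filter.
coalesced = conn.execute(
    "SELECT user_id FROM users "
    "WHERE COALESCE(status, 'unknown') <> 'disabled'").fetchall()

print(plain)      # [(1,)]        user 3 silently vanishes
print(coalesced)  # [(1,), (3,)]  NULL treated as 'unknown'
```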
Common pitfall: Accidental row multiplication in joins
Even when you understand joins, it is easy to accidentally multiply rows when joining to a table that has multiple matches per key (for example, joining customers to addresses when customers can have multiple addresses). The result looks “too big,” and totals become inflated.
How to detect it
- Compare row counts before and after the join.
- Look for keys that are not unique on the joined side.
- Check whether you intended “one-to-many” or “one-to-one.”
Practical fixes
- If you only need one related row (e.g., “primary address”), filter to that row before joining (for example, by ranking addresses per customer and keeping rn = 1).
- If you need aggregated info from the many-side (e.g., count of addresses), compute it first and then join the aggregated result to avoid multiplying base rows.
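The aggregate-first fix can be sketched like this (Python's sqlite3, invented data in which one customer has two addresses):

```python
import sqlite3

# Invented data: Ann has two addresses, so a plain join doubles her row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
CREATE TABLE addresses (address_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO addresses VALUES (10, 1), (11, 1), (12, 2);
""")

# Plain join: 3 rows for 2 customers -- any totals over this are inflated.
multiplied = conn.execute("""
  SELECT c.customer_id FROM customers c
  JOIN addresses a ON a.customer_id = c.customer_id
""").fetchall()

# Aggregate the many-side first, then join: one row per customer.
aggregated = conn.execute("""
  SELECT c.customer_id, COALESCE(a.n, 0) AS address_count
  FROM customers c
  LEFT JOIN (SELECT customer_id, COUNT(*) AS n
             FROM addresses GROUP BY customer_id) a
    ON a.customer_id = c.customer_id
  ORDER BY c.customer_id
""").fetchall()

print(len(multiplied), aggregated)  # 3 [(1, 2), (2, 1)]
```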
Common pitfall: Using DISTINCT as a band-aid
DISTINCT can remove duplicates, but it can also hide a logic error (like an unintended many-to-many join). It may also be expensive on large result sets. Prefer to fix the join logic or use EXISTS when you only need to test for presence.
Common pitfall: Non-deterministic results without stable ordering
If you do not specify ORDER BY, the database is free to return rows in any order. Even with ORDER BY, if the ordering columns are not unique, the relative order of ties can change between runs.
Practical rule
Whenever you page results, pick “latest,” or compute “first/last,” include a unique tie-breaker in ORDER BY (often the primary key).
Common pitfall: Counting the wrong thing
Counting seems simple, but it is easy to count rows when you meant to count entities, or to count a column that can be NULL.
Practical reminders
- COUNT(*) counts rows.
- COUNT(column) counts non-NULL values in that column.
- If a join multiplies rows, counts can inflate unless you count distinct entity keys or restructure the query.
Example: count customers who ordered at least once (without duplicates):
SELECT COUNT(*) AS customers_with_orders
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Common pitfall: Time boundaries and inclusive/exclusive ranges
Date/time filtering often goes wrong at boundaries. For example, filtering “orders on 2026-01-13” can miss rows if order_date includes time. A safer pattern is to use a half-open interval: start inclusive, end exclusive.
WHERE order_timestamp >= :start_ts
AND order_timestamp < :end_ts
Step-by-step:
- Compute start_ts as the beginning of the period.
- Compute end_ts as the beginning of the next period.
- Use >= for start and < for end to avoid double-counting boundary values.
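A quick check of the boundary behavior with Python's sqlite3 (timestamps stored as text, an invented orders table):

```python
import sqlite3

# Invented data: two orders on the 13th, one at midnight on the 14th.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_timestamp TEXT);
INSERT INTO orders VALUES
  (1, '2026-01-13 00:00:00'),
  (2, '2026-01-13 23:59:59'),
  (3, '2026-01-14 00:00:00');
""")

# Equality against a bare date matches nothing: every stored value
# carries a time part.
naive = conn.execute(
    "SELECT order_id FROM orders WHERE order_timestamp = '2026-01-13'"
).fetchall()

# Half-open range: start inclusive, end exclusive.
rows = conn.execute("""
  SELECT order_id FROM orders
  WHERE order_timestamp >= '2026-01-13 00:00:00'
    AND order_timestamp <  '2026-01-14 00:00:00'
  ORDER BY order_id""").fetchall()

print(naive, rows)  # [] [(1,), (2,)]
```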
Common pitfall: Performance surprises from functions on indexed columns
Applying a function to a column in a predicate can prevent index usage in many databases. For example, wrapping a timestamp column in a date-extraction function inside WHERE can force a scan.
Prefer range predicates over transforming the column
Instead of:
WHERE DATE(order_timestamp) = :some_date
Prefer:
WHERE order_timestamp >= :some_date_start
AND order_timestamp < :next_date_start
This keeps the column “as-is” and often allows the database to use an index on order_timestamp.
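You can watch this happen in SQLite, whose EXPLAIN QUERY PLAN reports whether an index is used (sketch with Python's sqlite3; the table and index names are invented):

```python
import sqlite3

# Invented table with an index on the timestamp column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_timestamp TEXT);
CREATE INDEX idx_orders_ts ON orders (order_timestamp);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in DATE() hides it from the index: full scan.
scan = plan("SELECT * FROM orders WHERE DATE(order_timestamp) = '2026-01-13'")

# The equivalent half-open range leaves the column bare: index range search.
seek = plan("""SELECT * FROM orders
               WHERE order_timestamp >= '2026-01-13'
                 AND order_timestamp <  '2026-01-14'""")

print(scan)  # a SCAN: the function prevents index use
print(seek)  # a SEARCH ... USING INDEX idx_orders_ts
```

The exact wording of the plan text varies slightly between SQLite versions, but the scan-versus-index distinction is stable.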
Practical workflow: Debugging a query that “looks right” but is wrong
1) Reduce the problem
Start by selecting only the key columns and the columns involved in the logic. Remove extra joins and computed columns temporarily. This makes it easier to see where duplication or filtering changes the result.
2) Validate assumptions with small checks
- Check uniqueness on join keys by counting matches per key on the joined table.
- Check for NULLs in columns used in joins or filters.
- Check for ties in “latest” logic (same timestamp for multiple rows).
3) Build in layers
Use subqueries (or common table expressions, if you prefer) to create intermediate result sets: one step to rank, another to filter, another to join. Layering makes intent clearer and reduces the chance of mixing incompatible operations in one SELECT.
4) Make ordering deterministic
Whenever you use “first,” “last,” pagination, or window functions, ensure ORDER BY includes a unique tie-breaker so results are stable and testable.