All courses > Technology and Programming > Databases ::

Summarizing Data with Aggregations and Grouping

Capítulo 7

Estimated reading time: 12 minutes

Why Aggregations and Grouping Matter

When you look at raw rows in a table, you often see “events” or “transactions”: one order, one payment, one page view, one sensor reading. Many business questions are not about individual rows, but about summaries: total revenue per month, average delivery time per warehouse, number of customers per country, highest and lowest invoice values, and so on.

SQL aggregations let you turn many rows into fewer rows by calculating summary values. Grouping lets you decide which rows belong together (for example, “all orders for the same customer” or “all sales in the same month”). In practice, you will use aggregates constantly for reporting, dashboards, data checks, and exploratory analysis.

Core Aggregate Functions

Aggregate functions compute a single value from multiple rows. The most common ones are:

COUNT: how many rows (or how many non-NULL values) exist
SUM: total of numeric values
AVG: average of numeric values
MIN and MAX: smallest and largest values

These functions ignore NULL values (except COUNT(*), which counts rows regardless of NULLs). This NULL behavior is important: if a column can be missing, your totals and averages may not behave the way you expect unless you handle NULLs explicitly.

COUNT: Counting Rows vs Counting Values

COUNT(*) counts rows. COUNT(column_name) counts only rows where that column is not NULL. This difference is useful when you want to count “known values” rather than “rows that exist.”

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

-- Count all orders (rows), even if some fields are NULL in those rows
SELECT COUNT(*) AS order_count
FROM orders;

-- Count orders that have a shipped_at timestamp (non-NULL)
SELECT COUNT(shipped_at) AS shipped_count
FROM orders;

If shipped_at is NULL until an order ships, then COUNT(shipped_at) gives you “how many shipped,” while COUNT(*) gives you “how many total.”

SUM and AVG: Numeric Summaries

SUM adds values. AVG computes the mean. Both ignore NULLs. If you want NULLs to behave like zero, you must convert them (for example, with COALESCE), but be careful: treating missing as zero can change the meaning of the result.

-- Total revenue from all paid invoices
SELECT SUM(total_amount) AS total_revenue
FROM invoices
WHERE status = 'PAID';

-- Average order value for completed orders
SELECT AVG(order_total) AS avg_order_value
FROM orders
WHERE status = 'COMPLETED';

Many databases return decimals for AVG, but some may return an integer if the input is integer. If you see unexpected rounding, cast the input to a decimal type before averaging.

MIN and MAX: Extremes

MIN and MAX work on numbers, dates, and text (text comparisons depend on collation rules). They are useful for ranges, sanity checks, and “latest/earliest” style questions.

-- Earliest and latest order dates
SELECT MIN(order_date) AS first_order_date,
       MAX(order_date) AS last_order_date
FROM orders;

Aggregations Without GROUP BY (Single Summary Row)

If you use aggregate functions without GROUP BY, the entire filtered result set becomes one group, and you get one summary row.

-- One row: total orders and total revenue for 2025
SELECT COUNT(*) AS orders_2025,
       SUM(order_total) AS revenue_2025
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date <  '2026-01-01';

This pattern is common for KPI tiles: “Total sales this month,” “Active users today,” “Tickets opened this week.”

Grouping with GROUP BY (Multiple Summary Rows)

GROUP BY splits rows into groups based on one or more columns, then computes aggregates per group. Each distinct combination of the grouped columns becomes one output row.

-- Orders per status
SELECT status,
       COUNT(*) AS order_count
FROM orders
GROUP BY status;

Here, all rows with the same status are grouped together, and COUNT(*) is computed for each group.

Rule: Every Selected Non-Aggregated Column Must Be Grouped

In a grouped query, every column in the SELECT list must be either:

Inside an aggregate function (like SUM(amount)), or
Listed in the GROUP BY

If you try to select a non-aggregated column that is not in GROUP BY, most databases will raise an error because the value would be ambiguous (which row’s value should be shown?).

-- This is invalid in most SQL databases
SELECT status, order_date, COUNT(*)
FROM orders
GROUP BY status;

To fix it, either group by both columns (which changes the grouping) or aggregate order_date (for example, MIN(order_date) or MAX(order_date)) depending on what you mean.

Grouping by Multiple Columns

You can group by more than one column to create more detailed summaries.

-- Orders per status per sales_channel
SELECT status,
       sales_channel,
       COUNT(*) AS order_count
FROM orders
GROUP BY status, sales_channel;

Now each output row represents a unique combination of status and sales_channel.

Practical Step-by-Step: Build a Summary Report

Imagine you have an orders table with columns like order_id, customer_id, order_date, status, and order_total. You want a report showing, per month, how many completed orders happened and how much revenue they produced.

Step 1: Decide the grain (what each output row represents)

We want one row per month. That means our grouping key should represent a month. Different databases have different date functions; a common approach is to group by a month “bucket” derived from order_date.

If your database supports a month-truncation function, you might use it. If not, you can group by year and month extracted separately. The idea is the same: create a consistent month identifier.

-- Example approach: group by year and month extracted from order_date
SELECT EXTRACT(YEAR FROM order_date)  AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       COUNT(*) AS completed_orders,
       SUM(order_total) AS completed_revenue
FROM orders
WHERE status = 'COMPLETED'
GROUP BY EXTRACT(YEAR FROM order_date),
         EXTRACT(MONTH FROM order_date)
ORDER BY order_year, order_month;

This produces one row per year-month, with counts and revenue totals for completed orders.

Step 2: Add additional metrics carefully

Suppose you also want average order value and the largest order in each month. You can add AVG and MAX without changing the grouping.

SELECT EXTRACT(YEAR FROM order_date)  AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       COUNT(*) AS completed_orders,
       SUM(order_total) AS completed_revenue,
       AVG(order_total) AS avg_order_value,
       MAX(order_total) AS largest_order
FROM orders
WHERE status = 'COMPLETED'
GROUP BY EXTRACT(YEAR FROM order_date),
         EXTRACT(MONTH FROM order_date)
ORDER BY order_year, order_month;

All these aggregates are computed within each month group.

Step 3: Validate results with quick checks

When you build summary queries, it is easy to accidentally change the grouping or filter. A simple validation approach is to compare totals:

The sum of monthly completed_orders should equal the overall count of completed orders for the same period.
The sum of monthly completed_revenue should equal the overall revenue for the same period.

-- Overall totals for the same filter
SELECT COUNT(*) AS completed_orders,
       SUM(order_total) AS completed_revenue
FROM orders
WHERE status = 'COMPLETED';

If the totals do not match, you may have grouped by the wrong key, filtered differently, or introduced duplicates upstream.

Filtering Groups with HAVING

WHERE filters rows before grouping. HAVING filters groups after aggregates are computed. Use HAVING when the condition depends on an aggregate value such as COUNT or SUM.

-- Find customers with at least 5 completed orders
SELECT customer_id,
       COUNT(*) AS completed_orders
FROM orders
WHERE status = 'COMPLETED'
GROUP BY customer_id
HAVING COUNT(*) >= 5;

Notice the division of responsibilities:

WHERE status = 'COMPLETED' decides which rows are eligible.
GROUP BY customer_id defines the groups.
HAVING COUNT(*) >= 5 removes groups that do not meet the threshold.

Many beginners try to write WHERE COUNT(*) >= 5, which does not work because WHERE runs before aggregation.

HAVING with Multiple Conditions

You can combine multiple aggregate conditions in HAVING.

-- Products that sold at least 100 units and generated at least 10,000 in revenue
SELECT product_id,
       SUM(quantity) AS units_sold,
       SUM(quantity * unit_price) AS revenue
FROM order_items
GROUP BY product_id
HAVING SUM(quantity) >= 100
   AND SUM(quantity * unit_price) >= 10000;

Counting Distinct Values

Sometimes you want to count unique values rather than rows. For example, “How many distinct customers placed an order this month?” That is different from “How many orders were placed?” Use COUNT(DISTINCT ...).

-- Distinct customers who ordered
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date <  '2025-02-01';

Be aware that COUNT(DISTINCT ...) can be more expensive than COUNT(*) on large datasets, because the database must deduplicate values.

NULLs and Aggregations: Common Pitfalls

COUNT and NULL

COUNT(column) ignores NULLs. This is useful, but it can surprise you if you assume it counts rows. If you need “rows,” use COUNT(*).

SUM/AVG and NULL

SUM and AVG ignore NULLs. If a column is NULL because the value is unknown, ignoring it might be correct. If a column is NULL because “no charge” should be treated as zero, you may want to convert NULL to 0.

-- Treat missing discount as 0 for the purpose of total discount
SELECT SUM(COALESCE(discount_amount, 0)) AS total_discount
FROM orders;

Use this carefully: replacing NULL with 0 changes the meaning of averages (it increases the number of values included).

Grouping by Derived Values (Buckets)

Grouping is not limited to raw columns. You can group by expressions that create categories, such as price ranges, time buckets, or “high/medium/low” labels.

Example: Group Orders into Value Bands

SELECT CASE
         WHEN order_total < 50 THEN 'LOW'
         WHEN order_total < 200 THEN 'MEDIUM'
         ELSE 'HIGH'
       END AS order_value_band,
       COUNT(*) AS orders,
       AVG(order_total) AS avg_value
FROM orders
WHERE status = 'COMPLETED'
GROUP BY CASE
           WHEN order_total < 50 THEN 'LOW'
           WHEN order_total < 200 THEN 'MEDIUM'
           ELSE 'HIGH'
         END
ORDER BY orders DESC;

Many databases let you reference the alias (order_value_band) in GROUP BY, but not all do. The most portable approach is to repeat the expression in GROUP BY as shown.

Ordering Grouped Results and Finding “Top N” Groups

After grouping, you often want to sort groups by an aggregate metric, such as highest revenue or most orders. You can order by an aggregate expression or its alias.

-- Top 10 customers by revenue
SELECT customer_id,
       SUM(order_total) AS revenue
FROM orders
WHERE status = 'COMPLETED'
GROUP BY customer_id
ORDER BY revenue DESC
FETCH FIRST 10 ROWS ONLY;

Different databases use different syntax for limiting rows (for example, LIMIT, TOP, or FETCH FIRST). The key idea is the same: group first, then sort by the aggregate, then take the top results.

Multiple Aggregates in One Query (and Why It’s Efficient)

A common beginner mistake is to run many separate queries: one for count, one for sum, one for average. SQL can compute multiple aggregates in a single pass over the data, which is usually more efficient and keeps metrics consistent because they are computed from the same filtered set.

-- Several KPIs in one query
SELECT COUNT(*) AS orders,
       COUNT(DISTINCT customer_id) AS unique_customers,
       SUM(order_total) AS revenue,
       AVG(order_total) AS avg_order_value,
       MIN(order_total) AS smallest_order,
       MAX(order_total) AS largest_order
FROM orders
WHERE status = 'COMPLETED'
  AND order_date >= '2025-01-01'
  AND order_date <  '2026-01-01';

This returns one row with multiple metrics that are aligned by the same filters.

Advanced Pattern: Conditional Aggregation

Conditional aggregation means computing different metrics for different conditions within the same grouped query. This is extremely useful for building pivot-like summaries without special pivot features.

The idea is: inside SUM or COUNT, use a CASE expression that returns a value only when a condition is met.

Example: Count Orders by Status per Month

SELECT EXTRACT(YEAR FROM order_date)  AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       SUM(CASE WHEN status = 'COMPLETED' THEN 1 ELSE 0 END) AS completed_orders,
       SUM(CASE WHEN status = 'CANCELLED' THEN 1 ELSE 0 END) AS cancelled_orders,
       SUM(CASE WHEN status = 'PENDING'   THEN 1 ELSE 0 END) AS pending_orders
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date),
         EXTRACT(MONTH FROM order_date)
ORDER BY order_year, order_month;

This produces one row per month with separate columns for each status count. It is often easier to chart than a “tall” result with one row per status per month.

Example: Revenue Split by Channel

SELECT EXTRACT(YEAR FROM order_date)  AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month,
       SUM(CASE WHEN sales_channel = 'WEB'   AND status = 'COMPLETED' THEN order_total ELSE 0 END) AS web_revenue,
       SUM(CASE WHEN sales_channel = 'STORE' AND status = 'COMPLETED' THEN order_total ELSE 0 END) AS store_revenue
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date),
         EXTRACT(MONTH FROM order_date)
ORDER BY order_year, order_month;

Because the CASE returns 0 for rows that do not match, the SUM still works and you avoid NULL totals.

Common Mistakes and How to Avoid Them

Mixing Row Filters and Group Filters

If you want to exclude rows before grouping, use WHERE. If you want to exclude groups based on aggregate results, use HAVING. A practical way to remember: WHERE talks about columns in individual rows; HAVING talks about aggregated results.

Accidentally Changing the Grouping Level

Adding a column to SELECT often forces you to add it to GROUP BY, which changes the meaning of the result. Before adding a column, ask: “Do I want one row per (existing group key + this new column)?” If not, you probably need an aggregate (like MIN/MAX) or a different query structure.

Assuming Aggregates Include NULLs

If missing values matter, decide explicitly how to treat them. For example, if a missing tax_amount should be treated as 0, use COALESCE(tax_amount, 0) inside the aggregate. If missing means “unknown,” leaving it as NULL and letting aggregates ignore it may be more appropriate.

Forgetting That GROUP BY Happens After WHERE

If you filter out rows in WHERE, they are gone before grouping. This is usually what you want, but it can surprise you when you try to compute “total orders” and “completed orders” in the same grouped query. In that case, conditional aggregation is often the right tool because it keeps all rows and counts different subsets in separate columns.

Now answer the exercise about the content:

In a query that groups results by customer_id, how do you correctly keep only customers who have at least 5 completed orders?

You are right! Congratulations, now go to the next page

You missed! Try again.

WHERE filters individual rows before grouping (for example, status = COMPLETED). HAVING filters groups after aggregates are computed, so HAVING COUNT(*) >= 5 keeps only customer groups meeting the threshold.