From analysis query to dashboard-ready dataset
A dashboard-ready dataset is a query output designed to be consumed repeatedly with minimal interpretation. It should have stable dimensions (fields used to slice and filter), clearly defined metrics (numeric fields to aggregate), and consistent date handling (so time series align across reports). The goal is not just to “get the right number once,” but to produce a dependable table-like output that stays correct as data grows and as business users slice it in different ways.
Design the final dataset: grain, dimensions, metrics, dates
Before writing SQL, define the dataset contract. This prevents accidental duplication, missing rows, or shifting definitions when someone adds a new join. A condensed sketch of such a contract follows the list below.
Grain (one row represents…): e.g., “one row per day, per product category, per region.” The grain determines what keys must be unique in the final output.
Stable dimensions: dimensions should be consistent, slowly changing when possible, and sourced from a clear system of record (e.g., region from a customer dimension table, not inferred from free-text fields).
Metrics: define each metric precisely (what is included/excluded, which statuses count, whether refunds reduce revenue, etc.). Avoid mixing different grains in the same metric (e.g., order-level revenue with line-item-level counts) unless you pre-aggregate appropriately.
Date handling: choose a single “reporting date” per metric (order_date vs ship_date vs invoice_date), define timezone rules, and ensure consistent truncation to day/week/month.
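Written down as a comment header, the contract for the dataset built in this article might look like the following condensed sketch (the fully documented version appears at the end of this section):
/* Contract: daily_sales_dashboard_v1
   Grain: one row per day x product category x region
   Dimensions: category (from dim_product), region (from dim_customer)
   Metrics: units, revenue (paid/fulfilled orders only; cancelled items excluded)
   Reporting date: order_placed_at in the business timezone, truncated to day */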
Step-by-step: build a dashboard dataset with controlled joins
The pattern below builds a daily sales dataset at a fixed grain. It uses pre-aggregation to protect the grain, then joins to dimensions that are guaranteed to be one-to-one at the join key.
Step 1: define parameters and the reporting calendar
Dashboards often need a complete date spine so that days with zero activity still appear. If you have a calendar table, use it. If not, you can generate a spine inline, as sketched below; without one, missing dates will not show as zeros.
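A minimal inline spine, assuming your warehouse supports standard recursive CTEs (in PostgreSQL, generate_series is a common alternative):
/* Inline date spine via recursive CTE (sketch) */
WITH RECURSIVE date_spine AS (
    SELECT DATE '2025-01-01' AS report_date
    UNION ALL
    SELECT CAST(report_date + INTERVAL '1' DAY AS DATE)
    FROM date_spine
    WHERE report_date < DATE '2025-01-31'
)
SELECT report_date
FROM date_spine
The main example below assumes a dim_calendar table exists.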
/* Dataset: daily_sales_dashboard_v1 (one row per day x category x region) */
/* Reporting date: order_placed_at converted to business timezone, truncated to day */
/* Revenue definition: sum of paid order item amounts, excluding cancelled items */
WITH params AS (
SELECT
DATE '2025-01-01' AS start_date,
DATE '2025-01-31' AS end_date
),
calendar AS (
SELECT c.calendar_date AS report_date
FROM dim_calendar c
JOIN params p
ON c.calendar_date BETWEEN p.start_date AND p.end_date
)
Step 2: pre-aggregate the fact table to the target grain
Pre-aggregate at the intended grain before joining to dimensions that might multiply rows (for example, joining orders to order_items without aggregation will change the grain). If your base data is already at the right grain, this step can be simpler.
, order_items_clean AS (
SELECT
oi.order_id,
oi.product_id,
oi.quantity,
oi.extended_amount,
oi.item_status
FROM fct_order_items oi
WHERE oi.item_status NOT IN ('CANCELLED')
),
orders_clean AS (
SELECT
o.order_id,
o.customer_id,
/* Example timezone handling; adjust to your warehouse functions */
CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date,
o.order_status
FROM fct_orders o
WHERE o.order_status IN ('PAID','FULFILLED')
),
base_fact AS (
/* Pre-aggregate to day x product x customer (intermediate grain) */
SELECT
oc.report_date,
oc.customer_id,
oic.product_id,
SUM(oic.quantity) AS units,
SUM(oic.extended_amount) AS revenue
FROM orders_clean oc
JOIN order_items_clean oic
ON oc.order_id = oic.order_id
GROUP BY
oc.report_date,
oc.customer_id,
oic.product_id
)
Step 3: attach stable dimensions with one-to-one joins
Join to dimensions that are unique on their keys. If a dimension is not unique (for example, multiple active rows per product_id), fix it upstream or choose a single row deterministically before joining.
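If you must pick a single row yourself, a window function makes the choice deterministic. A sketch, assuming a hypothetical updated_at column as the tie-breaker:
/* Sketch: deterministic one-row-per-key dedup before joining */
dim_product_dedup AS (
    SELECT product_id, category, brand
    FROM (
        SELECT
            p.product_id,
            p.category,
            p.brand,
            /* updated_at is a hypothetical tie-breaker; use whatever your dimension provides */
            ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY p.updated_at DESC) AS rn
        FROM dim_product p
        WHERE p.is_current = TRUE
    ) ranked
    WHERE rn = 1
)
In the query below, the is_current filter alone is assumed to guarantee uniqueness.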
, dim_product_one AS (
SELECT
p.product_id,
p.category,
p.brand
FROM dim_product p
WHERE p.is_current = TRUE
),
dim_customer_one AS (
SELECT
c.customer_id,
c.region,
c.segment
FROM dim_customer c
WHERE c.is_current = TRUE
),
fact_enriched AS (
SELECT
bf.report_date,
dp.category,
dc.region,
SUM(bf.units) AS units,
SUM(bf.revenue) AS revenue,
COUNT(DISTINCT bf.customer_id) AS active_customers
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY
bf.report_date,
dp.category,
dc.region
)
Step 4: finalize with a date spine and consistent zeros
Left join the aggregated facts to the calendar so every date appears. Use consistent null-to-zero handling for metrics so BI tools don’t misinterpret missing rows.
, final_dataset AS (
SELECT
cal.report_date,
fe.category,
fe.region,
COALESCE(fe.units, 0) AS units,
COALESCE(fe.revenue, 0) AS revenue,
COALESCE(fe.active_customers, 0) AS active_customers
FROM calendar cal
LEFT JOIN fact_enriched fe
ON cal.report_date = fe.report_date
)
SELECT *
FROM final_dataset
Note: If you need a complete grid of date x category x region (including combinations with no sales), you must generate that grid explicitly (date spine × dimension values) and then left join facts onto it. Otherwise, only combinations that exist in fact_enriched will appear for each date.
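A sketch of that explicit grid, reusing the CTEs above and assuming the distinct category and region lists are small enough to cross join:
/* Replace final_dataset with an explicit date x category x region grid */
, dim_grid AS (
    SELECT
        cal.report_date,
        cat.category,
        reg.region
    FROM calendar cal
    CROSS JOIN (SELECT DISTINCT category FROM dim_product_one) cat
    CROSS JOIN (SELECT DISTINCT region FROM dim_customer_one) reg
),
final_dataset AS (
    SELECT
        g.report_date,
        g.category,
        g.region,
        COALESCE(fe.units, 0) AS units,
        COALESCE(fe.revenue, 0) AS revenue,
        COALESCE(fe.active_customers, 0) AS active_customers
    FROM dim_grid g
    LEFT JOIN fact_enriched fe
        ON g.report_date = fe.report_date
        AND g.category = fe.category
        AND g.region = fe.region
)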
Consistent date handling for BI
Date inconsistencies are a common reason dashboards disagree. Establish a single rule per dataset and apply it everywhere in the query.
Choose the business timezone: convert timestamps before truncating to date. Truncating in UTC and then converting can shift transactions across days (illustrated after this list).
Define the reporting date per metric: revenue might be based on payment date, while shipments use ship date. If you mix them, label them explicitly (report_date_paid, report_date_shipped) or build separate datasets.
Use a calendar dimension: it standardizes week starts, fiscal periods, holidays, and makes “missing days” visible.
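A minimal illustration of the truncation-order rule, mirroring the timezone syntax used above (exact functions vary by warehouse):
/* An order placed at 2025-01-15 03:30 UTC belongs to 2025-01-14 in New York */
SELECT
    o.order_id,
    CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS business_day, /* 2025-01-14 */
    CAST(o.order_placed_at AS DATE) AS utc_day /* 2025-01-15: truncated before converting */
FROM fct_orders o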
Validation techniques: prove the dataset is correct
Validation is not a one-time step. Treat it as part of building: every time you add a join, filter, or metric, run checks that would catch row multiplication, missing records, or definition drift.
1) Reconcile totals to the source system (control totals)
Pick a small set of “control totals” that should match a trusted source (or a simpler query). Reconcile at multiple levels: overall, by day, and by a major dimension (e.g., region).
/* Control total: revenue by day from the simplest trusted logic */
WITH trusted AS (
SELECT
CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date,
SUM(oi.extended_amount) AS trusted_revenue
FROM fct_orders o
JOIN fct_order_items oi
ON o.order_id = oi.order_id
WHERE o.order_status IN ('PAID','FULFILLED')
AND oi.item_status NOT IN ('CANCELLED')
AND CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE)
BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE)
),
dashboard AS (
SELECT report_date, SUM(revenue) AS dashboard_revenue
FROM daily_sales_dashboard_v1
WHERE report_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY report_date
)
SELECT
t.report_date,
t.trusted_revenue,
d.dashboard_revenue,
(d.dashboard_revenue - t.trusted_revenue) AS diff
FROM trusted t
JOIN dashboard d
ON t.report_date = d.report_date
ORDER BY t.report_date
If differences exist, investigate whether the dataset changed the definition (filters/statuses), changed the date logic (timezone/truncation), or introduced duplication via joins.
2) Compare aggregates across join paths (detect duplication)
A powerful duplication test is to compute the same metric through two logically equivalent paths and compare results. If they differ, a join is multiplying rows or filtering unexpectedly.
/* Path A: revenue from base_fact (already pre-aggregated) */
WITH a AS (
SELECT report_date, SUM(revenue) AS rev_a
FROM base_fact
GROUP BY report_date
),
/* Path B: revenue after joining dimensions (should not change totals) */
b AS (
SELECT bf.report_date, SUM(bf.revenue) AS rev_b
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY bf.report_date
)
SELECT
a.report_date,
a.rev_a,
b.rev_b,
(b.rev_b - a.rev_a) AS diff
FROM a
JOIN b
ON a.report_date = b.report_date
ORDER BY a.report_date
If rev_b is larger than rev_a, a dimension join is likely one-to-many. If rev_b is smaller, the joins are filtering out facts (missing dimension keys, an inner join where you needed a left join, or is_current filters removing valid keys).
3) Check duplicate keys in the final output (grain enforcement)
Your final dataset should have a unique key that matches the declared grain. Test it directly.
/* Expected unique key: report_date + category + region */
SELECT
report_date,
category,
region,
COUNT(*) AS row_count
FROM daily_sales_dashboard_v1
GROUP BY report_date, category, region
HAVING COUNT(*) > 1
ORDER BY row_count DESC
If this returns rows, your dataset is not at the grain you think it is. Common causes: joining to a non-unique dimension, including additional hidden dimensions (like brand) without grouping, or mixing customer-level and product-level facts.
4) Sanity-check queries for fast QA
Sanity checks are quick queries you run every time you modify the dataset. They help catch obvious issues early.
Row counts by date: sudden spikes can indicate duplication; sudden drops can indicate filtering.
SELECT
report_date,
COUNT(*) AS row_count, /* "rows" is a reserved word in some dialects */
SUM(revenue) AS revenue
FROM daily_sales_dashboard_v1
GROUP BY report_date
ORDER BY report_date
Top contributors: verify the biggest categories/regions look plausible and stable.
SELECT
category,
SUM(revenue) AS revenue
FROM daily_sales_dashboard_v1
WHERE report_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY category
ORDER BY revenue DESC
FETCH FIRST 10 ROWS ONLY
Negative or impossible values: catch sign errors, mishandled refunds, or unit counts below zero.
SELECT *
FROM daily_sales_dashboard_v1
WHERE revenue < 0 OR units < 0
ORDER BY report_date
Null dimension coverage: if category or region is null, you may be losing mapping coverage or filtering incorrectly.
SELECT
SUM(CASE WHEN category IS NULL THEN 1 ELSE 0 END) AS null_category_rows,
SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) AS null_region_rows
FROM daily_sales_dashboard_v1
Documentation inside SQL: definitions, assumptions, and change safety
Dashboard datasets become shared assets. Documenting business definitions and assumptions inside the SQL makes the dataset auditable and reduces “metric drift” when someone edits it later.
What to document in comments
Dataset purpose and grain: “one row per day x category x region.”
Metric definitions: what is included/excluded, status filters, refund handling, currency assumptions.
Date logic: timezone conversion, which timestamp is used, fiscal calendar mapping.
Join expectations: which joins must be one-to-one, and what to do if dimension keys are missing (a runnable uniqueness check follows this list).
Known limitations: e.g., “region is based on current customer profile; historical region changes are not backfilled.”
Validation checklist: the specific control totals and duplicate-key checks you run.
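The one-to-one expectation is easiest to keep honest when the comment is paired with a check anyone can run; a sketch for the product dimension:
/* Must return zero rows if dim_product is unique per current product_id */
SELECT product_id, COUNT(*) AS row_count
FROM dim_product
WHERE is_current = TRUE
GROUP BY product_id
HAVING COUNT(*) > 1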
Example: a documented, maintainable dataset query
/* daily_sales_dashboard_v1
Purpose: Dashboard-ready daily sales metrics for BI slicing by category and region.
Grain (unique key): report_date, category, region
Reporting date:
- report_date = order_placed_at converted to America/New_York, truncated to DATE.
Revenue definition:
- Sum of fct_order_items.extended_amount
- Included orders: order_status IN ('PAID','FULFILLED')
- Excluded items: item_status = 'CANCELLED'
Dimension rules:
- dim_product and dim_customer must be unique on their IDs for is_current = TRUE.
- If dimension coverage gaps appear, investigate missing keys before changing joins.
Validation (run on every change):
1) Reconcile revenue by day to trusted base query for the same date range.
2) Compare totals before vs after dimension joins (should match).
3) Duplicate key check on (report_date, category, region) must return zero rows.
*/
WITH params AS (
SELECT DATE '2025-01-01' AS start_date, DATE '2025-01-31' AS end_date
),
calendar AS (
SELECT c.calendar_date AS report_date
FROM dim_calendar c
JOIN params p
ON c.calendar_date BETWEEN p.start_date AND p.end_date
),
order_items_clean AS (
SELECT order_id, product_id, quantity, extended_amount
FROM fct_order_items
WHERE item_status <> 'CANCELLED'
),
orders_clean AS (
SELECT
order_id,
customer_id,
CAST(order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date
FROM fct_orders
WHERE order_status IN ('PAID','FULFILLED')
),
base_fact AS (
SELECT
oc.report_date,
oc.customer_id,
oic.product_id,
SUM(oic.quantity) AS units,
SUM(oic.extended_amount) AS revenue
FROM orders_clean oc
JOIN order_items_clean oic
ON oc.order_id = oic.order_id
GROUP BY oc.report_date, oc.customer_id, oic.product_id
),
dim_product_one AS (
SELECT product_id, category
FROM dim_product
WHERE is_current = TRUE
),
dim_customer_one AS (
SELECT customer_id, region
FROM dim_customer
WHERE is_current = TRUE
),
fact_enriched AS (
SELECT
bf.report_date,
dp.category,
dc.region,
SUM(bf.units) AS units,
SUM(bf.revenue) AS revenue,
COUNT(DISTINCT bf.customer_id) AS active_customers
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY bf.report_date, dp.category, dc.region
)
SELECT
cal.report_date,
fe.category,
fe.region,
COALESCE(fe.units, 0) AS units,
COALESCE(fe.revenue, 0) AS revenue,
COALESCE(fe.active_customers, 0) AS active_customers
FROM calendar cal
LEFT JOIN fact_enriched fe
ON cal.report_date = fe.report_date