From analysis query to dashboard-ready dataset
A dashboard-ready dataset is a query output designed to be consumed repeatedly with minimal interpretation. It should have stable dimensions (fields used to slice and filter), clearly defined metrics (numeric fields to aggregate), and consistent date handling (so time series align across reports). The goal is not just to “get the right number once,” but to produce a dependable table-like output that stays correct as data grows and as business users slice it in different ways.
Design the final dataset: grain, dimensions, metrics, dates
Before writing SQL, define the dataset contract. This prevents accidental duplication, missing rows, or shifting definitions when someone adds a new join. A condensed sketch of such a contract follows the list below.
Grain (one row represents…): e.g., “one row per day, per product category, per region.” The grain determines what keys must be unique in the final output.
Stable dimensions: dimensions should be consistent, slowly changing when possible, and sourced from a clear system of record (e.g., region from a customer dimension table, not inferred from free-text fields).
Metrics: define each metric precisely (what is included/excluded, which statuses count, whether refunds reduce revenue, etc.). Avoid mixing different grains in the same metric (e.g., order-level revenue with line-item-level counts) unless you pre-aggregate appropriately.
Date handling: choose a single “reporting date” per metric (order_date vs ship_date vs invoice_date), define timezone rules, and ensure consistent truncation to day/week/month.
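Written down as a comment header, the contract for the dataset built in this article might look like the following condensed sketch (the fully documented version appears at the end of this section):
/* Contract: daily_sales_dashboard_v1
   Grain: one row per day x product category x region
   Dimensions: category (from dim_product), region (from dim_customer)
   Metrics: units, revenue (paid/fulfilled orders only; cancelled items excluded)
   Reporting date: order_placed_at in the business timezone, truncated to day */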
Step-by-step: build a dashboard dataset with controlled joins
The pattern below builds a daily sales dataset at a fixed grain. It uses pre-aggregation to protect the grain, then joins to dimensions that are guaranteed to be one-to-one at the join key.
Step 1: define parameters and the reporting calendar
Dashboards often need a complete date spine so that days with zero activity still appear. If you have a calendar table, use it. If not, you can generate a spine inline, as sketched below; without one, missing dates will not show as zeros.
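A minimal inline spine, assuming your warehouse supports standard recursive CTEs (in PostgreSQL, generate_series is a common alternative):
/* Inline date spine via recursive CTE (sketch) */
WITH RECURSIVE date_spine AS (
    SELECT DATE '2025-01-01' AS report_date
    UNION ALL
    SELECT CAST(report_date + INTERVAL '1' DAY AS DATE)
    FROM date_spine
    WHERE report_date < DATE '2025-01-31'
)
SELECT report_date
FROM date_spine
The main example below assumes a dim_calendar table exists.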
/* Dataset: daily_sales_dashboard_v1 (one row per day x category x region) */
/* Reporting date: order_placed_at converted to business timezone, truncated to day */
/* Revenue definition: sum of paid order item amounts, excluding cancelled items */
WITH params AS (
SELECT
DATE '2025-01-01' AS start_date,
DATE '2025-01-31' AS end_date
),
calendar AS (
SELECT c.calendar_date AS report_date
FROM dim_calendar c
JOIN params p
ON c.calendar_date BETWEEN p.start_date AND p.end_date
)
Step 2: pre-aggregate the fact table to the target grain
Pre-aggregate at the intended grain before joining to dimensions that might multiply rows (for example, joining orders to order_items without aggregation will change the grain). If your base data is already at the right grain, this step can be simpler.
, order_items_clean AS (
SELECT
oi.order_id,
oi.product_id,
oi.quantity,
oi.extended_amount,
oi.item_status
FROM fct_order_items oi
WHERE oi.item_status NOT IN ('CANCELLED')
),
orders_clean AS (
SELECT
o.order_id,
o.customer_id,
/* Example timezone handling; adjust to your warehouse functions */
CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date,
o.order_status
FROM fct_orders o
WHERE o.order_status IN ('PAID','FULFILLED')
),
base_fact AS (
/* Pre-aggregate to day x product x customer (intermediate grain) */
SELECT
oc.report_date,
oc.customer_id,
oic.product_id,
SUM(oic.quantity) AS units,
SUM(oic.extended_amount) AS revenue
FROM orders_clean oc
JOIN order_items_clean oic
ON oc.order_id = oic.order_id
GROUP BY
oc.report_date,
oc.customer_id,
oic.product_id
)
Step 3: attach stable dimensions with one-to-one joins
Join to dimensions that are unique on their keys. If a dimension is not unique (for example, multiple active rows per product_id), fix it upstream or choose a single row deterministically before joining.
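If you must pick a single row yourself, a window function makes the choice deterministic. A sketch, assuming a hypothetical updated_at column as the tie-breaker:
/* Sketch: deterministic one-row-per-key dedup before joining */
dim_product_dedup AS (
    SELECT product_id, category, brand
    FROM (
        SELECT
            p.product_id,
            p.category,
            p.brand,
            /* updated_at is a hypothetical tie-breaker; use whatever your dimension provides */
            ROW_NUMBER() OVER (PARTITION BY p.product_id ORDER BY p.updated_at DESC) AS rn
        FROM dim_product p
        WHERE p.is_current = TRUE
    ) ranked
    WHERE rn = 1
)
In the query below, the is_current filter alone is assumed to guarantee uniqueness.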
, dim_product_one AS (
SELECT
p.product_id,
p.category,
p.brand
FROM dim_product p
WHERE p.is_current = TRUE
),
dim_customer_one AS (
SELECT
c.customer_id,
c.region,
c.segment
FROM dim_customer c
WHERE c.is_current = TRUE
),
fact_enriched AS (
SELECT
bf.report_date,
dp.category,
dc.region,
SUM(bf.units) AS units,
SUM(bf.revenue) AS revenue,
COUNT(DISTINCT bf.customer_id) AS active_customers
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY
bf.report_date,
dp.category,
dc.region
)
Step 4: finalize with a date spine and consistent zeros
Left join the aggregated facts to the calendar so every date appears. Use consistent null-to-zero handling for metrics so BI tools don’t misinterpret missing rows.
, final_dataset AS (
SELECT
cal.report_date,
fe.category,
fe.region,
COALESCE(fe.units, 0) AS units,
COALESCE(fe.revenue, 0) AS revenue,
COALESCE(fe.active_customers, 0) AS active_customers
FROM calendar cal
LEFT JOIN fact_enriched fe
ON cal.report_date = fe.report_date
)
SELECT *
FROM final_dataset
Note: If you need a complete grid of date x category x region (including combinations with no sales), you must generate that grid explicitly (date spine × dimension values) and then left join facts onto it. Otherwise, only combinations that exist in fact_enriched will appear for each date.
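A sketch of that explicit grid, reusing the CTEs above and assuming the distinct category and region lists are small enough to cross join:
/* Replace final_dataset with an explicit date x category x region grid */
, dim_grid AS (
    SELECT
        cal.report_date,
        cat.category,
        reg.region
    FROM calendar cal
    CROSS JOIN (SELECT DISTINCT category FROM dim_product_one) cat
    CROSS JOIN (SELECT DISTINCT region FROM dim_customer_one) reg
),
final_dataset AS (
    SELECT
        g.report_date,
        g.category,
        g.region,
        COALESCE(fe.units, 0) AS units,
        COALESCE(fe.revenue, 0) AS revenue,
        COALESCE(fe.active_customers, 0) AS active_customers
    FROM dim_grid g
    LEFT JOIN fact_enriched fe
        ON g.report_date = fe.report_date
        AND g.category = fe.category
        AND g.region = fe.region
)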
Consistent date handling for BI
Date inconsistencies are a common reason dashboards disagree. Establish a single rule per dataset and apply it everywhere in the query.
Choose the business timezone: convert timestamps before truncating to date. Truncating in UTC and then converting can shift transactions across days (illustrated after this list).
Define the reporting date per metric: revenue might be based on payment date, while shipments use ship date. If you mix them, label them explicitly (report_date_paid, report_date_shipped) or build separate datasets.
Use a calendar dimension: it standardizes week starts, fiscal periods, holidays, and makes “missing days” visible.
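A minimal illustration of the truncation-order rule, mirroring the timezone syntax used above (exact functions vary by warehouse):
/* An order placed at 2025-01-15 03:30 UTC belongs to 2025-01-14 in New York */
SELECT
    o.order_id,
    CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS business_day, /* 2025-01-14 */
    CAST(o.order_placed_at AS DATE) AS utc_day /* 2025-01-15: truncated before converting */
FROM fct_orders o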
Validation techniques: prove the dataset is correct
Validation is not a one-time step. Treat it as part of building: every time you add a join, filter, or metric, run checks that would catch row multiplication, missing records, or definition drift.
1) Reconcile totals to the source system (control totals)
Pick a small set of “control totals” that should match a trusted source (or a simpler query). Reconcile at multiple levels: overall, by day, and by a major dimension (e.g., region).
/* Control total: revenue by day from the simplest trusted logic */
WITH trusted AS (
SELECT
CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date,
SUM(oi.extended_amount) AS trusted_revenue
FROM fct_orders o
JOIN fct_order_items oi
ON o.order_id = oi.order_id
WHERE o.order_status IN ('PAID','FULFILLED')
AND oi.item_status NOT IN ('CANCELLED')
AND CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE)
BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY CAST(o.order_placed_at AT TIME ZONE 'America/New_York' AS DATE)
),
dashboard AS (
SELECT report_date, SUM(revenue) AS dashboard_revenue
FROM daily_sales_dashboard_v1
WHERE report_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY report_date
)
SELECT
t.report_date,
t.trusted_revenue,
d.dashboard_revenue,
(d.dashboard_revenue - t.trusted_revenue) AS diff
FROM trusted t
JOIN dashboard d
ON t.report_date = d.report_date
ORDER BY t.report_date
If differences exist, investigate whether the dataset changed the definition (filters/statuses), changed the date logic (timezone/truncation), or introduced duplication via joins.
2) Compare aggregates across join paths (detect duplication)
A powerful duplication test is to compute the same metric through two logically equivalent paths and compare results. If they differ, a join is multiplying rows or filtering unexpectedly.
/* Path A: revenue from base_fact (already pre-aggregated) */
WITH a AS (
SELECT report_date, SUM(revenue) AS rev_a
FROM base_fact
GROUP BY report_date
),
/* Path B: revenue after joining dimensions (should not change totals) */
b AS (
SELECT bf.report_date, SUM(bf.revenue) AS rev_b
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY bf.report_date
)
SELECT
a.report_date,
a.rev_a,
b.rev_b,
(b.rev_b - a.rev_a) AS diff
FROM a
JOIN b
ON a.report_date = b.report_date
ORDER BY a.report_date
If rev_b is larger than rev_a, a dimension join is likely one-to-many. If rev_b is smaller, the joins are filtering out facts (missing dimension keys, an inner join where you needed a left join, or is_current filters removing valid keys).
3) Check duplicate keys in the final output (grain enforcement)
Your final dataset should have a unique key that matches the declared grain. Test it directly.
/* Expected unique key: report_date + category + region */
SELECT
report_date,
category,
region,
COUNT(*) AS row_count
FROM daily_sales_dashboard_v1
GROUP BY report_date, category, region
HAVING COUNT(*) > 1
ORDER BY row_count DESC
If this returns rows, your dataset is not at the grain you think it is. Common causes: joining to a non-unique dimension, including additional hidden dimensions (like brand) without grouping, or mixing customer-level and product-level facts.
4) Sanity-check queries for fast QA
Sanity checks are quick queries you run every time you modify the dataset. They help catch obvious issues early.
Row counts by date: sudden spikes can indicate duplication; sudden drops can indicate filtering.
SELECT
report_date,
COUNT(*) AS row_count, /* "rows" is a reserved word in some dialects */
SUM(revenue) AS revenue
FROM daily_sales_dashboard_v1
GROUP BY report_date
ORDER BY report_date
Top contributors: verify the biggest categories/regions look plausible and stable.
SELECT
category,
SUM(revenue) AS revenue
FROM daily_sales_dashboard_v1
WHERE report_date BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
GROUP BY category
ORDER BY revenue DESC
FETCH FIRST 10 ROWS ONLY
Negative or impossible values: catch sign errors, mishandled refunds, or unit counts below zero.
SELECT *
FROM daily_sales_dashboard_v1
WHERE revenue < 0 OR units < 0
ORDER BY report_date
Null dimension coverage: if category or region is null, you may be losing mapping coverage or filtering incorrectly.
SELECT
SUM(CASE WHEN category IS NULL THEN 1 ELSE 0 END) AS null_category_rows,
SUM(CASE WHEN region IS NULL THEN 1 ELSE 0 END) AS null_region_rows
FROM daily_sales_dashboard_v1
Documentation inside SQL: definitions, assumptions, and change safety
Dashboard datasets become shared assets. Documenting business definitions and assumptions inside the SQL makes the dataset auditable and reduces “metric drift” when someone edits it later.
What to document in comments
Dataset purpose and grain: “one row per day x category x region.”
Metric definitions: what is included/excluded, status filters, refund handling, currency assumptions.
Date logic: timezone conversion, which timestamp is used, fiscal calendar mapping.
Join expectations: which joins must be one-to-one, and what to do if dimension keys are missing (a runnable uniqueness check follows this list).
Known limitations: e.g., “region is based on current customer profile; historical region changes are not backfilled.”
Validation checklist: the specific control totals and duplicate-key checks you run.
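The one-to-one expectation is easiest to keep honest when the comment is paired with a check anyone can run; a sketch for the product dimension:
/* Must return zero rows if dim_product is unique per current product_id */
SELECT product_id, COUNT(*) AS row_count
FROM dim_product
WHERE is_current = TRUE
GROUP BY product_id
HAVING COUNT(*) > 1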
Example: a documented, maintainable dataset query
/* daily_sales_dashboard_v1
Purpose: Dashboard-ready daily sales metrics for BI slicing by category and region.
Grain (unique key): report_date, category, region
Reporting date:
- report_date = order_placed_at converted to America/New_York, truncated to DATE.
Revenue definition:
- Sum of fct_order_items.extended_amount
- Included orders: order_status IN ('PAID','FULFILLED')
- Excluded items: item_status = 'CANCELLED'
Dimension rules:
- dim_product and dim_customer must be unique on their IDs for is_current = TRUE.
- If dimension coverage gaps appear, investigate missing keys before changing joins.
Validation (run on every change):
1) Reconcile revenue by day to trusted base query for the same date range.
2) Compare totals before vs after dimension joins (should match).
3) Duplicate key check on (report_date, category, region) must return zero rows.
*/
WITH params AS (
SELECT DATE '2025-01-01' AS start_date, DATE '2025-01-31' AS end_date
),
calendar AS (
SELECT c.calendar_date AS report_date
FROM dim_calendar c
JOIN params p
ON c.calendar_date BETWEEN p.start_date AND p.end_date
),
order_items_clean AS (
SELECT order_id, product_id, quantity, extended_amount
FROM fct_order_items
WHERE item_status <> 'CANCELLED'
),
orders_clean AS (
SELECT
order_id,
customer_id,
CAST(order_placed_at AT TIME ZONE 'America/New_York' AS DATE) AS report_date
FROM fct_orders
WHERE order_status IN ('PAID','FULFILLED')
),
base_fact AS (
SELECT
oc.report_date,
oc.customer_id,
oic.product_id,
SUM(oic.quantity) AS units,
SUM(oic.extended_amount) AS revenue
FROM orders_clean oc
JOIN order_items_clean oic
ON oc.order_id = oic.order_id
GROUP BY oc.report_date, oc.customer_id, oic.product_id
),
dim_product_one AS (
SELECT product_id, category
FROM dim_product
WHERE is_current = TRUE
),
dim_customer_one AS (
SELECT customer_id, region
FROM dim_customer
WHERE is_current = TRUE
),
fact_enriched AS (
SELECT
bf.report_date,
dp.category,
dc.region,
SUM(bf.units) AS units,
SUM(bf.revenue) AS revenue,
COUNT(DISTINCT bf.customer_id) AS active_customers
FROM base_fact bf
JOIN dim_product_one dp
ON bf.product_id = dp.product_id
JOIN dim_customer_one dc
ON bf.customer_id = dc.customer_id
GROUP BY bf.report_date, dp.category, dc.region
)
SELECT
cal.report_date,
fe.category,
fe.region,
COALESCE(fe.units, 0) AS units,
COALESCE(fe.revenue, 0) AS revenue,
COALESCE(fe.active_customers, 0) AS active_customers
FROM calendar cal
LEFT JOIN fact_enriched fe
ON cal.report_date = fe.report_date