Why this vocabulary matters in BI modeling
Most reporting errors come from unclear definitions of what a dataset row represents and how numbers should be aggregated. BI vocabulary helps you model data so that filters, joins, and totals behave predictably.
Facts vs. dimensions
Fact (fact table)
A fact is a record of a business event or observation, usually containing numeric values you want to aggregate (e.g., revenue, quantity) and foreign keys pointing to descriptive context (dimensions).
Examples of facts:
- Order line (a product sold on an order)
- Invoice line
- Shipment event
- Website session
- Daily inventory snapshot
Dimension (dimension table)
A dimension provides descriptive context for facts: who, what, when, where, how. Dimensions are used for filtering, grouping, and labeling.
Examples of dimensions:
- Listen to the audio with the screen off.
- Earn a certificate upon completion.
- Over 5000 courses for you to explore!
Download the app
- Date (day, week, month, fiscal period)
- Product (SKU, brand, category)
- Customer (segment, industry)
- Region (country, state, sales territory)
- Sales rep (team, manager)
How to tell them apart
- If it answers “how much/how many/how long,” it is typically a fact (or a measure inside a fact).
- If it answers “which one/what kind/where/when/who,” it is typically a dimension (or an attribute inside a dimension).
Measures vs. attributes
Measure
A measure is a numeric field intended for aggregation (sum, average, min/max, etc.). Measures usually live in fact tables.
Examples: sales_amount, quantity, discount_amount, shipping_cost.
Attribute
An attribute is a descriptive field used for grouping/filtering. Attributes usually live in dimension tables, but sometimes appear in facts as degenerate dimensions (e.g., order number).
Examples: product_category, customer_segment, region_name, order_status.
Common confusion: “numeric attribute” vs “measure”
Not every numeric field is a measure. A numeric code (e.g., store_id, postal_code) is an attribute/key, not a measure. Aggregating it (SUM of IDs) is meaningless.
Grain: what a row represents
Grain (also called granularity) is the precise definition of what one row in a table represents. A correct grain statement prevents double counting and broken joins.
Examples of grain statements
- Order-line fact: “One row per order_id + line_number.”
- Daily sales snapshot: “One row per store_id + date.”
- Customer dimension: “One row per customer_id (current view).”
- Customer history (SCD Type 2): “One row per customer_id + effective_start_date.”
Why grain drives everything
- It determines the primary key (unique identifier) of the table.
- It determines which joins are safe (1-to-many vs many-to-many).
- It determines which measures can be summed without distortion.
Guided example: Sales at order-line grain
Step 1: Define the grain
Grain statement: “One row per order line.” Practically: one row per order_id + line_number.
Step 2: Identify keys and measures
Fact table: fact_order_line
- Primary key (composite):
(order_id, line_number) - Foreign keys to dimensions:
order_date_key,product_key,customer_key,region_key - Measures:
quantity,net_sales_amount,discount_amount
Step 3: Add dimensions for context
Dimension tables:
dim_date(date_key, date, month, quarter, year, fiscal_period, ...)dim_product(product_key, sku, brand, category, ...)dim_customer(customer_key, customer_id, segment, ...)dim_region(region_key, country, state, territory, ...)
Step 4: Visualize the star join paths
fact_order_line --(order_date_key)--> dim_date
fact_order_line --(product_key)-----> dim_product
fact_order_line --(customer_key)----> dim_customer
fact_order_line --(region_key)------> dim_regionEach dimension join should be many fact rows to one dimension row (many-to-one). If a join becomes many-to-many, totals can inflate.
How aggregation works (and where it breaks)
Basic aggregation: SUM
If your grain is order-line, then summing line-level measures is usually valid:
-- Total net sales by month and category
SELECT
d.year,
d.month,
p.category,
SUM(f.net_sales_amount) AS total_net_sales
FROM fact_order_line f
JOIN dim_date d ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, d.month, p.category;This works because net_sales_amount is additive across order lines, dates, products, and customers.
Counting: COUNT vs COUNT DISTINCT
At order-line grain:
COUNT(*)counts lines (how many line items).COUNT(DISTINCT order_id)counts orders (how many orders).
-- Orders vs lines by region
SELECT
r.territory,
COUNT(*) AS line_count,
COUNT(DISTINCT f.order_id) AS order_count
FROM fact_order_line f
JOIN dim_region r ON f.region_key = r.region_key
GROUP BY r.territory;Practical tip: If a business question is “number of orders,” using COUNT(*) on an order-line fact will overcount whenever orders have multiple lines.
Where aggregation fails: double counting from join duplication
A classic failure happens when you join a fact table to a table that has multiple matching rows per key (many-to-many or one-to-many in the wrong direction).
Example problem: You join fact_order_line to a customer_contacts table that has multiple contacts per customer. Each order line is duplicated for each contact, inflating sums.
-- Dangerous join: duplicates fact rows
SELECT
c.customer_id,
SUM(f.net_sales_amount) AS net_sales
FROM fact_order_line f
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN customer_contacts cc ON c.customer_id = cc.customer_id
GROUP BY c.customer_id;Fix patterns:
- Aggregate contacts to one row per customer before joining.
- Join to a dimension that is guaranteed one row per key.
- Use bridge tables with explicit weighting only when required and well understood.
Where aggregation fails: mixed grains in the same query
Mixed grain means combining measures recorded at different row meanings without aligning them.
Example:
fact_order_line: one row per order line (transactional)fact_daily_inventory: one row per product + store + day (snapshot)
If you join them directly on product and date, each inventory row may match many order lines, duplicating inventory values (or duplicating sales, depending on join direction).
Safer approach: Aggregate each fact to a common grain first, then join the aggregated results.
-- Align grains by aggregating sales to product-day first
WITH sales_by_product_day AS (
SELECT
f.product_key,
f.order_date_key,
SUM(f.quantity) AS qty_sold,
SUM(f.net_sales_amount) AS net_sales
FROM fact_order_line f
GROUP BY f.product_key, f.order_date_key
)
SELECT
d.date,
p.sku,
s.qty_sold,
i.on_hand_qty
FROM sales_by_product_day s
JOIN fact_daily_inventory i
ON s.product_key = i.product_key
AND s.order_date_key = i.date_key
JOIN dim_date d ON s.order_date_key = d.date_key
JOIN dim_product p ON s.product_key = p.product_key;Additive, semi-additive, and non-additive measures
| Measure type | Definition | Examples | Typical safe aggregations |
|---|---|---|---|
| Additive | Can be summed across all relevant dimensions (including time) | Net sales, quantity, tax amount | SUM across product, customer, region, date |
| Semi-additive | Can be summed across some dimensions but not time | Account balance, inventory on hand (snapshot) | SUM across products/stores at a point in time; use LAST/AVG over time |
| Non-additive | Cannot be summed meaningfully | Ratios (margin %), unit price, conversion rate | Recompute from additive components (e.g., SUM(margin)/SUM(sales)) |
Step-by-step: handling a non-additive measure (margin %)
Suppose you store margin_amount and net_sales_amount at order-line grain. To report margin % by category:
- Do not average line-level margin % (it weights each line equally, not by sales).
- Compute margin % from aggregated totals.
SELECT
p.category,
SUM(f.margin_amount) / NULLIF(SUM(f.net_sales_amount), 0) AS margin_pct
FROM fact_order_line f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;Mini-checklist: validate a dataset before building reports
1) Grain statement
- Write a one-sentence grain statement for every table you will use.
- Confirm the grain matches the business question (orders vs order lines vs customers vs daily snapshots).
2) Primary keys and uniqueness
- Identify the primary key (or composite key) that should uniquely identify each row.
- Test uniqueness (no duplicates). If duplicates exist, decide whether they are valid (e.g., history tables) or a data quality issue.
3) Join paths and cardinality
- For each join, state expected cardinality: many-to-one (preferred for fact-to-dimension), one-to-one, or many-to-many (risk).
- Check whether joins duplicate fact rows (a quick test is comparing
SUM(net_sales_amount)before and after joins). - Ensure there is a single, unambiguous path from facts to dimensions (avoid accidental fan-out via multiple routes).
4) Measure additivity rules
- Label each measure as additive, semi-additive, or non-additive.
- Define the correct aggregation function per measure (SUM, MAX, LAST, AVG, recompute ratio).
- For counts, specify the entity being counted (lines, orders, customers) and whether
COUNT DISTINCTis required.
5) Mixed-grain detection
- If a query uses more than one fact-like table, confirm they share a common grain or are aggregated to a common grain before joining.
- Watch for snapshot vs transaction combinations (inventory/balances with sales/events).