All courses > Technology and Programming > Data Science and Business Intelligence ::

Core BI Vocabulary: Facts, Dimensions, Measures, and Grain

Capítulo 2

Estimated reading time: 8 minutes

Why this vocabulary matters in BI modeling

Most reporting errors come from unclear definitions of what a dataset row represents and how numbers should be aggregated. BI vocabulary helps you model data so that filters, joins, and totals behave predictably.

Facts vs. dimensions

Fact (fact table)

A fact is a record of a business event or observation, usually containing numeric values you want to aggregate (e.g., revenue, quantity) and foreign keys pointing to descriptive context (dimensions).

Examples of facts:

Order line (a product sold on an order)
Invoice line
Shipment event
Website session
Daily inventory snapshot

Dimension (dimension table)

A dimension provides descriptive context for facts: who, what, when, where, how. Dimensions are used for filtering, grouping, and labeling.

Examples of dimensions:

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

Date (day, week, month, fiscal period)
Product (SKU, brand, category)
Customer (segment, industry)
Region (country, state, sales territory)
Sales rep (team, manager)

How to tell them apart

If it answers “how much/how many/how long,” it is typically a fact (or a measure inside a fact).
If it answers “which one/what kind/where/when/who,” it is typically a dimension (or an attribute inside a dimension).

Measures vs. attributes

Measure

A measure is a numeric field intended for aggregation (sum, average, min/max, etc.). Measures usually live in fact tables.

Examples: sales_amount, quantity, discount_amount, shipping_cost.

Attribute

An attribute is a descriptive field used for grouping/filtering. Attributes usually live in dimension tables, but sometimes appear in facts as degenerate dimensions (e.g., order number).

Examples: product_category, customer_segment, region_name, order_status.

Common confusion: “numeric attribute” vs “measure”

Not every numeric field is a measure. A numeric code (e.g., store_id, postal_code) is an attribute/key, not a measure. Aggregating it (SUM of IDs) is meaningless.

Grain: what a row represents

Grain (also called granularity) is the precise definition of what one row in a table represents. A correct grain statement prevents double counting and broken joins.

Examples of grain statements

Order-line fact: “One row per order_id + line_number.”
Daily sales snapshot: “One row per store_id + date.”
Customer dimension: “One row per customer_id (current view).”
Customer history (SCD Type 2): “One row per customer_id + effective_start_date.”

Why grain drives everything

It determines the primary key (unique identifier) of the table.
It determines which joins are safe (1-to-many vs many-to-many).
It determines which measures can be summed without distortion.

Guided example: Sales at order-line grain

Step 1: Define the grain

Grain statement: “One row per order line.” Practically: one row per order_id + line_number.

Step 2: Identify keys and measures

Fact table: fact_order_line

Primary key (composite): (order_id, line_number)
Foreign keys to dimensions: order_date_key, product_key, customer_key, region_key
Measures: quantity, net_sales_amount, discount_amount

Step 3: Add dimensions for context

Dimension tables:

dim_date(date_key, date, month, quarter, year, fiscal_period, ...)
dim_product(product_key, sku, brand, category, ...)
dim_customer(customer_key, customer_id, segment, ...)
dim_region(region_key, country, state, territory, ...)

Step 4: Visualize the star join paths

fact_order_line --(order_date_key)--> dim_date
fact_order_line --(product_key)-----> dim_product
fact_order_line --(customer_key)----> dim_customer
fact_order_line --(region_key)------> dim_region

Each dimension join should be many fact rows to one dimension row (many-to-one). If a join becomes many-to-many, totals can inflate.

How aggregation works (and where it breaks)

Basic aggregation: SUM

If your grain is order-line, then summing line-level measures is usually valid:

-- Total net sales by month and category
SELECT
  d.year,
  d.month,
  p.category,
  SUM(f.net_sales_amount) AS total_net_sales
FROM fact_order_line f
JOIN dim_date d     ON f.order_date_key = d.date_key
JOIN dim_product p  ON f.product_key = p.product_key
GROUP BY d.year, d.month, p.category;

This works because net_sales_amount is additive across order lines, dates, products, and customers.

Counting: COUNT vs COUNT DISTINCT

At order-line grain:

COUNT(*) counts lines (how many line items).
COUNT(DISTINCT order_id) counts orders (how many orders).

-- Orders vs lines by region
SELECT
  r.territory,
  COUNT(*) AS line_count,
  COUNT(DISTINCT f.order_id) AS order_count
FROM fact_order_line f
JOIN dim_region r ON f.region_key = r.region_key
GROUP BY r.territory;

Practical tip: If a business question is “number of orders,” using COUNT(*) on an order-line fact will overcount whenever orders have multiple lines.

Where aggregation fails: double counting from join duplication

A classic failure happens when you join a fact table to a table that has multiple matching rows per key (many-to-many or one-to-many in the wrong direction).

Example problem: You join fact_order_line to a customer_contacts table that has multiple contacts per customer. Each order line is duplicated for each contact, inflating sums.

-- Dangerous join: duplicates fact rows
SELECT
  c.customer_id,
  SUM(f.net_sales_amount) AS net_sales
FROM fact_order_line f
JOIN dim_customer c       ON f.customer_key = c.customer_key
JOIN customer_contacts cc ON c.customer_id = cc.customer_id
GROUP BY c.customer_id;

Fix patterns:

Aggregate contacts to one row per customer before joining.
Join to a dimension that is guaranteed one row per key.
Use bridge tables with explicit weighting only when required and well understood.

Where aggregation fails: mixed grains in the same query

Mixed grain means combining measures recorded at different row meanings without aligning them.

Example:

fact_order_line: one row per order line (transactional)
fact_daily_inventory: one row per product + store + day (snapshot)

If you join them directly on product and date, each inventory row may match many order lines, duplicating inventory values (or duplicating sales, depending on join direction).

Safer approach: Aggregate each fact to a common grain first, then join the aggregated results.

-- Align grains by aggregating sales to product-day first
WITH sales_by_product_day AS (
  SELECT
    f.product_key,
    f.order_date_key,
    SUM(f.quantity) AS qty_sold,
    SUM(f.net_sales_amount) AS net_sales
  FROM fact_order_line f
  GROUP BY f.product_key, f.order_date_key
)
SELECT
  d.date,
  p.sku,
  s.qty_sold,
  i.on_hand_qty
FROM sales_by_product_day s
JOIN fact_daily_inventory i
  ON s.product_key = i.product_key
 AND s.order_date_key = i.date_key
JOIN dim_date d    ON s.order_date_key = d.date_key
JOIN dim_product p ON s.product_key = p.product_key;

Additive, semi-additive, and non-additive measures

Measure type	Definition	Examples	Typical safe aggregations
Additive	Can be summed across all relevant dimensions (including time)	Net sales, quantity, tax amount	SUM across product, customer, region, date
Semi-additive	Can be summed across some dimensions but not time	Account balance, inventory on hand (snapshot)	SUM across products/stores at a point in time; use LAST/AVG over time
Non-additive	Cannot be summed meaningfully	Ratios (margin %), unit price, conversion rate	Recompute from additive components (e.g., SUM(margin)/SUM(sales))

Step-by-step: handling a non-additive measure (margin %)

Suppose you store margin_amount and net_sales_amount at order-line grain. To report margin % by category:

Do not average line-level margin % (it weights each line equally, not by sales).
Compute margin % from aggregated totals.

SELECT
  p.category,
  SUM(f.margin_amount) / NULLIF(SUM(f.net_sales_amount), 0) AS margin_pct
FROM fact_order_line f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;

Mini-checklist: validate a dataset before building reports

1) Grain statement

Write a one-sentence grain statement for every table you will use.
Confirm the grain matches the business question (orders vs order lines vs customers vs daily snapshots).

2) Primary keys and uniqueness

Identify the primary key (or composite key) that should uniquely identify each row.
Test uniqueness (no duplicates). If duplicates exist, decide whether they are valid (e.g., history tables) or a data quality issue.

3) Join paths and cardinality

For each join, state expected cardinality: many-to-one (preferred for fact-to-dimension), one-to-one, or many-to-many (risk).
Check whether joins duplicate fact rows (a quick test is comparing SUM(net_sales_amount) before and after joins).
Ensure there is a single, unambiguous path from facts to dimensions (avoid accidental fan-out via multiple routes).

4) Measure additivity rules

Label each measure as additive, semi-additive, or non-additive.
Define the correct aggregation function per measure (SUM, MAX, LAST, AVG, recompute ratio).
For counts, specify the entity being counted (lines, orders, customers) and whether COUNT DISTINCT is required.

5) Mixed-grain detection

If a query uses more than one fact-like table, confirm they share a common grain or are aggregated to a common grain before joining.
Watch for snapshot vs transaction combinations (inventory/balances with sales/events).

Now answer the exercise about the content:

You need to report the number of orders by territory using a fact table at order-line grain (one row per order_id + line_number). Which approach correctly counts orders without overcounting multi-line orders?

You are right! Congratulations, now go to the next page

You missed! Try again.

At order-line grain, COUNT(*) counts line items, not orders. To count orders, use COUNT(DISTINCT order_id) so each order is counted once even when it has multiple lines.