All courses > Technology and Programming > Data Science and Business Intelligence ::

Choosing the Grain: The Foundation of Reliable Metrics in Fact Tables

Capítulo 3

Estimated reading time: 8 minutes

Why grain is the most important decision in a fact table

The grain of a fact table is the exact level of detail represented by one row in that table. It is not a vague idea like “sales data”; it is a precise statement such as “one row per order line” or “one row per product per day.” Grain is the foundation of reliable metrics because every aggregation (counts, sums, averages, distinct counts) assumes a consistent meaning for each row.

If you choose the wrong grain, or mix multiple grains in one table, you will see symptoms like: totals that change depending on which report you run, counts that look inflated, and distinct counts that disagree across dashboards. These issues are not “report bugs”; they are usually grain problems.

How grain affects common calculations

Counts: COUNT(*) counts rows. If your grain is order line, COUNT(*) is “number of order lines,” not “number of orders.” If your grain is order header, COUNT(*) is “number of orders.”
Sums: SUM(Revenue) is safe only if Revenue is stored at the same grain as the row. If you store order-level revenue on every order line, SUM(Revenue) will multiply revenue by the number of lines per order.
Distinct counts: COUNT(DISTINCT OrderID) can recover “number of orders” from an order-line fact, but it is more expensive and can still be wrong if OrderID is missing, duplicated, or not aligned with the grain.

Sales example: three valid grains and what they mean

Consider a sales process with orders, order lines, products, customers, and dates. The same business process can be modeled at different grains depending on the questions you need to answer.

Grain A: Order header (one row per order)

Grain statement: one row per OrderID (order header).

Best for: order counts, order-level revenue, shipping fees, order-level discounts, payment method analysis.
Not good for: product-level analysis (because an order can contain multiple products).

Grain B: Order line (one row per order line item)

Grain statement: one row per OrderID + LineNumber (or OrderLineID).

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

Best for: product mix, units sold by product, revenue by product/category, basket analysis (with care).
Not good for: measures that exist only once per order (shipping fee, order-level discount) unless modeled correctly.

Grain C: Daily product totals (one row per Date + Product)

Grain statement: one row per Date + Product (daily aggregate).

Best for: fast reporting of daily trends by product, simplified dashboards, reduced storage.
Not good for: customer-level analysis, order counts, order-level attributes, or any question requiring transaction detail.

What “mismatched grain” looks like in reports

Mismatched grain happens when a measure is stored at a different level than the row, or when a table mixes rows of different levels. The result is inconsistent metrics across reports because different queries aggregate differently.

Example: order-level shipping fee stored on order lines

Suppose each order has a $10 shipping fee, and you store ShippingFee = 10 on every order line. An order with 3 lines now has ShippingFee repeated 3 times.

Correct business meaning: total shipping fees = $10 per order.
What SUM(ShippingFee) does on an order-line fact: $10 multiplied by number of lines per order.

This creates a classic inconsistency: a report grouped by OrderID might look fine if it uses MAX(ShippingFee) or shows one row per order after aggregation, but a report grouped by Product or Category will overstate shipping fees because it sums repeated values.

Step-by-step method to choose and validate grain

Step 1: Write a grain statement (one sentence)

Write a sentence that completes: “One row in this fact table represents …” and make it testable.

Order header: “One row represents one order (OrderID).”
Order line: “One row represents one order line (OrderID + LineNumber).”
Daily product totals: “One row represents one product on one calendar day (DateKey + ProductKey).”

If you cannot write a single sentence without using “sometimes” or “depending,” the grain is not defined yet.

Step 2: List the required dimension keys at that grain

At the chosen grain, each row must have the dimension keys needed to slice the measures correctly. Think of these as the minimum foreign keys required for the row to be unambiguous.

Order header grain: required keys

DateKey (order date)
CustomerKey
OrderStatusKey (or a small status dimension)
ChannelKey (web, store, etc.)
Optional: ShipToLocationKey, BillToLocationKey

Order line grain: required keys

DateKey (order date, or line ship date depending on definition)
CustomerKey
ProductKey
Optional: PromotionKey, SalesRepKey, ChannelKey

Daily product totals grain: required keys

DateKey
ProductKey
Optional: StoreKey or ChannelKey if totals are not global

Rule of thumb: if a dimension attribute can vary within what you call “one row,” then that dimension key must be part of the grain (or it belongs in a different fact table).

Step 3: List measures that are valid at that grain

Measures must match the grain. Classify them by how they aggregate.

Additive: can be summed across all dimensions (e.g., SalesAmount, Quantity).
Semi-additive: can be summed across some dimensions but not others (commonly time). Example: account balance is additive across accounts but not across time. In sales, an example could be “open order amount” if it represents a point-in-time snapshot.
Non-additive: cannot be summed meaningfully; often ratios or distinct counts (e.g., GrossMarginPct, AverageOrderValue, DistinctCustomers).

Order header grain: valid measures

Additive: OrderRevenue, ShippingFee (once per order), TaxAmount (if stored once per order)
Non-additive: DistinctCustomers (computed), AverageOrderValue (computed)

Order line grain: valid measures

Additive: LineSalesAmount, LineDiscountAmount, Quantity
Non-additive: DistinctOrders (computed as COUNT DISTINCT OrderID), UnitPrice (better computed as LineSalesAmount / Quantity)
Potentially problematic: ShippingFee, OrderLevelDiscount (these are not line-grain measures unless allocated)

Daily product totals grain: valid measures

Additive: DailyUnitsSold, DailySalesAmount
Non-additive: DistinctOrders (not available unless separately tracked), AverageSellingPrice (computed)

Important: “Non-additive” does not mean “not useful.” It means “do not store it as a simple number and SUM it.” Store the components (numerator/denominator) or compute it in the semantic layer/query.

Step 4: Test the grain with sample queries (correct vs incorrect)

Use small, targeted queries to prove that the table behaves as intended. The goal is to catch double-counting and invalid aggregations early.

Assume an order-line fact table named FactSalesLine with columns: OrderID, LineNumber, DateKey, CustomerKey, ProductKey, Quantity, LineSalesAmount. Also assume a date dimension DimDate with DateKey and CalendarMonth.

Test 1: Row count should match the grain

If the grain is “one row per order line,” then (OrderID, LineNumber) should be unique.

-- Should return 0 rows if grain is truly order line unique per (OrderID, LineNumber)  SELECT OrderID, LineNumber, COUNT(*) AS RowCount FROM FactSalesLine GROUP BY OrderID, LineNumber HAVING COUNT(*) > 1;

Test 2: Correct order count from an order-line fact

COUNT(*) is not “orders” at line grain. Use COUNT DISTINCT for orders.

-- Correct: number of orders by month  SELECT d.CalendarMonth, COUNT(DISTINCT f.OrderID) AS Orders FROM FactSalesLine f JOIN DimDate d ON d.DateKey = f.DateKey GROUP BY d.CalendarMonth ORDER BY d.CalendarMonth;

-- Incorrect: counts lines, not orders  SELECT d.CalendarMonth, COUNT(*) AS OrdersWrong FROM FactSalesLine f JOIN DimDate d ON d.DateKey = f.DateKey GROUP BY d.CalendarMonth ORDER BY d.CalendarMonth;

Test 3: Correct revenue by product at line grain

LineSalesAmount is additive at line grain, so summing by product is valid.

SELECT f.ProductKey, SUM(f.LineSalesAmount) AS Revenue FROM FactSalesLine f GROUP BY f.ProductKey ORDER BY Revenue DESC;

Test 4: Demonstrate double-counting from mismatched grain (shipping fee example)

Assume someone added ShippingFee to FactSalesLine and populated it with the full order shipping fee on every line. This is mismatched grain.

-- Incorrect: shipping fee is repeated per line, so this overstates totals  SELECT d.CalendarMonth, SUM(f.ShippingFee) AS ShippingWrong FROM FactSalesLine f JOIN DimDate d ON d.DateKey = f.DateKey GROUP BY d.CalendarMonth;

-- Less wrong but still a warning sign: using MAX per order to avoid duplication  -- (This suggests ShippingFee belongs in an order-header fact, or must be allocated.)  SELECT d.CalendarMonth, SUM(x.ShippingFeePerOrder) AS ShippingBetter FROM (   SELECT OrderID, MAX(ShippingFee) AS ShippingFeePerOrder, MIN(DateKey) AS DateKey   FROM FactSalesLine   GROUP BY OrderID ) x JOIN DimDate d ON d.DateKey = x.DateKey GROUP BY d.CalendarMonth;

If you need shipping analysis by product, you must decide on a business rule (e.g., allocate shipping by line revenue or by quantity) and store an allocated line-level shipping measure, or keep shipping only at order grain and avoid slicing it by product.

Test 5: Daily product totals vs line grain (why aggregates can’t answer detail questions)

Assume a daily aggregate fact FactDailyProductSales with columns: DateKey, ProductKey, DailyUnitsSold, DailySalesAmount.

-- Valid: daily trend by product  SELECT d.CalendarMonth, f.ProductKey, SUM(f.DailySalesAmount) AS Revenue FROM FactDailyProductSales f JOIN DimDate d ON d.DateKey = f.DateKey GROUP BY d.CalendarMonth, f.ProductKey;

-- Not possible from this grain: number of orders (OrderID doesn't exist)  -- Any attempt to estimate orders from daily totals will be a different metric, not the same one.

Practical checklist for reliable metrics

Define grain in one sentence and make it testable with uniqueness checks.
Ensure every row has the dimension keys required to uniquely describe that grain.
Store only measures that naturally occur at that grain; move others to a separate fact table at the correct grain or allocate them with an explicit rule.
Validate with queries that compare COUNT(*) vs COUNT DISTINCT, and that test for double-counting when slicing by different dimensions.
If two reports disagree, first ask: “Are they aggregating the same grain?” before changing calculations.

Now answer the exercise about the content:

In an order-line fact table (one row per OrderID + LineNumber), which approach correctly returns the number of orders by month?

You are right! Congratulations, now go to the next page

You missed! Try again.

At order-line grain, each row is an order line, so COUNT(*) counts lines, not orders. To count orders, use COUNT(DISTINCT OrderID). Measures like shipping can be double-counted if repeated on each line.