All courses > Technology and Programming > Data Science and Business Intelligence ::

Data Modeling for BI: Star Schemas, Conformed Dimensions, and Join Behavior

Capítulo 7

Estimated reading time: 11 minutes

Star Schema Modeling in Practice

A star schema is a BI-friendly way to organize analytical data so that queries are predictable, fast, and easy to reason about. It centers on a fact table (events you measure) surrounded by dimension tables (descriptive context used to slice and filter). The “star” shape comes from many dimensions joining into one central fact.

Core components and how they behave

Fact table: Contains numeric measures (e.g., revenue, quantity) and foreign keys to dimensions. Facts are typically large and append-heavy.
Dimension table: Contains attributes used for grouping/filtering (e.g., product category, customer segment). Dimensions are typically smaller and change more slowly.
Relationships: In a classic star, each fact row joins to exactly one row in each relevant dimension (many-to-one from fact to dimension).

Example: Sales star schema

Imagine an order line fact table at the grain “one row per order line item.”

fact_sales_order_line
- order_line_id (degenerate or operational id)
- order_date_key
- ship_date_key
- customer_key
- product_key
- store_key
- promotion_key
- quantity
- gross_amount
- discount_amount
- net_amount

Dimensions provide the descriptive context:

dim_date
- date_key (surrogate)
- calendar_date
- year, quarter, month, week, day_of_week, is_holiday

dim_customer
- customer_key (surrogate)
- customer_id (natural/business key)
- name
- segment
- region
- effective_start_date, effective_end_date, is_current (if tracking history)

dim_product
- product_key (surrogate)
- sku (natural/business key)
- brand
- category
- subcategory
- size

dim_store
- store_key (surrogate)
- store_id
- store_type
- city
- state

dim_promotion
- promotion_key (surrogate)
- promotion_code
- channel
- discount_type

Two practical notes:

Role-playing dimensions: The same dimension can be used multiple times with different meaning, e.g., order_date_key and ship_date_key both join to dim_date.
Degenerate dimensions: Some identifiers (like order_id) may live in the fact table without a separate dimension when they are used mainly for drill-through rather than filtering/grouping.

Conformed Dimensions for Cross-Domain Reporting

Conformed dimensions are shared dimensions that mean the same thing across multiple fact tables (and often across multiple business domains). They are the key to answering questions that span processes without “apples vs oranges” definitions.

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

What “conformed” means in practice

Same grain and keys: The dimension represents the same entity at the same level (e.g., one row per customer version).
Consistent attribute definitions: “Region,” “Segment,” “Active customer,” etc. are defined once and reused.
Reusable surrogate keys: Facts from different domains reference the same surrogate key space (or a reliably mapped key space) so joins align.

Example: Sales + Support with a conformed customer dimension

Suppose you have two fact tables:

fact_sales_order_line (customer_key, product_key, order_date_key, net_amount, ...)
fact_support_ticket (customer_key, ticket_date_key, resolution_time_minutes, ...)

If both facts join to the same dim_customer, you can build metrics like “Net revenue by customer segment and average resolution time” with consistent segment logic.

Common pitfalls and how to avoid them

Different customer definitions: Sales may treat “customer” as billing account; support may treat “customer” as end user. If they differ, create separate dimensions (e.g., dim_account and dim_end_user) and be explicit in reporting.
Partial conformance: Sometimes only a subset of attributes can be shared. Keep the shared core in the conformed dimension and put domain-specific attributes in “outrigger” dimensions or separate domain dimensions to avoid bloating.
Inconsistent hierarchies: If product categories differ by channel, decide whether to standardize a single hierarchy or maintain multiple hierarchies with clear naming (e.g., category_retail vs category_wholesale).

Surrogate Keys and Slowly Changing Dimensions (SCD)

In BI, you often need to preserve history: customers move regions, products change categories, stores change ownership. Surrogate keys and SCD patterns help you keep facts stable while dimensions evolve.

Surrogate keys: why they matter

A surrogate key is a warehouse-generated identifier (usually an integer) used as the primary key of a dimension and referenced by facts. It decouples analytics from operational key changes and enables historical versioning.

Key type	Example	Pros	Risks
Natural/business key	`customer_id`, `sku`	Meaningful, stable in some systems	Can change, can be reused, may not be unique across sources
Surrogate key	`customer_key`, `product_key`	Stable in warehouse, supports SCD versioning, faster joins	Requires mapping logic and careful ETL/ELT

SCD Type 1 vs Type 2 (most common in BI)

SCD Type 1 overwrites attributes (no history). Use it when historical reporting should reflect the latest value (e.g., fixing misspellings).

SCD Type 2 creates a new dimension row when tracked attributes change, preserving history. Facts link to the dimension version that was current at the time of the event.

SCD Type 2 structure and example

dim_customer (SCD2)
- customer_key (surrogate PK)
- customer_id (natural key)
- segment
- region
- effective_start_date
- effective_end_date
- is_current

Example: Customer C123 changes region from West to East on 2025-03-01.

customer_key	customer_id	region	effective_start_date	effective_end_date	is_current
101	C123	West	2024-01-01	2025-02-28	false
245	C123	East	2025-03-01	9999-12-31	true

When loading facts, you look up the correct customer_key based on customer_id and the fact’s event date (e.g., order date). This is what makes “revenue by region at time of sale” accurate.

Practical SCD decisions

Choose tracked attributes intentionally: Not every column needs Type 2. Track attributes that materially affect analysis (segment, region, account owner), not volatile fields (last_login_time).
Late-arriving dimensions: If a fact arrives before the dimension row exists, insert an “unknown” or “late arriving” placeholder dimension row and backfill later.
Current vs historical reporting: Many BI models expose both “current” attributes and “as-of” attributes. For example, keep dim_customer_current (Type 1) for operational-style rollups and dim_customer (Type 2) for historical truth—if your use cases demand both.

Join Types, Filter Propagation, and Dashboard Numbers

BI dashboards often look “wrong” not because measures are incorrect, but because joins and filters change which rows are included. Understanding join behavior is essential for trustworthy results.

Join types and what they imply

Inner join: Keeps only fact rows that match a dimension row. Useful when you want to exclude unmapped/invalid keys, but risky if dimension loading lags (you may silently drop facts).
Left join (fact → dimension): Keeps all fact rows, even if the dimension is missing. This is typically safer for BI because it avoids losing transactions; unmatched rows can map to an “Unknown” dimension member.
Full outer join: Rare in star schemas for reporting; more common in data reconciliation.

Filter propagation (how slicers affect measures)

Most BI tools propagate filters along relationships. In a star schema, filters typically flow from dimensions to facts (e.g., selecting dim_product.category = 'Bikes' filters the fact rows to those with matching product_key).

Key behaviors to watch:

One-to-many direction: Dimension (one) to fact (many) is the standard. If your tool allows bidirectional filters, use them cautiously; they can create ambiguous paths and unexpected exclusions.
Multiple fact tables sharing a dimension: Filtering a shared dimension affects all connected facts. This is desirable for conformed dimensions, but can surprise users if a slicer intended for sales also filters support tickets.
Many-to-many relationships: Often arise with bridge tables (e.g., customers belonging to multiple segments). If not modeled carefully, totals can duplicate. Prefer explicit bridge tables with controlled aggregation logic.

Common dashboard issues caused by joins

Symptom	Likely cause	Fix
Totals are too high	Fan-out join (fact joins to multiple dimension rows), or many-to-many without a bridge strategy	Ensure dimension uniqueness at join key; use bridge table; aggregate fact before joining when appropriate
Totals are too low	Inner join dropping unmatched facts; filters applied through unintended path	Use left join; add “Unknown” members; review relationship directions and active paths
Numbers change when adding a dimension to a table visual	Non-additive measure or grain mismatch; join introduces duplication	Confirm fact grain; use distinct counts carefully; validate join cardinality
Some slicers appear to do nothing	Dimension not connected, inactive relationship, or key mismatch	Validate keys, relationship activation, and data types; test with simple queries

Step-by-Step Modeling Exercise Outline

Use this outline as a repeatable workflow when designing a BI model for a new subject area (or refactoring an existing one). The goal is to produce a star schema that answers business questions with stable, explainable numbers.

1) Define business questions (and the decisions they support)

Write 5–10 questions in plain language and identify the primary metric and slicing dimensions for each.

“What is net revenue by product category and month?”
“Which customer segments have the highest return rate?”
“How does promotion channel impact average order value?”

Practical tip: For each question, note the required time perspective: “as-of event date” (historical) vs “current attributes.” This drives SCD choices.

2) Set the grain explicitly (before listing columns)

Choose the atomic level of the fact table. Write it as: “One row per …”

Good: “One row per order line item.”
Good: “One row per support ticket.”
Risky: “One row per customer per day” (this is an aggregate fact; valid, but you must define how it’s computed and refreshed).

Practical check: If you cannot describe how a single source record maps to one fact row, the grain is not yet clear.

3) Identify dimensions (the nouns) and their hierarchies

List the entities used to filter/group results and the attributes users expect to see.

Date: year → quarter → month → day
Product: category → subcategory → sku
Customer: segment → region → customer
Channel/Store: channel → store

Decide which dimensions must be conformed across domains (e.g., customer, product, date) and document shared definitions.

4) Design the fact table (measures, keys, and additive behavior)

For the chosen grain, define:

Foreign keys to each dimension (surrogate keys preferred).
Measures and whether they are additive across dimensions (sum), semi-additive (sum across some dimensions but not time), or non-additive (ratios).
Derived measures policy: store base measures (e.g., gross, discount) and compute ratios (e.g., margin %) in the semantic layer to avoid inconsistencies.

Example measures for order lines:

Additive: quantity, net_amount
Non-additive: unit_price (should be computed as net_amount / quantity at query time)

5) Choose SCD strategy and keying rules

For each dimension, decide:

Type 1 vs Type 2 (or hybrid).
Which attributes trigger a Type 2 row.
How facts will look up the correct dimension version (usually by natural key + event date).

Write the lookup rule explicitly, for example:

Find dim_customer.customer_key
where dim_customer.customer_id = src.customer_id
and src.order_date between effective_start_date and effective_end_date

6) Validate joins and cardinality (prevent fan-out)

Before building dashboards, prove the relationships are what you think they are.

Each dimension join key should be unique in the dimension (one row per surrogate key).
Fact foreign keys should match dimension surrogate keys (data type and domain).
Watch for accidental duplicates in dimensions (often caused by bad SCD logic or missing natural key constraints).

Practical SQL checks:

-- Dimension uniqueness
select customer_key, count(*)
from dim_customer
group by customer_key
having count(*) > 1;

-- Natural key overlap (SCD2 should not have overlapping effective ranges)
select customer_id
from dim_customer
group by customer_id
having sum(case when is_current then 1 else 0 end) > 1;

7) Test sample queries to confirm numbers match expectations

Create a small test plan that compares warehouse results to a trusted source (finance report, operational extract, or reconciled totals). Start with simple totals, then add slices.

Test A: Base total (no joins): sum net revenue for a date range directly from the fact.
Test B: Join to date: group by month and confirm the monthly totals sum to the base total.
Test C: Join to product: group by category; confirm totals match and no category causes duplication.
Test D: SCD behavior: pick a customer known to have changed region; confirm historical orders roll up to the correct region at the time.

Example query pattern (SQL):

-- A) Base total
select sum(net_amount) as net_revenue
from fact_sales_order_line
where order_date_key between 20250101 and 20250131;

-- B) Monthly rollup via date dimension
select d.year, d.month, sum(f.net_amount) as net_revenue
from fact_sales_order_line f
join dim_date d on f.order_date_key = d.date_key
group by d.year, d.month
order by d.year, d.month;

-- C) Category rollup via product dimension
select p.category, sum(f.net_amount) as net_revenue
from fact_sales_order_line f
join dim_product p on f.product_key = p.product_key
group by p.category
order by net_revenue desc;

If any rollup total differs from the base total, investigate join cardinality, missing dimension rows (inner join vs left join), or grain mismatch (e.g., joining an order-line fact to an order-header dimension without a proper key).

Now answer the exercise about the content:

In a star schema dashboard, why is a left join from the fact table to a dimension often safer than an inner join?

You are right! Congratulations, now go to the next page

You missed! Try again.

A left join (fact → dimension) keeps all fact rows even if the dimension row is missing, which helps prevent totals from being too low due to dropped facts. Unmatched rows can be assigned to an “Unknown” member.