Star Schema Modeling in Practice
A star schema is a BI-friendly way to organize analytical data so that queries are predictable, fast, and easy to reason about. It centers on a fact table (events you measure) surrounded by dimension tables (descriptive context used to slice and filter). The “star” shape comes from many dimensions joining into one central fact.
Core components and how they behave
- Fact table: Contains numeric measures (e.g., revenue, quantity) and foreign keys to dimensions. Facts are typically large and append-heavy.
- Dimension table: Contains attributes used for grouping/filtering (e.g., product category, customer segment). Dimensions are typically smaller and change more slowly.
- Relationships: In a classic star, each fact row joins to exactly one row in each relevant dimension (many-to-one from fact to dimension).
Example: Sales star schema
Imagine an order line fact table at the grain “one row per order line item.”
fact_sales_order_line
- order_line_id (degenerate or operational id)
- order_date_key
- ship_date_key
- customer_key
- product_key
- store_key
- promotion_key
- quantity
- gross_amount
- discount_amount
- net_amountDimensions provide the descriptive context:
dim_date
- date_key (surrogate)
- calendar_date
- year, quarter, month, week, day_of_week, is_holiday
dim_customer
- customer_key (surrogate)
- customer_id (natural/business key)
- name
- segment
- region
- effective_start_date, effective_end_date, is_current (if tracking history)
dim_product
- product_key (surrogate)
- sku (natural/business key)
- brand
- category
- subcategory
- size
dim_store
- store_key (surrogate)
- store_id
- store_type
- city
- state
dim_promotion
- promotion_key (surrogate)
- promotion_code
- channel
- discount_typeTwo practical notes:
- Role-playing dimensions: The same dimension can be used multiple times with different meaning, e.g.,
order_date_keyandship_date_keyboth join todim_date. - Degenerate dimensions: Some identifiers (like
order_id) may live in the fact table without a separate dimension when they are used mainly for drill-through rather than filtering/grouping.
Conformed Dimensions for Cross-Domain Reporting
Conformed dimensions are shared dimensions that mean the same thing across multiple fact tables (and often across multiple business domains). They are the key to answering questions that span processes without “apples vs oranges” definitions.
- Listen to the audio with the screen off.
- Earn a certificate upon completion.
- Over 5000 courses for you to explore!
Download the app
What “conformed” means in practice
- Same grain and keys: The dimension represents the same entity at the same level (e.g., one row per customer version).
- Consistent attribute definitions: “Region,” “Segment,” “Active customer,” etc. are defined once and reused.
- Reusable surrogate keys: Facts from different domains reference the same surrogate key space (or a reliably mapped key space) so joins align.
Example: Sales + Support with a conformed customer dimension
Suppose you have two fact tables:
fact_sales_order_line (customer_key, product_key, order_date_key, net_amount, ...)
fact_support_ticket (customer_key, ticket_date_key, resolution_time_minutes, ...)If both facts join to the same dim_customer, you can build metrics like “Net revenue by customer segment and average resolution time” with consistent segment logic.
Common pitfalls and how to avoid them
- Different customer definitions: Sales may treat “customer” as billing account; support may treat “customer” as end user. If they differ, create separate dimensions (e.g.,
dim_accountanddim_end_user) and be explicit in reporting. - Partial conformance: Sometimes only a subset of attributes can be shared. Keep the shared core in the conformed dimension and put domain-specific attributes in “outrigger” dimensions or separate domain dimensions to avoid bloating.
- Inconsistent hierarchies: If product categories differ by channel, decide whether to standardize a single hierarchy or maintain multiple hierarchies with clear naming (e.g.,
category_retailvscategory_wholesale).
Surrogate Keys and Slowly Changing Dimensions (SCD)
In BI, you often need to preserve history: customers move regions, products change categories, stores change ownership. Surrogate keys and SCD patterns help you keep facts stable while dimensions evolve.
Surrogate keys: why they matter
A surrogate key is a warehouse-generated identifier (usually an integer) used as the primary key of a dimension and referenced by facts. It decouples analytics from operational key changes and enables historical versioning.
| Key type | Example | Pros | Risks |
|---|---|---|---|
| Natural/business key | customer_id, sku | Meaningful, stable in some systems | Can change, can be reused, may not be unique across sources |
| Surrogate key | customer_key, product_key | Stable in warehouse, supports SCD versioning, faster joins | Requires mapping logic and careful ETL/ELT |
SCD Type 1 vs Type 2 (most common in BI)
SCD Type 1 overwrites attributes (no history). Use it when historical reporting should reflect the latest value (e.g., fixing misspellings).
SCD Type 2 creates a new dimension row when tracked attributes change, preserving history. Facts link to the dimension version that was current at the time of the event.
SCD Type 2 structure and example
dim_customer (SCD2)
- customer_key (surrogate PK)
- customer_id (natural key)
- segment
- region
- effective_start_date
- effective_end_date
- is_currentExample: Customer C123 changes region from West to East on 2025-03-01.
| customer_key | customer_id | region | effective_start_date | effective_end_date | is_current |
|---|---|---|---|---|---|
| 101 | C123 | West | 2024-01-01 | 2025-02-28 | false |
| 245 | C123 | East | 2025-03-01 | 9999-12-31 | true |
When loading facts, you look up the correct customer_key based on customer_id and the fact’s event date (e.g., order date). This is what makes “revenue by region at time of sale” accurate.
Practical SCD decisions
- Choose tracked attributes intentionally: Not every column needs Type 2. Track attributes that materially affect analysis (segment, region, account owner), not volatile fields (last_login_time).
- Late-arriving dimensions: If a fact arrives before the dimension row exists, insert an “unknown” or “late arriving” placeholder dimension row and backfill later.
- Current vs historical reporting: Many BI models expose both “current” attributes and “as-of” attributes. For example, keep
dim_customer_current(Type 1) for operational-style rollups anddim_customer(Type 2) for historical truth—if your use cases demand both.
Join Types, Filter Propagation, and Dashboard Numbers
BI dashboards often look “wrong” not because measures are incorrect, but because joins and filters change which rows are included. Understanding join behavior is essential for trustworthy results.
Join types and what they imply
- Inner join: Keeps only fact rows that match a dimension row. Useful when you want to exclude unmapped/invalid keys, but risky if dimension loading lags (you may silently drop facts).
- Left join (fact → dimension): Keeps all fact rows, even if the dimension is missing. This is typically safer for BI because it avoids losing transactions; unmatched rows can map to an “Unknown” dimension member.
- Full outer join: Rare in star schemas for reporting; more common in data reconciliation.
Filter propagation (how slicers affect measures)
Most BI tools propagate filters along relationships. In a star schema, filters typically flow from dimensions to facts (e.g., selecting dim_product.category = 'Bikes' filters the fact rows to those with matching product_key).
Key behaviors to watch:
- One-to-many direction: Dimension (one) to fact (many) is the standard. If your tool allows bidirectional filters, use them cautiously; they can create ambiguous paths and unexpected exclusions.
- Multiple fact tables sharing a dimension: Filtering a shared dimension affects all connected facts. This is desirable for conformed dimensions, but can surprise users if a slicer intended for sales also filters support tickets.
- Many-to-many relationships: Often arise with bridge tables (e.g., customers belonging to multiple segments). If not modeled carefully, totals can duplicate. Prefer explicit bridge tables with controlled aggregation logic.
Common dashboard issues caused by joins
| Symptom | Likely cause | Fix |
|---|---|---|
| Totals are too high | Fan-out join (fact joins to multiple dimension rows), or many-to-many without a bridge strategy | Ensure dimension uniqueness at join key; use bridge table; aggregate fact before joining when appropriate |
| Totals are too low | Inner join dropping unmatched facts; filters applied through unintended path | Use left join; add “Unknown” members; review relationship directions and active paths |
| Numbers change when adding a dimension to a table visual | Non-additive measure or grain mismatch; join introduces duplication | Confirm fact grain; use distinct counts carefully; validate join cardinality |
| Some slicers appear to do nothing | Dimension not connected, inactive relationship, or key mismatch | Validate keys, relationship activation, and data types; test with simple queries |
Step-by-Step Modeling Exercise Outline
Use this outline as a repeatable workflow when designing a BI model for a new subject area (or refactoring an existing one). The goal is to produce a star schema that answers business questions with stable, explainable numbers.
1) Define business questions (and the decisions they support)
Write 5–10 questions in plain language and identify the primary metric and slicing dimensions for each.
- “What is net revenue by product category and month?”
- “Which customer segments have the highest return rate?”
- “How does promotion channel impact average order value?”
Practical tip: For each question, note the required time perspective: “as-of event date” (historical) vs “current attributes.” This drives SCD choices.
2) Set the grain explicitly (before listing columns)
Choose the atomic level of the fact table. Write it as: “One row per …”
- Good: “One row per order line item.”
- Good: “One row per support ticket.”
- Risky: “One row per customer per day” (this is an aggregate fact; valid, but you must define how it’s computed and refreshed).
Practical check: If you cannot describe how a single source record maps to one fact row, the grain is not yet clear.
3) Identify dimensions (the nouns) and their hierarchies
List the entities used to filter/group results and the attributes users expect to see.
- Date: year → quarter → month → day
- Product: category → subcategory → sku
- Customer: segment → region → customer
- Channel/Store: channel → store
Decide which dimensions must be conformed across domains (e.g., customer, product, date) and document shared definitions.
4) Design the fact table (measures, keys, and additive behavior)
For the chosen grain, define:
- Foreign keys to each dimension (surrogate keys preferred).
- Measures and whether they are additive across dimensions (sum), semi-additive (sum across some dimensions but not time), or non-additive (ratios).
- Derived measures policy: store base measures (e.g., gross, discount) and compute ratios (e.g., margin %) in the semantic layer to avoid inconsistencies.
Example measures for order lines:
- Additive:
quantity,net_amount - Non-additive:
unit_price(should be computed asnet_amount / quantityat query time)
5) Choose SCD strategy and keying rules
For each dimension, decide:
- Type 1 vs Type 2 (or hybrid).
- Which attributes trigger a Type 2 row.
- How facts will look up the correct dimension version (usually by natural key + event date).
Write the lookup rule explicitly, for example:
Find dim_customer.customer_key
where dim_customer.customer_id = src.customer_id
and src.order_date between effective_start_date and effective_end_date6) Validate joins and cardinality (prevent fan-out)
Before building dashboards, prove the relationships are what you think they are.
- Each dimension join key should be unique in the dimension (one row per surrogate key).
- Fact foreign keys should match dimension surrogate keys (data type and domain).
- Watch for accidental duplicates in dimensions (often caused by bad SCD logic or missing natural key constraints).
Practical SQL checks:
-- Dimension uniqueness
select customer_key, count(*)
from dim_customer
group by customer_key
having count(*) > 1;
-- Natural key overlap (SCD2 should not have overlapping effective ranges)
select customer_id
from dim_customer
group by customer_id
having sum(case when is_current then 1 else 0 end) > 1;7) Test sample queries to confirm numbers match expectations
Create a small test plan that compares warehouse results to a trusted source (finance report, operational extract, or reconciled totals). Start with simple totals, then add slices.
- Test A: Base total (no joins): sum net revenue for a date range directly from the fact.
- Test B: Join to date: group by month and confirm the monthly totals sum to the base total.
- Test C: Join to product: group by category; confirm totals match and no category causes duplication.
- Test D: SCD behavior: pick a customer known to have changed region; confirm historical orders roll up to the correct region at the time.
Example query pattern (SQL):
-- A) Base total
select sum(net_amount) as net_revenue
from fact_sales_order_line
where order_date_key between 20250101 and 20250131;
-- B) Monthly rollup via date dimension
select d.year, d.month, sum(f.net_amount) as net_revenue
from fact_sales_order_line f
join dim_date d on f.order_date_key = d.date_key
group by d.year, d.month
order by d.year, d.month;
-- C) Category rollup via product dimension
select p.category, sum(f.net_amount) as net_revenue
from fact_sales_order_line f
join dim_product p on f.product_key = p.product_key
group by p.category
order by net_revenue desc;If any rollup total differs from the base total, investigate join cardinality, missing dimension rows (inner join vs left join), or grain mismatch (e.g., joining an order-line fact to an order-header dimension without a proper key).