What changes between a star and a snowflake
Both star and snowflake schemas are dimensional models used for BI, but they differ in how “wide” or “normalized” the dimension side is.
- Star schema: fact table in the center, with denormalized dimensions (each dimension is typically one table containing most descriptive attributes).
- Snowflake schema: fact table in the center, but one or more dimensions are normalized into multiple related tables (sub-dimensions / reference tables).
The practical impact is not about correctness (both can be correct), but about trade-offs in query simplicity, performance, storage, and maintenance—especially when dashboards are built by many users with varying SQL skills.
Trade-offs for BI: simplicity, performance, storage, maintenance
1) Query simplicity and semantic clarity
- Star: simplest mental model. BI users typically join the fact to each dimension once. Attributes are “where you expect them,” which improves self-service and reduces modeling mistakes.
- Snowflake: more tables and more joins. Users must know which table contains which attribute (e.g., CategoryName is not in Product anymore). This can reduce clarity unless the semantic layer hides it.
2) Performance
- Star: fewer joins, often faster in BI engines and SQL databases for typical slice-and-dice queries. Optimizers also have an easier time with fewer join paths.
- Snowflake: more joins can increase planning and execution cost. Performance can still be good with proper indexing, caching, and a strong semantic layer, but the baseline is usually more complex.
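One way to see the extra planning work is to compare plans for the two shapes. A hedged sketch (PostgreSQL-style EXPLAIN; the table names follow the Product example used later in this section):

```sql
-- Star: a single join path from fact to dimension.
EXPLAIN
SELECT p.CategoryName, SUM(f.SalesAmount)
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY p.CategoryName;

-- Snowflake: the optimizer must order and cost an extra hop.
EXPLAIN
SELECT c.CategoryName, SUM(f.SalesAmount)
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimCategory c ON p.CategoryKey = c.CategoryKey
GROUP BY c.CategoryName;
```

With each additional joined table the optimizer considers more join orders, so both plan quality and planning time tend to become more variable.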
3) Storage
- Star: denormalization repeats some text attributes (e.g., CategoryName repeated on many Product rows). This can increase storage, though columnar compression often reduces the penalty.
- Snowflake: reduces repetition by moving repeating attributes into separate tables (Category, Brand). Can reduce storage and improve data governance for shared reference data.
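To estimate the repetition a star pays for, you can profile the denormalized dimension directly. A sketch (column and table names assume the Product example later in this section):

```sql
-- How many product rows share each repeated text attribute?
SELECT
  COUNT(*)                     AS product_rows,
  COUNT(DISTINCT CategoryName) AS distinct_categories,
  COUNT(DISTINCT BrandName)    AS distinct_brands
FROM DimProduct;
-- If product_rows is much larger than the distinct counts, each text
-- attribute is stored many times in the star; columnar compression
-- usually shrinks this penalty considerably.
```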
4) Maintenance and change management
- Star: fewer objects to manage, simpler ETL/ELT pipelines, simpler access control. But updates to shared reference attributes may require updating many rows in a dimension (though in practice dimensions are reloaded or merged in batch).
- Snowflake: more objects, more pipelines, more keys and relationships to manage. However, shared reference tables can centralize updates (e.g., renaming a Brand once).
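The maintenance difference is easy to see in SQL. A hedged sketch of renaming a brand under each shape (table names and key values assume the Product example later in this section):

```sql
-- Snowflake: one row changes in the shared reference table.
UPDATE DimBrand
SET BrandName = 'Acme Foods'
WHERE BrandKey = 42;

-- Star: every product row carrying the old name must change,
-- typically done as part of a batch reload or merge.
UPDATE DimProduct
SET BrandName = 'Acme Foods'
WHERE BrandName = 'Acme';
```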
Example: snowflaking a Product dimension into Category and Brand
Assume a sales model with a fact table FactSales and a Product dimension. In a star schema, Product is a single table with product attributes, including category and brand descriptors.
Star version (denormalized Product dimension)
FactSales(SalesDateKey, ProductKey, StoreKey, Quantity, SalesAmount, ...)
DimProduct(ProductKey, ProductSKU, ProductName, CategoryName, CategoryGroup, BrandName, BrandOwner, PackageSize, ...)
A typical BI query needs one join to get category and brand:
SELECT p.CategoryName, p.BrandName, SUM(f.SalesAmount) AS SalesAmount
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
GROUP BY p.CategoryName, p.BrandName;
Snowflake version (Product snowflaked into Category and Brand)
Now Product keeps only product-level attributes and foreign keys to Category and Brand tables.
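As a sketch, the snowflaked tables could be declared with explicit foreign keys (a minimal DDL illustration; data types and column lists are assumptions):

```sql
CREATE TABLE DimCategory (
  CategoryKey   INT PRIMARY KEY,
  CategoryName  VARCHAR(100) NOT NULL,
  CategoryGroup VARCHAR(100)
);

CREATE TABLE DimBrand (
  BrandKey   INT PRIMARY KEY,
  BrandName  VARCHAR(100) NOT NULL,
  BrandOwner VARCHAR(100)
);

CREATE TABLE DimProduct (
  ProductKey  INT PRIMARY KEY,
  ProductSKU  VARCHAR(50) NOT NULL,
  ProductName VARCHAR(200) NOT NULL,
  CategoryKey INT REFERENCES DimCategory (CategoryKey),
  BrandKey    INT REFERENCES DimBrand (BrandKey),
  PackageSize VARCHAR(50)
);
```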
DimProduct(ProductKey, ProductSKU, ProductName, CategoryKey, BrandKey, PackageSize, ...)
DimCategory(CategoryKey, CategoryName, CategoryGroup, ...)
DimBrand(BrandKey, BrandName, BrandOwner, ...)
The same analysis now requires additional joins:
SELECT c.CategoryName, b.BrandName, SUM(f.SalesAmount) AS SalesAmount
FROM FactSales f
JOIN DimProduct p ON f.ProductKey = p.ProductKey
JOIN DimCategory c ON p.CategoryKey = c.CategoryKey
JOIN DimBrand b ON p.BrandKey = b.BrandKey
GROUP BY c.CategoryName, b.BrandName;
What this means for joins and BI users
- More join paths: users (or the semantic layer) must traverse Product → Category and Product → Brand.
- More places for errors: a missing or incorrect join can silently change results (e.g., grouping by the surrogate CategoryKey on Product instead of CategoryName from Category, or accidentally filtering on the wrong table).
- More dependency on a semantic layer: if you expect self-service, you often need a curated model that exposes CategoryName and BrandName as if they were on Product (even if physically snowflaked).
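One common way to hide the snowflake from self-service users is a flattening view (a sketch; the view name is an assumption, and some BI tools achieve the same effect through modeled relationships instead):

```sql
-- Presents Category and Brand attributes as if they lived on Product.
CREATE VIEW vw_DimProduct AS
SELECT
  p.ProductKey,
  p.ProductSKU,
  p.ProductName,
  p.PackageSize,
  c.CategoryName,
  c.CategoryGroup,
  b.BrandName,
  b.BrandOwner
FROM DimProduct p
JOIN DimCategory c ON p.CategoryKey = c.CategoryKey
JOIN DimBrand    b ON p.BrandKey    = b.BrandKey;
```

BI queries can then join FactSales to vw_DimProduct exactly as in the star version.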
Practical step-by-step: deciding whether to snowflake Product
Step 1: list the attributes and identify repetition
Write down Product attributes and mark which ones repeat across many products.
- Product-level: SKU, ProductName, Size, Color
- Repeating reference: BrandName, BrandOwner, CategoryName, CategoryGroup
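A quick profiling query can make the repetition concrete and check whether an attribute truly depends on the reference entity rather than the product (a sketch assuming the DimProduct columns above):

```sql
-- Attributes that vary once per brand, not once per product, are
-- candidates for a Brand reference table.
SELECT
  BrandName,
  COUNT(*)                   AS product_rows,
  COUNT(DISTINCT BrandOwner) AS owners_per_brand
FROM DimProduct
GROUP BY BrandName
ORDER BY product_rows DESC;
-- owners_per_brand should be 1 if BrandOwner depends only on the brand;
-- a high product_rows count means the attribute repeats across many products.
```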
Step 2: check whether reference data is shared across multiple dimensions or domains
Snowflaking is more defensible when the same reference entity is used in many places.
- Brand used by Product, Promotions, Supplier contracts, and Marketing spend allocations
- Category used by Product and Merchandising plans
Step 3: estimate BI usage patterns
- If most dashboards slice by Category and Brand, a star keeps queries simpler and often faster.
- If Category/Brand are mostly used for occasional filtering, snowflaking may be acceptable if the semantic layer hides complexity.
Step 4: evaluate performance and cost with a representative workload
Use a small set of “top queries” (the ones that power dashboards) and compare:
- Number of joins
- Execution time and variability
- Cache hit rates (if applicable)
- Concurrency behavior
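If your warehouse exposes a query log, the comparison above can be scripted rather than eyeballed. A sketch (the query_log table and its columns are assumptions; adapt to your platform's system views, and note that PERCENTILE_CONT syntax varies by platform):

```sql
-- Median and 95th percentile runtime per dashboard query, per schema variant.
SELECT
  schema_variant,  -- e.g., a 'star' or 'snowflake' tag recorded with each run
  query_name,
  PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY runtime_ms) AS median_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY runtime_ms) AS p95_ms
FROM query_log
GROUP BY schema_variant, query_name;
```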
Step 5: decide on physical vs semantic snowflaking
You can sometimes keep a physical snowflake for governance while presenting a star-like semantic model to users.
- Physical snowflake + semantic flattening: BI tool exposes CategoryName and BrandName as Product attributes via relationships.
- Physical star + managed reference tables: keep Brand and Category as separate “master data” sources, but load their attributes into DimProduct for BI simplicity.
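The “physical star + managed reference tables” option usually means a load step that copies governed attributes into DimProduct. A hedged MERGE sketch (it assumes DimProduct retains BrandKey alongside the denormalized attributes; MERGE syntax varies by platform):

```sql
-- Refresh denormalized Brand attributes on DimProduct
-- from the governed reference table.
MERGE INTO DimProduct p
USING DimBrand b
  ON p.BrandKey = b.BrandKey
WHEN MATCHED THEN UPDATE SET
  p.BrandName  = b.BrandName,
  p.BrandOwner = b.BrandOwner;
```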
Decision criteria: when star is preferable vs when limited snowflaking helps
Prefer a star when you optimize for ease, speed, and semantic clarity
- High self-service usage: many analysts build their own visuals and ad-hoc queries.
- Dashboard-heavy environment: lots of repeated queries where fewer joins reduce latency.
- Need for visible, consistent definitions: you want “Category” and “Brand” to be obvious in the model and hard to misuse.
- BI tool relationship limitations: some tools handle many-to-one chains less transparently, or users struggle with multi-hop relationships.
Use limited snowflaking when it reduces duplication and improves governance without harming usability
- Shared reference data: Brand and Category are used across multiple subject areas and must be centrally governed.
- Large repeating attributes: long text descriptors, multilingual labels, or many columns that repeat across many products.
- Frequent reference updates: Brand ownership changes, category reorganizations, or compliance attributes that must be updated consistently.
- Strong semantic layer and curated datasets: most users consume a published model where joins are pre-defined and hidden.
A common compromise is to snowflake only for true reference entities and keep the rest of the dimensions denormalized for BI friendliness.
Mini case study: two teams, two schemas, measured impact
Scenario setup
A retail company wants new dashboards for Sales Performance. Two teams work in parallel for two weeks using the same source extracts and the same BI tool.
- Team Star: builds a classic star. DimProduct includes CategoryName and BrandName directly.
- Team Snowflake: snowflakes DimProduct into DimCategory and DimBrand to reduce repetition and align with master data management.
Both teams must deliver:
- Dashboard A: Sales by Category and Brand over time
- Dashboard B: Top products within Category, filtered by Brand
- Dashboard C: Category mix by Store
How they measured “speed” and “consistency”
They agreed on a simple measurement plan.
- Speed metrics: median query time and 95th percentile query time for the dashboard visuals (captured from the BI tool or database query logs).
- Consistency metrics: number of distinct definitions/fields created for “Category” and “Brand,” and number of mismatched totals found during QA.
What happened during build
Team Star created one dataset with DimProduct already containing Category and Brand attributes. Analysts dragged CategoryName and BrandName directly into visuals.
Team Snowflake created relationships Product → Category and Product → Brand. Some analysts used CategoryKey fields from Product, others used CategoryName from Category. A few visuals accidentally filtered on BrandName in DimBrand without the correct relationship activated in a specific dataset copy.
Observed outcomes (representative results)
- Dashboard development time: Team Star finished earlier because it hit fewer relationship issues and fielded fewer “where is this attribute?” questions.
- Query performance: Team Star had lower median latency on visuals dominated by Category/Brand group-bys (fewer joins). Team Snowflake showed more variability at peak concurrency due to additional joins and more complex query plans.
- Consistency: Team Star had fewer duplicated fields/definitions. Team Snowflake needed stricter dataset governance to prevent multiple ways of representing Category and Brand.
- Governance benefit: When BrandOwner changed for several brands, Team Snowflake updated DimBrand once and all dependent dashboards reflected the change immediately. Team Star had to reload/merge DimProduct to propagate the updated BrandOwner attributes.
What they changed after the comparison
They adopted a hybrid approach:
- Keep a star-like published BI dataset where CategoryName and BrandName are exposed as Product attributes for ease of use.
- Maintain DimBrand and DimCategory as governed reference tables in the warehouse for stewardship and cross-domain reuse.
- Automate a pipeline step that flattens Brand and Category attributes into the published DimProduct (or a view) so BI users get star simplicity while data stewards keep reference integrity.