All courses > Technology and Programming > Data Science and Business Intelligence ::

Keys in Dimensional Models: Natural Keys, Surrogate Keys, and Referential Integrity

Capítulo 7

Estimated reading time: 8 minutes

+ Exercise

Natural (Business) Keys vs Surrogate Keys

A key identifies a row. In dimensional models, you typically deal with two kinds of identifiers for dimensions:

Natural (business) key: An identifier that comes from the business or a source system (e.g., CustomerNumber, EmailAddress, ProductCode, SKU). It has meaning outside the warehouse.
Surrogate key: A warehouse-generated identifier (usually an integer) with no business meaning (e.g., CustomerKey, ProductKey). It is used as the primary key of the dimension and as the foreign key in fact tables.

Warehouses often prefer surrogate keys in dimensions because natural keys are not stable enough for long-term, integrated analytics. Natural keys can change, be reused, collide across systems, or arrive late. Surrogate keys provide a stable, warehouse-controlled identifier that supports consistent joins and controlled history.

What “referential integrity” means in a warehouse

Referential integrity means every foreign key value in a fact table matches a valid primary key value in the referenced dimension (or a defined “Unknown”/“Not Applicable” row). In practice, this ensures that facts always join to a dimension row, preventing lost measures and inconsistent totals in reports.

Why Surrogate Keys Are Preferred in Dimensions

1) History tracking (SCD) without breaking fact relationships

When a dimension attribute changes (e.g., customer segment, product category), you often want to preserve history for reporting “as it was at the time.” With surrogate keys, you can store multiple versions of the same business entity, each with its own surrogate key, and point facts to the correct version.

Common pattern: the dimension keeps the business key plus effective dates and a current flag. Each version gets a new surrogate key. Facts keep the surrogate key that was valid at the event time.

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

2) Late-arriving data and “unknown” handling

Sometimes facts arrive before the corresponding dimension record (e.g., a sale arrives before the customer master record is loaded). If you require strict referential integrity, you still need a valid dimension key to load the fact. Surrogate keys enable a controlled approach:

Insert an Unknown dimension row (e.g., CustomerKey = 0) and load the fact referencing it.
Optionally insert an inferred member (a placeholder dimension row for a specific business key) so facts can point to that member immediately, then enrich it later when the full dimension record arrives.

3) Avoiding source-system key collisions and reuse

In integrated warehouses, the same numeric ID can exist in multiple source systems with different meanings. Also, some systems reuse IDs after deletion or migration. If you used natural keys directly as dimension primary keys, collisions would cause incorrect joins and blended reporting. Surrogate keys isolate the warehouse from these issues while still storing the original business identifiers as attributes for traceability.

Structured Loading Guide: From Business Key to Surrogate Key to Facts

Step 1: Choose the business key (and define its scope)

Pick a stable identifier that represents the business entity. If you have multiple source systems, define whether the business key is:

Enterprise-wide (one identifier shared across systems), or
Source-scoped (you must include SourceSystem in the key to make it unique).

Practical rule: if two systems can produce the same identifier value for different entities, your business key must be composite (e.g., SourceSystem + SourceCustomerId).

Step 2: Generate surrogate keys in the warehouse

Create a surrogate key column in each dimension (e.g., CustomerKey, ProductKey). Typical implementation is an integer identity/sequence. The surrogate key is assigned by the warehouse when a new dimension row/version is inserted.

Also create a unique constraint (or equivalent) on the business key columns that define uniqueness for a version. For SCD Type 2, uniqueness is often enforced on (BusinessKey, EffectiveStartDate) or (BusinessKey, IsCurrent).

Step 3: Load the dimension (upsert with history rules)

Dimension load logic typically:

Lookup by business key (and, for SCD2, current row).
If not found: insert a new row and assign a new surrogate key.
If found and attributes changed (per your change detection rules): expire the current row and insert a new version with a new surrogate key.
If found and no relevant changes: do nothing (or update non-historized attributes).

Maintain “Unknown” (and optionally “Not Applicable”) rows with fixed surrogate keys (commonly 0 and -1) so facts can always reference a valid dimension row.

Step 4: Load facts with correct foreign keys

When loading a fact row:

Determine the business key(s) for each dimension reference from the source event.
Lookup the dimension surrogate key using the business key and (if needed) the event date to select the correct historical version.
If no dimension row exists yet: use Unknown (0) or create an inferred member and use its surrogate key.
Insert the fact row with the surrogate key foreign keys.

This is where referential integrity is enforced: every fact foreign key must resolve to a dimension surrogate key.

Example 1: Customer Dimension — Source IDs Change and Late-Arriving Customers

Scenario

You receive sales transactions from an Order system. Customer master data comes from a CRM system. The CRM migrates and changes customer IDs, and sometimes orders arrive before the CRM record is available.

Order system provides: SourceSystem = 'ORDER', CustomerId = 12345
CRM provides: SourceSystem = 'CRM', CustomerId = 12345 (different person than ORDER’s 12345) and later migrates to CustomerId = 98765

If you used CustomerId alone as the dimension key, you would collide across systems and misattribute sales.

Dimension design (relevant columns)

DimCustomer( CustomerKey PK, SourceSystem, SourceCustomerId, CustomerName, Segment, EffectiveStartDate, EffectiveEndDate, IsCurrent )

Business key choice: (SourceSystem, SourceCustomerId). This prevents collisions.

Late-arriving order fact before customer master

An order arrives with (SourceSystem='ORDER', SourceCustomerId=12345), but DimCustomer has no row yet.

Option A (simpler): load fact with CustomerKey = 0 (Unknown).
Option B (better traceability): create an inferred member in DimCustomer with minimal attributes and a new CustomerKey, then load the fact referencing that key.

With inferred members, you preserve the ability to later update that same customer row (or replace it with a full SCD2 process) without losing the fact-to-customer linkage.

Customer ID changes in CRM (history tracking)

Suppose CRM changes a customer’s identifier from 12345 to 98765 due to migration. In the warehouse, you do not want historical facts to “move” to a different customer unintentionally, and you also want consistent reporting for the same real-world person.

Two common approaches:

Keep source-scoped business keys: treat (CRM,12345) and (CRM,98765) as different business keys, and use a separate cross-reference mapping to link them to a single enterprise customer identity.
Adopt an enterprise business key: if you have a stable enterprise identifier (e.g., GlobalCustomerId), use that as the business key and store CRM IDs as attributes. Then the surrogate key versions track attribute changes, not identifier churn.

In both cases, facts continue to reference surrogate keys, so historical joins remain stable. Reporting can be made consistent by grouping on the chosen enterprise identifier or by using the current row logic depending on the reporting requirement.

Example 2: Product Dimension — Overlapping SKUs Across Systems and Reclassification

Scenario

You integrate product data from two ERPs. Both use numeric ProductId values that overlap (e.g., ProductId=100 exists in both systems but represents different items). Additionally, product category assignments change over time and must be reported historically.

Dimension design (relevant columns)

DimProduct( ProductKey PK, SourceSystem, SourceProductId, SKU, ProductName, Category, Brand, EffectiveStartDate, EffectiveEndDate, IsCurrent )

Business key choice: (SourceSystem, SourceProductId). SKU may not be unique across systems, and ProductName can change, so neither is safe as a primary identifier.

Collision avoidance

ERP_A sends SourceProductId=100 for “Widget A”, ERP_B sends SourceProductId=100 for “Gadget B”. Because the business key includes SourceSystem, these become two distinct business keys and therefore two distinct surrogate keys:

(ERP_A,100) → ProductKey = 5012
(ERP_B,100) → ProductKey = 8840

Facts from each ERP join to the correct product row, preventing blended metrics.

Category changes with history preserved

Suppose (ERP_A,100) changes Category from “Accessories” to “Components” on 2025-07-01. With SCD2, you expire the old row and insert a new version:

Old version: ProductKey=5012, Category=Accessories, EffectiveEndDate=2025-06-30, IsCurrent=0
New version: ProductKey=9321, Category=Components, EffectiveStartDate=2025-07-01, IsCurrent=1

When loading facts, you lookup ProductKey using the transaction date to select the correct version. Sales before 2025-07-01 report under “Accessories,” and sales after report under “Components,” without rewriting facts.

Practical Lookup Patterns for Fact Loading

Lookup current dimension row (no historical versioning needed)

SELECT CustomerKey FROM DimCustomer WHERE SourceSystem = ? AND SourceCustomerId = ? AND IsCurrent = 1;

Lookup dimension row as-of event date (SCD2)

SELECT ProductKey FROM DimProduct WHERE SourceSystem = ? AND SourceProductId = ? AND EffectiveStartDate <= ? AND EffectiveEndDate >= ?;

Enforcing referential integrity with Unknown / inferred members

If the lookup returns no row:

Use the Unknown key (0) to keep the load moving and preserve totals, or
Insert an inferred dimension row for that business key and return its new surrogate key.

Either way, the fact table always contains valid foreign keys, and reporting remains consistent even when dimension data arrives late or changes over time.

Now answer the exercise about the content:

In a dimensional warehouse, what is the main purpose of using surrogate keys in dimension tables instead of relying on natural (business) keys?

You are right! Congratulations, now go to the next page

You missed! Try again.

Surrogate keys are warehouse-generated and stable, so facts can always reference a valid dimension row (including Unknown/inferred members) and can point to the correct historical SCD version without breaking joins.