From Raw Data to BI-Ready Data: Warehouses, Marts, and the Semantic Layer

Chapter 4

Estimated reading time: 10 minutes


The Analytics Data Pipeline as Layers

Reliable analytics rarely comes from a single “analytics database.” It comes from a sequence of layers, each with a specific purpose: (1) capture data as-is, (2) clean and standardize it, (3) curate it for specific analytical use cases, and (4) expose consistent business meaning through a semantic/metrics layer. Thinking in layers helps you control data quality, reduce duplicated logic, and make reporting consistent across teams.

Layer | Primary goal | Typical users | Common characteristics
Raw / Landing | Ingest data quickly and safely | Data engineers | Append-only, minimal transformation, source-aligned schemas
Cleaned / Conformed | Standardize, deduplicate, align keys and reference data | Data engineers, analytics engineers | Quality checks, conformed dimensions, consistent types/timezones
Curated Marts | Optimize for domain analytics and performance | Analysts, BI developers | Subject-area models, pre-joined or star-like structures, business-ready fields
Semantic / Metrics Layer | Define metrics and dimensions once, reuse everywhere | Analysts, BI tools, data consumers | Central metric definitions, governed dimensions, consistent filters

Layer 1: Raw / Landing Data (As-Is, Traceable, Recoverable)

The raw (or landing) layer is where data arrives from source systems with minimal changes. The goal is not to make it “nice”; it is to make it available, auditable, and replayable. If downstream logic changes, you can rebuild clean layers without re-extracting from the source.

What belongs in raw

  • Source-aligned tables (e.g., orders, order_lines, products) with the same columns as the source.
  • Ingestion metadata: load timestamp, source file name, batch id, extraction window.
  • Minimal type normalization only when required to store data (e.g., parse JSON into a variant column).

Practical checks at this layer

  • Completeness: did the expected tables/files arrive?
  • Volume anomalies: row counts compared to prior loads.
  • Schema drift detection: new/missing columns, changed types.
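The "volume anomalies" check above, for example, can be a simple query that compares each load's row count to a trailing average. This is only an illustrative sketch; raw.orders and load_timestamp are assumed names based on the ingestion metadata described earlier.

-- Illustrative volume-anomaly check; raw.orders and load_timestamp are assumed names.
WITH daily_counts AS (
  SELECT CAST(load_timestamp AS DATE) AS load_date,
         COUNT(*) AS row_count
  FROM raw.orders
  GROUP BY CAST(load_timestamp AS DATE)
)
SELECT load_date,
       row_count,
       AVG(row_count) OVER (
         ORDER BY load_date
         ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
       ) AS trailing_7_load_avg
FROM daily_counts
ORDER BY load_date DESC;

Loads whose row_count deviates sharply from the trailing average are worth investigating before the cleaned layer is rebuilt.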

Raw data is often not suitable for direct BI because it can contain duplicates, late-arriving updates, inconsistent codes, and mixed timezones/currencies.

Layer 2: Cleaned and Conformed Data (Standardized and Joinable)

The cleaned/conformed layer turns raw inputs into data that is consistent across sources and stable for downstream modeling. “Cleaned” focuses on correctness and usability; “conformed” focuses on shared definitions and keys so that different subject areas can be analyzed together.

Common transformations

  • Deduplication and change handling: pick the latest record per business key, handle updates and deletes.
  • Type and format standardization: timestamps to a standard timezone, numeric precision, standardized country/state codes.
  • Reference data alignment: map source codes to enterprise codes (e.g., channel codes, product categories).
  • Key management: create stable surrogate keys for shared dimensions (e.g., product, customer) if needed.
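The deduplication item above ("pick the latest record per business key") is commonly implemented with a window function. The sketch below assumes raw.orders carries an updated_at column and that order_id is the business key; both names are illustrative.

-- Illustrative deduplication: keep the latest version of each order.
-- raw.orders, order_id, and updated_at are assumed names.
CREATE TABLE conformed.orders AS
SELECT order_id,
       customer_id,
       order_status,
       order_completed_timestamp,
       sales_channel,
       currency_code
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM raw.orders o
) ranked
WHERE rn = 1;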

Conformed dimensions (why they matter)

Conformed dimensions are shared dimension tables (or shared dimension logic) used consistently across multiple marts. For example, if “Product” is defined differently in marketing and finance, you will get conflicting results when comparing performance across domains. Conformance ensures that “Product A” means the same thing everywhere, with the same hierarchy and attributes.


Examples of conformance decisions

  • Date: a single calendar with fiscal attributes (fiscal year/period/week) used across all marts.
  • Product: one product master with consistent category hierarchy and active/inactive flags.
  • Currency: standard conversion rules and a consistent “reporting currency” approach.
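For the currency decision, a single conversion rule applied in the conformed layer keeps every mart on the same reporting currency. The sketch below is illustrative only; conformed.fx_rates and its columns are assumed names.

-- Illustrative conversion to a single reporting currency; the fx table and its columns are assumptions.
SELECT ol.order_line_id,
       (ol.quantity * ol.unit_price) AS gross_line_amount_local,
       o.currency_code,
       (ol.quantity * ol.unit_price) * fx.rate_to_reporting_currency AS gross_line_amount_reporting
FROM conformed.order_lines ol
JOIN conformed.orders o
  ON o.order_id = ol.order_id
JOIN conformed.fx_rates fx
  ON fx.currency_code = o.currency_code
 AND fx.rate_date = CAST(o.order_completed_timestamp AS DATE);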

Layer 3: Curated Data Marts (Domain-Optimized BI-Ready Data)

A curated mart is a subject-area dataset designed for a specific analytical domain (sales, finance, inventory, support). It is shaped to match how people ask questions, often with pre-modeled relationships, standardized naming, and performance considerations (partitioning, clustering, aggregates).

What makes a mart “BI-ready”

  • Business-friendly fields: clear names, decoded statuses, consistent units.
  • Stable grain choices: the mart is explicit about what one row represents.
  • Reusable joins: dimensions connect cleanly to facts with consistent keys.
  • Performance: fewer complex joins for common dashboards, optional aggregates for frequent queries.

Warehouse vs Data Mart (and When to Use Each)

What a data warehouse is

A data warehouse is the central analytical store that integrates multiple sources and domains. It typically contains the raw-to-cleaned-to-curated progression (or at least the cleaned/conformed and curated layers) and provides shared, governed data assets used across the organization.

What a data mart is

A data mart is a subset of analytical data focused on a specific domain or team’s needs. A mart can live inside the warehouse (as curated schemas/tables) or be physically separate (a separate database or project), depending on governance and architecture.

When a warehouse-first approach is appropriate

  • Multiple domains need to reconcile (e.g., sales vs finance vs operations).
  • Shared dimensions are critical (product, customer, date, geography).
  • Governance and auditability matter (regulated reporting, executive KPIs).
  • Many downstream consumers need consistent definitions.

When a mart-first approach can be appropriate

  • A single domain needs fast delivery and has limited cross-domain dependencies.
  • Exploratory analytics where the model will evolve quickly.
  • Organizational boundaries require separation (cost allocation, access control).

Common pitfalls

  • Mart sprawl: multiple marts define “revenue” differently, causing metric conflicts.
  • Hidden transformations: business logic embedded in dashboards instead of data layers.
  • Inconsistent dimensions: different product hierarchies across marts.

The Semantic Layer / Metrics Layer (Business Meaning, Standardized)

Even with well-built marts, inconsistency often reappears when different reports implement their own calculations. The semantic layer (sometimes called a metrics layer) centralizes business definitions so that the same metric yields the same result across tools and teams.

What the semantic layer provides

  • Standardized metric definitions: one definition of revenue, margin, active customers, conversion rate.
  • Reusable measures: metrics built from other metrics (e.g., gross_margin = revenue - cogs).
  • Controlled dimensions: curated dimension attributes and hierarchies exposed consistently (e.g., product category levels).
  • Consistent filtering rules: definitions like “completed orders only” or “exclude test accounts.”
  • Governance: ownership, versioning, certification, and documentation.

What belongs in the semantic layer vs in marts

Put in marts | Put in semantic/metrics layer
Clean joins and keys, stable grains, domain tables | Metric formulas and business filters used across many reports
Derived columns that are structural (e.g., fiscal period on date dimension) | Metric naming, formatting, default aggregations, metric dependencies
Domain-specific enrichments (e.g., sales territory assignment) | Certified KPI set for executives and cross-team reporting

Example: standardizing “Revenue”

Without a semantic layer, one dashboard might compute revenue as “sum of order totals,” while another subtracts discounts, and a third excludes refunds. A semantic definition makes the rule explicit and reusable.

Revenue (net) = sum(line_item_amount) - sum(discount_amount) - sum(refund_amount) + sum(shipping_amount_if_included)

The exact components depend on your business policy, but the key is that the definition is centralized and referenced everywhere.
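One pragmatic way to centralize the rule, if a dedicated metrics tool is not in place, is a single governed view (or reusable model) that every report references. The sketch below applies the formula above and assumes one conformed table already carries all of the components; in practice they may come from several tables, as the walkthrough later in this chapter shows.

-- Illustrative centralized net-revenue definition; table and column names are assumptions.
-- Shipping is excluded here; include it only if your revenue policy says so.
CREATE VIEW semantic.revenue_net_daily AS
SELECT CAST(order_timestamp AS DATE) AS order_date,
       SUM(line_item_amount)
         - SUM(COALESCE(discount_amount, 0))
         - SUM(COALESCE(refund_amount, 0)) AS revenue_net
FROM conformed.order_lines
GROUP BY CAST(order_timestamp AS DATE);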

Walkthrough: Building a Simple Sales Performance Mart

This walkthrough shows how to go from cleaned/conformed data to a curated sales mart, using conformed date and product dimensions and a consistent revenue definition. The goal is to support questions like “Revenue by product category over time” and “Top products by net revenue this quarter.”

Step 1: Choose the mart’s purpose and grain

Subject area: Sales performance

Primary questions:

  • Net revenue trend by day/week/month
  • Net revenue by product and category
  • Units sold and average selling price

Grain decision: One row per order line item (order_id + order_line_id). This grain supports flexible slicing by product and date and allows accurate unit and revenue calculations.

Step 2: Identify required sources (from cleaned/conformed layer)

  • conformed.orders: order header status, order timestamps, customer id, currency, etc.
  • conformed.order_lines: product id, quantity, unit price, line discounts, line tax, etc.
  • conformed.refunds (if applicable): refunded amounts tied to order lines.
  • conformed.dim_date: calendar and fiscal attributes.
  • conformed.dim_product: product attributes and hierarchy.

Step 3: Ensure conformed dimensions exist and are stable

Conformed date dimension: must include a single row per date with attributes used across the business (calendar and fiscal). Example fields:

  • date_key (surrogate or integer like YYYYMMDD)
  • date
  • week_start_date, month_start_date
  • fiscal_year, fiscal_period
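A date dimension like this is usually generated once from a date spine and reused everywhere. The sketch below assumes a PostgreSQL-style generate_series and a fiscal calendar that matches the calendar year; both are assumptions to adapt.

-- Illustrative dim_date build; generate_series and the fiscal rules are assumptions.
CREATE TABLE conformed.dim_date AS
SELECT CAST(TO_CHAR(d, 'YYYYMMDD') AS INTEGER) AS date_key,
       CAST(d AS DATE)                         AS date,
       CAST(DATE_TRUNC('week', d) AS DATE)     AS week_start_date,
       CAST(DATE_TRUNC('month', d) AS DATE)    AS month_start_date,
       EXTRACT(YEAR FROM d)                    AS fiscal_year,   -- assumes fiscal year = calendar year
       EXTRACT(MONTH FROM d)                   AS fiscal_period  -- assumes fiscal period = calendar month
FROM generate_series(DATE '2015-01-01', DATE '2030-12-31', INTERVAL '1 day') AS t(d);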

Conformed product dimension: must map all order line products to a consistent product master. Example fields:

  • product_key (surrogate)
  • product_id (business/source id)
  • product_name
  • category, subcategory, brand
  • is_active

If multiple source systems provide product data, conformance includes matching rules and survivorship logic (which system “wins” for each attribute).
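Survivorship rules can be made explicit with a ranked source priority. The sketch below assumes two hypothetical source tables and a record-level priority (the ERP "wins"); attribute-level survivorship follows the same idea with COALESCE across prioritized sources, and surrogate key generation is omitted for brevity.

-- Illustrative record-level survivorship; source table names and priority order are assumptions.
CREATE TABLE conformed.dim_product AS
SELECT product_id, product_name, category, subcategory, brand, is_active
FROM (
  SELECT p.*,
         ROW_NUMBER() OVER (
           PARTITION BY product_id
           ORDER BY CASE source_system WHEN 'erp' THEN 1 WHEN 'ecommerce' THEN 2 ELSE 3 END
         ) AS rn
  FROM (
    SELECT 'erp' AS source_system, product_id, product_name, category, subcategory, brand, is_active
    FROM raw.erp_products
    UNION ALL
    SELECT 'ecommerce', product_id, product_name, category, subcategory, brand, is_active
    FROM raw.shop_products
  ) p
) ranked
WHERE rn = 1;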

Step 4: Define the revenue policy (consistent definition)

Pick a definition that matches how the business reports sales performance. A common choice for a sales mart is net revenue at the line level:

  • gross_line_amount = quantity * unit_price
  • net_line_amount = gross_line_amount - line_discount_amount - refund_amount

Decide explicitly:

  • Which order statuses count (e.g., only completed and shipped)
  • Whether to include tax and shipping in revenue
  • How to handle partial refunds and cancellations

Document these choices and implement them once (preferably as a reusable metric in the semantic/metrics layer, while the mart provides the necessary components).

Step 5: Build the fact table for the mart

Create a curated fact table such as mart_sales.fact_sales_line. It should include:

  • Keys: order_id, order_line_id, date_key, product_key
  • Measures/components: quantity, gross_line_amount, discount_amount, refund_amount, net_line_amount
  • Helpful attributes: order_status, sales_channel (if conformed), currency_code
-- Pseudocode / illustrative SQL structure (adapt to your platform and naming standards)
CREATE TABLE mart_sales.fact_sales_line AS
SELECT
  ol.order_id,
  ol.order_line_id,
  d.date_key,
  p.product_key,
  ol.quantity,
  (ol.quantity * ol.unit_price) AS gross_line_amount,
  COALESCE(ol.discount_amount, 0) AS discount_amount,
  COALESCE(r.refund_amount, 0) AS refund_amount,
  (ol.quantity * ol.unit_price) - COALESCE(ol.discount_amount, 0) - COALESCE(r.refund_amount, 0) AS net_line_amount,
  o.order_status,
  o.sales_channel,
  o.currency_code
FROM conformed.order_lines ol
JOIN conformed.orders o
  ON o.order_id = ol.order_id
JOIN conformed.dim_date d
  ON d.date = CAST(o.order_completed_timestamp AS DATE)
JOIN conformed.dim_product p
  ON p.product_id = ol.product_id
LEFT JOIN conformed.refunds r
  ON r.order_line_id = ol.order_line_id
WHERE o.order_status IN ('completed','shipped');

Step 6: Validate the mart with targeted tests

  • Reconciliation: total net revenue by day matches a trusted finance extract (within agreed rules).
  • Uniqueness: order_line_id is unique in fact_sales_line.
  • Referential integrity: every row has a valid date_key and product_key.
  • Reasonableness: no negative quantities unless returns are modeled that way; net revenue aligns with discount/refund logic.
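These checks translate directly into small queries (or into tests in a tool such as dbt). The two sketches below cover uniqueness and referential integrity; each should return zero rows when the mart is healthy.

-- Uniqueness: duplicate order_line_id values (expect zero rows).
SELECT order_line_id, COUNT(*) AS duplicate_count
FROM mart_sales.fact_sales_line
GROUP BY order_line_id
HAVING COUNT(*) > 1;

-- Referential integrity: fact rows without a matching product (expect zero rows).
SELECT f.order_line_id
FROM mart_sales.fact_sales_line f
LEFT JOIN conformed.dim_product p
  ON p.product_key = f.product_key
WHERE p.product_key IS NULL;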

Step 7: Add optional aggregates for performance (if needed)

If dashboards frequently query revenue by day and category, consider an aggregate table like mart_sales.agg_sales_daily_category:

  • Grain: date_key + category
  • Measures: total_net_revenue, total_units, order_count

Aggregates should be derived from the detailed fact table to avoid diverging logic.
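Derived from the detailed fact table, such an aggregate might look like the sketch below; counting distinct orders for order_count is one reasonable choice, not the only one.

-- Illustrative daily-by-category aggregate, derived from the detailed fact table.
CREATE TABLE mart_sales.agg_sales_daily_category AS
SELECT f.date_key,
       p.category,
       SUM(f.net_line_amount)     AS total_net_revenue,
       SUM(f.quantity)            AS total_units,
       COUNT(DISTINCT f.order_id) AS order_count
FROM mart_sales.fact_sales_line f
JOIN conformed.dim_product p
  ON p.product_key = f.product_key
GROUP BY f.date_key, p.category;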

Step 8: Expose metrics through the semantic/metrics layer

Use the semantic layer to define certified metrics that reference the mart’s fields:

  • Net Revenue: SUM(net_line_amount)
  • Units Sold: SUM(quantity)
  • Average Selling Price (ASP): Net Revenue / Units Sold

Also define controlled dimensions and default filters:

  • Default filter: exclude test products/accounts if applicable
  • Time logic: standard “this fiscal quarter” based on dim_date
  • Product hierarchy: category → subcategory → product
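How these definitions are expressed depends on your semantic-layer tool. As a tool-agnostic sketch, the certified metrics and the ASP ratio could be exposed through a governed view like the one below; names mirror the walkthrough, and the NULLIF guard avoids division by zero.

-- Illustrative certified-metrics view (a pragmatic option when no metrics tool is in place).
CREATE VIEW semantic.sales_metrics_by_fiscal_period AS
SELECT d.fiscal_year,
       d.fiscal_period,
       SUM(f.net_line_amount)                              AS net_revenue,
       SUM(f.quantity)                                     AS units_sold,
       SUM(f.net_line_amount) / NULLIF(SUM(f.quantity), 0) AS average_selling_price
FROM mart_sales.fact_sales_line f
JOIN conformed.dim_date d
  ON d.date_key = f.date_key
GROUP BY d.fiscal_year, d.fiscal_period;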

This ensures that a “Net Revenue” card in one dashboard matches “Net Revenue” in another, even if built by different teams or tools.

Now answer the exercise about the content:

In a layered analytics data pipeline, which layer primarily centralizes metric formulas and consistent business filters so the same KPI is calculated the same way across tools and teams?

The semantic/metrics layer defines standardized metrics, governed dimensions, and consistent filtering rules so KPIs (like revenue) are computed the same way across reports, tools, and teams.

Next chapter

BI Team Roles and Handoffs: Analyst, Analytics Engineer, Data Engineer
