Reliability and Quality in BI: Testing, Data Validation, and Monitoring

Chapter 9

Estimated reading time: 10 minutes

What “Reliability” and “Quality” Mean in BI

In BI, reliability means users can depend on dashboards and reports to be correct, consistent, and available when expected. Quality means the underlying data meets defined expectations (completeness, uniqueness, validity, timeliness, and consistency) so metrics behave predictably over time.

A practical way to manage this is to treat data like a product with explicit requirements. Instead of assuming pipelines “usually work,” you define checks that must pass before data is published, and you continuously monitor for regressions after release.

Key quality dimensions to operationalize

  • Completeness: required fields are populated; expected records exist.
  • Uniqueness: no unintended duplicates at the intended grain.
  • Validity: values fall within accepted sets or ranges.
  • Consistency: relationships hold (e.g., foreign keys match), totals reconcile within known tolerances.
  • Freshness: data arrives on schedule; late data is detected and handled.
  • Stability: changes to schemas and upstream logic do not silently break downstream outputs.

Common Data Issues in BI (and Why They Happen)

Missing data

Symptoms: sudden drops in counts, blank dimension attributes, missing days in time series.

Common causes: upstream extract failures, filtering bugs, nulls introduced by joins, new source values not mapped, permissions changes, partial loads.

Duplicates

Symptoms: inflated revenue/orders, double-counted events, multiple “current” records for the same entity.

Common causes: reprocessing without idempotency, late-arriving updates inserted as new rows, incorrect merge keys, source sending repeated events.

Late-arriving facts

Symptoms: yesterday’s numbers change today; period totals drift; “final” reports are revised.

Common causes: delayed upstream systems, batch windows, mobile/offline events, backfills, corrections posted after the fact.

Schema drift

Symptoms: pipeline failures, columns suddenly null, type mismatches, new fields appear, renamed fields break transformations.

Common causes: upstream releases, vendor API changes, event tracking changes, unannounced database migrations.

Detecting Issues with Validation Rules and Automated Tests

Validation rules are explicit expectations about data. Automated tests run those rules on a schedule (or during pipeline runs) and fail fast when something deviates. A good test suite mixes:

  • Structural tests: schema, types, required columns.
  • Content tests: null checks, accepted values, ranges.
  • Relational tests: referential integrity, uniqueness at grain.
  • Volume/timeliness tests: row counts, freshness, distribution shifts.
  • Business-rule tests: domain logic (e.g., refunded orders reduce net revenue).
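
As an illustration of the last category, a business-rule test can be written as a query that returns only violating rows. The sketch below is an assumption-laden example: fact_orders, fact_refunds, and the revenue columns are illustrative names, and net revenue is assumed to equal gross revenue minus refunds.

-- Sketch of a business-rule test: net revenue should equal gross revenue minus refunds
-- (fact_orders, fact_refunds, and the column names are illustrative; adjust to your model)
SELECT o.order_id,
       o.gross_revenue,
       COALESCE(r.refund_amount, 0) AS refund_amount,
       o.net_revenue
FROM fact_orders o
LEFT JOIN (
    SELECT order_id, SUM(amount) AS refund_amount
    FROM fact_refunds
    GROUP BY order_id
) r ON r.order_id = o.order_id
WHERE ABS(o.net_revenue - (o.gross_revenue - COALESCE(r.refund_amount, 0))) > 0.01;
-- Any returned row violates the rule; an empty result means the test passes.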

Designing effective rules: be explicit about grain and tolerance

Most BI incidents come from ambiguous expectations. For each critical table or metric, document:

  • Grain: what makes a row unique (e.g., order_id, or order_id + line_id).
  • Freshness SLA: how late is acceptable (e.g., “data complete by 08:00 UTC”).
  • Allowed change: acceptable day-over-day variance (e.g., ±20% unless flagged as a known event).
  • Known exceptions: late-arriving updates, cancellations, backfills, partial days.
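
Once the tolerance is documented, it can be encoded directly in a check. A minimal sketch, assuming a daily fact_orders table with an order_date column (interval syntax varies by warehouse):

-- Sketch: flag a day-over-day change beyond the documented ±20% tolerance
WITH daily AS (
  SELECT order_date, COUNT(*) AS cnt
  FROM fact_orders
  WHERE order_date >= CURRENT_DATE - INTERVAL '2 day'
    AND order_date < CURRENT_DATE
  GROUP BY order_date
)
SELECT y.cnt AS yesterday_cnt, p.cnt AS prior_cnt
FROM daily y
JOIN daily p
  ON y.order_date = CURRENT_DATE - INTERVAL '1 day'
 AND p.order_date = CURRENT_DATE - INTERVAL '2 day'
WHERE y.cnt NOT BETWEEN 0.8 * p.cnt AND 1.2 * p.cnt;  -- a returned row breaches the tolerance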

Core Test Patterns (with Practical Examples)

1) Row count and volume anomaly checks

Goal: detect missing loads, duplicates, or upstream outages.

How: compare today’s row count to recent history, or enforce minimum expected counts per partition (e.g., per day).

-- Example: ensure at least N events arrived for yesterday (partitioned by event_date)
SELECT COUNT(*) AS cnt
FROM fact_events
WHERE event_date = CURRENT_DATE - INTERVAL '1 day';

Rule examples:

  • Daily partitions must exist for the last 7 days.
  • Yesterday’s count must be within 50–150% of the 14-day median (tune per dataset).
  • No single source system contributes 0 rows if it usually contributes > X.
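
The median-based rule above can be expressed as a query that returns a row only when yesterday's volume falls outside the band. A sketch, reusing the fact_events table from the example and assuming a warehouse that supports PERCENTILE_CONT (the exact median function varies by dialect):

-- Sketch: compare yesterday's count to the 14-day median
WITH daily AS (
  SELECT event_date, COUNT(*) AS cnt
  FROM fact_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '15 day'
    AND event_date < CURRENT_DATE
  GROUP BY event_date
),
baseline AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cnt) AS median_cnt
  FROM daily
  WHERE event_date < CURRENT_DATE - INTERVAL '1 day'
),
yesterday AS (
  SELECT cnt FROM daily WHERE event_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT y.cnt AS yesterday_cnt, b.median_cnt
FROM yesterday y
CROSS JOIN baseline b
WHERE y.cnt NOT BETWEEN 0.5 * b.median_cnt AND 1.5 * b.median_cnt;  -- a returned row is an alert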

2) Freshness checks

Goal: detect late or stalled pipelines before users notice.

How: track the maximum ingestion timestamp or last successful load marker.

-- Example: freshness based on latest loaded timestamp
SELECT MAX(loaded_at) AS last_load
FROM fact_orders;

Rule examples:

  • last_load must be within 2 hours of current time for hourly feeds.
  • Daily tables must have a completed partition by 08:00 UTC.
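
The first rule can be turned into a test that returns a row only when the feed is stale, for example (reusing the fact_orders query above; interval syntax varies by warehouse):

-- Sketch: return a row when the latest load is more than 2 hours old
SELECT MAX(loaded_at) AS last_load
FROM fact_orders
HAVING MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '2 hour';
-- An empty result means the feed is fresh; any returned row should trigger an alert.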

3) Referential integrity checks

Goal: ensure facts link to dimensions; prevent “unknown” categories from silently growing.

How: verify foreign keys exist in the referenced dimension (or that unknown handling is intentional and bounded).

-- Example: orphan fact rows
SELECT COUNT(*) AS orphan_cnt
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

Rule examples:

  • Orphan count must be 0 (or < a small threshold if late dimension updates are expected).
  • Percentage of “unknown” dimension keys must be < 0.5%.
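
The percentage rule can reuse the same join. The sketch below computes the share of fact rows that fail to match the dimension, using the fact_sales and dim_customer names from the example:

-- Sketch: share of fact rows without a matching customer dimension row
SELECT
  100.0 * SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS unknown_pct
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_id = c.customer_id;
-- Alert when unknown_pct exceeds 0.5.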

4) Uniqueness checks (deduplication safety)

Goal: prevent double counting by enforcing uniqueness at the intended grain.

How: assert that the natural key (or surrogate key) is unique.

-- Example: duplicates by business key
SELECT order_id, COUNT(*) AS c
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

Rule examples:

  • order_id must be unique in the “current orders” table.
  • order_id + line_id must be unique in the order lines table.
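
The same pattern extends to a compound grain. The sketch below assumes a hypothetical fact_order_lines table keyed by order_id + line_id:

-- Sketch: duplicates at the order-line grain
SELECT order_id, line_id, COUNT(*) AS c
FROM fact_order_lines
GROUP BY order_id, line_id
HAVING COUNT(*) > 1;
-- An empty result means the grain is intact.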

5) Accepted values and range checks

Goal: catch invalid codes, broken mappings, and type issues that still “load.”

How: enforce enumerations and numeric boundaries.

-- Example: accepted status values
SELECT status, COUNT(*)
FROM fact_orders
GROUP BY status;
-- Validate status in ('created', 'paid', 'shipped', 'cancelled', 'refunded')

Rule examples:

  • status must be in an approved set.
  • quantity must be >= 0 and < 10,000.
  • discount_pct must be between 0 and 1.
  • order_total must be >= 0 (or allow negatives only for refunds, explicitly).
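
These rules can also be combined into a single query that returns only violating rows, which keeps scheduled tests simple. A sketch against the fact_orders example (the column names are assumptions):

-- Sketch: flag rows that break any range or validity rule in one pass
SELECT order_id, status, quantity, discount_pct, order_total
FROM fact_orders
WHERE status NOT IN ('created', 'paid', 'shipped', 'cancelled', 'refunded')
   OR quantity < 0 OR quantity >= 10000
   OR discount_pct NOT BETWEEN 0 AND 1
   OR (order_total < 0 AND status <> 'refunded');  -- negatives allowed only for refunds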

6) Schema drift detection

Goal: detect upstream changes early and prevent silent nulls or broken transformations.

How: validate expected columns and types; alert on new/removed columns; enforce contracts for critical feeds.

Rule examples:

  • Required columns must exist (fail pipeline if missing).
  • Column types must match expected (e.g., order_date remains a date).
  • New columns are allowed but must be logged for review (to avoid surprises).
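
One lightweight way to implement the first rule is to compare an expected column list against the warehouse catalog. A sketch using INFORMATION_SCHEMA, which most warehouses expose (table names, casing, and the VALUES syntax may need adjusting per dialect):

-- Sketch: required columns that are missing from fact_orders
SELECT expected.column_name
FROM (VALUES ('order_id'), ('order_date'), ('status')) AS expected(column_name)
LEFT JOIN information_schema.columns c
  ON c.table_name = 'fact_orders'
 AND c.column_name = expected.column_name
WHERE c.column_name IS NULL;  -- any returned row should fail the pipeline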

Step-by-Step: Building a Practical Testing Strategy

Step 1: Identify “critical data products”

List the tables and metrics that drive decisions or external reporting (e.g., revenue, active users, inventory). Prioritize tests there first.

Step 2: Define expectations in plain language

For each critical dataset, write 5–10 rules that a business stakeholder would agree with, then translate them into checks. Example for orders:

  • Every order has an order_id, created_at, and status.
  • Statuses are from a known list.
  • Orders link to a customer (or are explicitly “guest”).
  • Daily order volume should not drop to near zero without an incident.
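
Each plain-language rule then maps to one check. For instance, the first rule above becomes a query that counts incomplete orders (a sketch, assuming the fact_orders names used earlier):

-- Sketch: orders missing required fields (every order has order_id, created_at, and status)
SELECT COUNT(*) AS violations
FROM fact_orders
WHERE order_id IS NULL
   OR created_at IS NULL
   OR status IS NULL;
-- Expected result: 0.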

Step 3: Implement automated tests at the right stages

  • Ingestion stage: schema, freshness, raw row counts.
  • Transformation stage: uniqueness, referential integrity, null checks, accepted values.
  • Semantic/reporting stage: metric-level checks, reconciliation totals, distribution checks.

Step 4: Decide what happens on failure

Not every failure should block publishing. Use severity levels:

  • Blocker: publish is stopped (e.g., missing partition, broken schema, massive duplicates).
  • Degraded: publish continues but alerts are sent (e.g., small orphan rate, minor freshness delay).
  • Informational: logged for review (e.g., new column added upstream).

Step 5: Make tests maintainable

  • Keep rules close to the dataset they validate.
  • Version control test definitions with code changes.
  • Review thresholds quarterly; adjust for seasonality and growth.

Reconciliation: Comparing Warehouse Numbers to Source Systems

Reconciliation is the practice of comparing key totals in the warehouse (or BI layer) to authoritative source systems to confirm that the pipeline is complete and logic is aligned. The goal is not to force exact equality in all cases; it is to explain differences and ensure they stay within expected bounds.

What to reconcile (common examples)

  • Financial: gross sales, net sales, refunds, tax, shipping, payouts.
  • Operational: order counts, shipment counts, tickets created/closed.
  • Customer: new customers, active customers, churn events.

Expected differences (and how to explain them)

  • Timing / cutoffs: the source uses local time while the warehouse uses UTC, or batch windows differ. Reconcile in a consistent time zone and define "business day" cutoffs.
  • Late-arriving updates: orders are updated after initial creation, or events arrive late. Use a lookback window (e.g., recompute the last 7 days) and track revisions.
  • Returns and cancellations: the source may show gross at purchase time while the warehouse shows net after returns. Reconcile gross and net separately, and document which metric is used where.
  • Data scope: the warehouse excludes test orders, internal users, or certain channels. Ensure filters match and maintain a shared exclusion list.
  • Rounding and currency: rounding rules differ, or FX rates are applied at different times. Standardize rounding and FX logic; reconcile at both transaction and summary levels.

Step-by-step: Implement a reconciliation check

  1. Pick a control total: e.g., “yesterday gross sales” from the payments system.
  2. Align definitions: same time zone, same inclusion/exclusion rules, same currency.
  3. Compute both sides: source total vs warehouse total for the same period.
  4. Calculate variance: absolute difference and percent difference.
  5. Set tolerance: e.g., 0.5% or a fixed amount, plus a separate tolerance for known late-arriving adjustments.
  6. Alert and annotate: if variance exceeds tolerance, trigger an incident; if within tolerance, store the result for trend analysis.
-- Pseudocode: reconciliation result record (date, source_total, wh_total, diff, diff_pct)
diff = wh_total - source_total
diff_pct = diff / NULLIF(source_total, 0)
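
In SQL, this can be a scheduled query that joins a daily control total from the source to the warehouse total; the table and column names below (source_daily_totals, wh_daily_totals) are illustrative:

-- Sketch: daily reconciliation with a 0.5% tolerance
SELECT
  s.business_date,
  s.source_total,
  w.wh_total,
  w.wh_total - s.source_total AS diff,
  (w.wh_total - s.source_total) / NULLIF(s.source_total, 0) AS diff_pct
FROM source_daily_totals s
JOIN wh_daily_totals w ON w.business_date = s.business_date
WHERE ABS(w.wh_total - s.source_total) > 0.005 * ABS(s.source_total);  -- outside tolerance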

Reconciliation tips that prevent false alarms

  • Reconcile multiple components (gross, refunds, cancellations) instead of only net.
  • Use a rolling window (e.g., last 14 days) to detect drift and late postings.
  • Track variance trends: a small but growing gap can signal a mapping issue.

Monitoring and Incident Response Basics

What to monitor

  • Pipeline health: job success/failure, runtime, retries, backlog/lag.
  • Freshness: last load time per critical table.
  • Data quality signals: test pass rates, null spikes, duplicate rates, orphan rates.
  • Metric stability: sudden changes in key KPIs beyond expected seasonality.
  • Usage impact: which dashboards/teams depend on affected datasets.

Alerting thresholds: choose actionable signals

Alerts should be rare enough that people trust them, but sensitive enough to catch real issues early. Common patterns:

  • Static thresholds: freshness > 2 hours late; orphan rate > 0.5%.
  • Dynamic thresholds: row count deviates > 3 standard deviations from 30-day baseline.
  • Multi-signal alerts: only page when freshness is late and row count is low (reduces noise).
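
A dynamic threshold like the second pattern needs a history of past loads to compare against. The sketch below assumes a hypothetical load_audit table that records a row count per table per day, and a warehouse with AVG and STDDEV aggregates:

-- Sketch: alert when today's row count deviates more than 3 standard deviations
-- from the trailing 30-day baseline (load_audit is an illustrative audit table)
WITH history AS (
  SELECT AVG(row_count) AS avg_cnt, STDDEV(row_count) AS std_cnt
  FROM load_audit
  WHERE table_name = 'fact_orders'
    AND load_date >= CURRENT_DATE - INTERVAL '30 day'
    AND load_date < CURRENT_DATE
),
today AS (
  SELECT row_count
  FROM load_audit
  WHERE table_name = 'fact_orders'
    AND load_date = CURRENT_DATE
)
SELECT t.row_count, h.avg_cnt, h.std_cnt
FROM today t
CROSS JOIN history h
WHERE ABS(t.row_count - h.avg_cnt) > 3 * h.std_cnt;  -- a returned row pages the on-call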

Triage steps (a practical runbook)

  1. Confirm the symptom: which test/monitor fired, which dataset, which time window.
  2. Assess blast radius: what dashboards, metrics, and stakeholders are affected.
  3. Check recent changes: deployments, upstream schema changes, new filters, backfills.
  4. Identify failure mode: missing load vs duplicates vs late-arriving vs schema drift.
  5. Mitigate quickly: pause publishing, roll back, apply a hotfix, or mark data as delayed.
  6. Validate the fix: rerun tests, re-reconcile control totals, confirm dashboards.
  7. Document: root cause, detection gap, prevention action (new test/threshold/runbook update).

Communicating data incidents without breaking trust

Trust is maintained by being fast, clear, and consistent—without overpromising. A good incident message includes:

  • What happened: “Orders dashboard is showing undercounted revenue for 2026-01-15.”
  • Impact: which metrics/reports are affected and which are not.
  • Time window: when the issue started and the periods impacted.
  • Workaround: use a source report, last known good snapshot, or exclude the affected date.
  • Next update time: when stakeholders will hear back (even if not resolved).

Use careful language that distinguishes data delay from data incorrectness. For example:

  • Delay: “Data for today is not yet complete; numbers will update after 08:00 UTC.”
  • Incorrect: “A deduplication bug inflated orders by ~3% for the last 2 days; we are correcting and will backfill.”

After resolution, share a brief post-incident note focused on prevention: what guardrail will stop this from recurring (new uniqueness test, stronger freshness alert, schema contract, improved reconciliation tolerance rules).

Now answer the exercise about the content:

When setting up automated data validation in BI, which approach best reduces incidents caused by ambiguous expectations?

Reliability improves when expectations are explicit: defining grain, freshness SLAs, and tolerances enables automated tests to detect missing data, duplicates, late loads, and other regressions early.

Next chapter

Delivering BI Outcomes: From Requirements to Dashboards and Decision Support
