What “Reliability” and “Quality” Mean in BI
In BI, reliability means users can depend on dashboards and reports to be correct, consistent, and available when expected. Quality means the underlying data meets defined expectations (completeness, uniqueness, validity, timeliness, and consistency) so metrics behave predictably over time.
A practical way to manage this is to treat data like a product with explicit requirements. Instead of assuming pipelines “usually work,” you define checks that must pass before data is published, and you continuously monitor for regressions after release.
Key quality dimensions to operationalize
- Completeness: required fields are populated; expected records exist.
- Uniqueness: no unintended duplicates at the intended grain.
- Validity: values fall within accepted sets or ranges.
- Consistency: relationships hold (e.g., foreign keys match), totals reconcile within known tolerances.
- Freshness: data arrives on schedule; late data is detected and handled.
- Stability: schemas and upstream logic changes do not silently break downstream outputs.
Common Data Issues in BI (and Why They Happen)
Missing data
Symptoms: sudden drops in counts, blank dimension attributes, missing days in time series.
Common causes: upstream extract failures, filtering bugs, nulls introduced by joins, new source values not mapped, permissions changes, partial loads.
Duplicates
Symptoms: inflated revenue/orders, double-counted events, multiple “current” records for the same entity.
Common causes: reprocessing without idempotency, late-arriving updates inserted as new rows, incorrect merge keys, source sending repeated events.
Late-arriving facts
Symptoms: yesterday’s numbers change today; period totals drift; “final” reports are revised.
Common causes: delayed upstream systems, batch windows, mobile/offline events, backfills, corrections posted after the fact.
Schema drift
Symptoms: pipeline failures, columns suddenly null, type mismatches, new fields appear, renamed fields break transformations.
Common causes: upstream releases, vendor API changes, event tracking changes, unannounced database migrations.
Detecting Issues with Validation Rules and Automated Tests
Validation rules are explicit expectations about data. Automated tests run those rules on a schedule (or during pipeline runs) and fail fast when something deviates. A good test suite mixes:
- Structural tests: schema, types, required columns.
- Content tests: null checks, accepted values, ranges.
- Relational tests: referential integrity, uniqueness at grain.
- Volume/timeliness tests: row counts, freshness, distribution shifts.
- Business-rule tests: domain logic (e.g., refunded orders reduce net revenue).
Designing effective rules: be explicit about grain and tolerance
Most BI incidents come from ambiguous expectations. For each critical table or metric, document:
- Grain: what makes a row unique (e.g., order_id, or order_id + line_id).
- Freshness SLA: how late is acceptable (e.g., “data complete by 08:00 UTC”).
- Allowed change: acceptable day-over-day variance (e.g., ±20% unless flagged as a known event).
- Known exceptions: late-arriving updates, cancellations, backfills, partial days.
Core Test Patterns (with Practical Examples)
1) Row count and volume anomaly checks
Goal: detect missing loads, duplicates, or upstream outages.
How: compare today’s row count to recent history, or enforce minimum expected counts per partition (e.g., per day).
```sql
-- Example: ensure at least N events arrived for yesterday (partitioned by event_date)
SELECT COUNT(*) AS cnt
FROM fact_events
WHERE event_date = CURRENT_DATE - INTERVAL '1 day';
```
Rule examples:
- Daily partitions must exist for the last 7 days.
- Yesterday’s count must be within 50–150% of the 14-day median (tune per dataset).
- No single source system contributes 0 rows if it usually contributes > X.
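As a sketch of the median rule above, the query below compares yesterday's count with the median of the previous 14 daily counts, reusing fact_events and event_date from the earlier example; PERCENTILE_CONT is widely supported but exact median syntax varies by warehouse, and the 50–150% band is the tolerance to tune.

```sql
-- Sketch: flag yesterday's volume if it falls outside 50–150% of the 14-day median.
WITH daily AS (
  SELECT event_date, COUNT(*) AS cnt
  FROM fact_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '15 days'
    AND event_date < CURRENT_DATE
  GROUP BY event_date
),
baseline AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cnt) AS median_cnt
  FROM daily
  WHERE event_date < CURRENT_DATE - INTERVAL '1 day'
),
yesterday AS (
  SELECT cnt
  FROM daily
  WHERE event_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT y.cnt AS yesterday_cnt,
       b.median_cnt,
       CASE WHEN y.cnt BETWEEN 0.5 * b.median_cnt AND 1.5 * b.median_cnt
            THEN 'pass' ELSE 'fail' END AS volume_check
FROM yesterday y
CROSS JOIN baseline b;
```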
2) Freshness checks
Goal: detect late or stalled pipelines before users notice.
How: track the maximum ingestion timestamp or last successful load marker.
```sql
-- Example: freshness based on latest loaded timestamp
SELECT MAX(loaded_at) AS last_load
FROM fact_orders;
```
Rule examples:
- last_load must be within 2 hours of current time for hourly feeds.
- Daily tables must have a completed partition by 08:00 UTC.
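A minimal sketch of the two-hour rule for hourly feeds, again assuming fact_orders.loaded_at and standard interval arithmetic:

```sql
-- Sketch: fail the freshness check when the latest load is more than 2 hours old.
SELECT MAX(loaded_at) AS last_load,
       CASE WHEN MAX(loaded_at) >= CURRENT_TIMESTAMP - INTERVAL '2 hours'
            THEN 'pass' ELSE 'fail' END AS freshness_check
FROM fact_orders;
```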
3) Referential integrity checks
Goal: ensure facts link to dimensions; prevent “unknown” categories from silently growing.
How: verify foreign keys exist in the referenced dimension (or that unknown handling is intentional and bounded).
```sql
-- Example: orphan fact rows
SELECT COUNT(*) AS orphan_cnt
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```
Rule examples:
- Orphan count must be 0 (or < a small threshold if late dimension updates are expected).
- Percentage of “unknown” dimension keys must be < 0.5%.
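To operationalize the 0.5% rule, the same join can report a rate instead of a raw count; this sketch reuses fact_sales and dim_customer from the example above.

```sql
-- Sketch: orphan rate as a percentage, to compare against the 0.5% threshold.
SELECT 100.0 * SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END)
         / NULLIF(COUNT(*), 0) AS orphan_pct
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_id = c.customer_id;
```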
4) Uniqueness checks (deduplication safety)
Goal: prevent double counting by enforcing uniqueness at the intended grain.
How: assert that the natural key (or surrogate key) is unique.
```sql
-- Example: duplicates by business key
SELECT order_id, COUNT(*) AS c
FROM fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```
Rule examples:
- order_id must be unique in the “current orders” table.
- order_id + line_id must be unique in the order lines table.
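When duplicates come from reprocessing or late-arriving versions of the same order, a common remediation is to keep only the latest row per business key. The sketch below assumes a staging table stg_orders with an updated_at column; both names are illustrative.

```sql
-- Sketch: keep the most recent row per order_id (stg_orders and updated_at are assumed names).
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM stg_orders o
) ranked
WHERE rn = 1;
```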
5) Accepted values and range checks
Goal: catch invalid codes, broken mappings, and type issues that still “load.”
How: enforce enumerations and numeric boundaries.
```sql
-- Example: accepted status values
SELECT status, COUNT(*) AS cnt
FROM fact_orders
GROUP BY status;
-- Validate status in ('created','paid','shipped','cancelled','refunded')
```
Rule examples:
- status must be in an approved set.
- quantity must be >= 0 and < 10,000.
- discount_pct must be between 0 and 1.
- order_total must be >= 0 (or allow negatives only for refunds, explicitly).
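The range rules can also be profiled in a single query. The sketch below uses the columns and thresholds from the rule examples and assumes they all live on fact_orders; the refund exception is modeled with an assumed 'refunded' status.

```sql
-- Sketch: one violation count per range rule on fact_orders.
SELECT
  SUM(CASE WHEN status NOT IN ('created','paid','shipped','cancelled','refunded') THEN 1 ELSE 0 END) AS bad_status,
  SUM(CASE WHEN quantity < 0 OR quantity >= 10000 THEN 1 ELSE 0 END)                                 AS bad_quantity,
  SUM(CASE WHEN discount_pct < 0 OR discount_pct > 1 THEN 1 ELSE 0 END)                              AS bad_discount_pct,
  SUM(CASE WHEN order_total < 0 AND status <> 'refunded' THEN 1 ELSE 0 END)                          AS bad_order_total
FROM fact_orders;
```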
6) Schema drift detection
Goal: detect upstream changes early and prevent silent nulls or broken transformations.
How: validate expected columns and types; alert on new/removed columns; enforce contracts for critical feeds.
Rule examples:
- Required columns must exist (fail pipeline if missing).
- Column types must match expected (e.g., order_date remains a date).
- New columns are allowed but must be logged for review (to avoid surprises).
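One way to sketch the required-columns rule is to compare the warehouse catalog against an expected list. information_schema.columns is available in PostgreSQL, Snowflake, and BigQuery, though catalog details differ, so treat the query below as an assumption to adapt.

```sql
-- Sketch: expected columns that are missing from fact_orders (fail the pipeline on any result).
WITH expected(column_name) AS (
  VALUES ('order_id'), ('created_at'), ('status'), ('order_total')
)
SELECT e.column_name AS missing_column
FROM expected e
LEFT JOIN information_schema.columns c
  ON c.table_name = 'fact_orders'
 AND c.column_name = e.column_name
WHERE c.column_name IS NULL;
```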
Step-by-Step: Building a Practical Testing Strategy
Step 1: Identify “critical data products”
List the tables and metrics that drive decisions or external reporting (e.g., revenue, active users, inventory). Prioritize tests there first.
Step 2: Define expectations in plain language
For each critical dataset, write 5–10 rules that a business stakeholder would agree with, then translate them into checks. Example for orders:
- Every order has an order_id, created_at, and status.
- Statuses are from a known list.
- Orders link to a customer (or are explicitly “guest”).
- Daily order volume should not drop to near zero without an incident.
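As an illustration of translating those statements into checks, the sketch below returns one violation count per rule, assuming the orders sit in fact_orders with a customer_id and an is_guest flag (the flag is an assumed column).

```sql
-- Sketch: one violation count per plain-language rule for orders.
SELECT 'missing required fields' AS rule, COUNT(*) AS violations
FROM fact_orders
WHERE order_id IS NULL OR created_at IS NULL OR status IS NULL
UNION ALL
SELECT 'status outside known list', COUNT(*)
FROM fact_orders
WHERE status NOT IN ('created','paid','shipped','cancelled','refunded')
UNION ALL
SELECT 'order with no customer and no guest flag', COUNT(*)
FROM fact_orders
WHERE customer_id IS NULL AND COALESCE(is_guest, FALSE) = FALSE;
```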
Step 3: Implement automated tests at the right stages
- Ingestion stage: schema, freshness, raw row counts.
- Transformation stage: uniqueness, referential integrity, null checks, accepted values.
- Semantic/reporting stage: metric-level checks, reconciliation totals, distribution checks.
Step 4: Decide what happens on failure
Not every failure should block publishing. Use severity levels:
- Blocker: publish is stopped (e.g., missing partition, broken schema, massive duplicates).
- Degraded: publish continues but alerts are sent (e.g., small orphan rate, minor freshness delay).
- Informational: logged for review (e.g., new column added upstream).
Step 5: Make tests maintainable
- Keep rules close to the dataset they validate.
- Version control test definitions with code changes.
- Review thresholds quarterly; adjust for seasonality and growth.
Reconciliation: Comparing Warehouse Numbers to Source Systems
Reconciliation is the practice of comparing key totals in the warehouse (or BI layer) to authoritative source systems to confirm that the pipeline is complete and logic is aligned. The goal is not to force exact equality in all cases; it is to explain differences and ensure they stay within expected bounds.
What to reconcile (common examples)
- Financial: gross sales, net sales, refunds, tax, shipping, payouts.
- Operational: order counts, shipment counts, tickets created/closed.
- Customer: new customers, active customers, churn events.
Expected differences (and how to explain them)
| Difference type | Why it happens | How to handle |
|---|---|---|
| Timing / cutoffs | Source uses local time; warehouse uses UTC; batch windows differ | Reconcile by consistent time zone and define “business day” cutoffs |
| Late-arriving updates | Orders updated after initial creation; events arrive late | Use a lookback window (e.g., recompute last 7 days) and track revisions |
| Returns and cancellations | Source may show gross at purchase time; warehouse may show net after returns | Reconcile gross and net separately; document which metric is used where |
| Data scope | Warehouse excludes test orders, internal users, or certain channels | Ensure filters match; maintain a shared exclusion list |
| Rounding and currency | Different rounding rules; FX rates applied at different times | Standardize rounding and FX logic; reconcile at both transaction and summary levels |
Step-by-step: Implement a reconciliation check
- Pick a control total: e.g., “yesterday’s gross sales” from the payments system.
- Align definitions: same time zone, same inclusion/exclusion rules, same currency.
- Compute both sides: source total vs warehouse total for the same period.
- Calculate variance: absolute difference and percent difference.
- Set tolerance: e.g., 0.5% or a fixed amount, plus a separate tolerance for known late-arriving adjustments.
- Alert and annotate: if variance exceeds tolerance, trigger an incident; if within tolerance, store the result for trend analysis.
```sql
-- Reconciliation result record: (date, source_total, wh_total, diff, diff_pct)
-- reconciliation_inputs is an illustrative table with one row per business date and both totals.
SELECT business_date, source_total, wh_total,
       wh_total - source_total AS diff,
       (wh_total - source_total) / NULLIF(source_total, 0) AS diff_pct
FROM reconciliation_inputs;
```
Reconciliation tips that prevent false alarms
- Reconcile multiple components (gross, refunds, cancellations) instead of only net.
- Use a rolling window (e.g., last 14 days) to detect drift and late postings.
- Track variance trends: a small but growing gap can signal a mapping issue.
Monitoring and Incident Response Basics
What to monitor
- Pipeline health: job success/failure, runtime, retries, backlog/lag.
- Freshness: last load time per critical table.
- Data quality signals: test pass rates, null spikes, duplicate rates, orphan rates.
- Metric stability: sudden changes in key KPIs beyond expected seasonality.
- Usage impact: which dashboards/teams depend on affected datasets.
Alerting thresholds: choose actionable signals
Alerts should be rare enough that people trust them, but sensitive enough to catch real issues early. Common patterns:
- Static thresholds: freshness > 2 hours late; orphan rate > 0.5%.
- Dynamic thresholds: row count deviates > 3 standard deviations from 30-day baseline.
- Multi-signal alerts: only page when freshness is late and row count is low (reduces noise).
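A sketch of the dynamic-threshold pattern: compare yesterday's row count to the mean and standard deviation of the preceding 30 days, reusing fact_events and event_date from earlier examples; AVG and STDDEV are standard aggregates, though date functions vary by warehouse.

```sql
-- Sketch: alert when yesterday's volume is more than 3 standard deviations from the 30-day baseline.
WITH daily AS (
  SELECT event_date, COUNT(*) AS cnt
  FROM fact_events
  WHERE event_date >= CURRENT_DATE - INTERVAL '31 days'
    AND event_date < CURRENT_DATE
  GROUP BY event_date
),
baseline AS (
  SELECT AVG(cnt) AS mean_cnt, STDDEV(cnt) AS sd_cnt
  FROM daily
  WHERE event_date < CURRENT_DATE - INTERVAL '1 day'
),
yesterday AS (
  SELECT cnt
  FROM daily
  WHERE event_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT y.cnt AS yesterday_cnt, b.mean_cnt, b.sd_cnt,
       CASE WHEN ABS(y.cnt - b.mean_cnt) > 3 * b.sd_cnt THEN 'alert' ELSE 'ok' END AS volume_signal
FROM yesterday y
CROSS JOIN baseline b;
```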
Triage steps (a practical runbook)
- Confirm the symptom: which test/monitor fired, which dataset, which time window.
- Assess blast radius: what dashboards, metrics, and stakeholders are affected.
- Check recent changes: deployments, upstream schema changes, new filters, backfills.
- Identify failure mode: missing load vs duplicates vs late-arriving vs schema drift.
- Mitigate quickly: pause publishing, roll back, apply a hotfix, or mark data as delayed.
- Validate the fix: rerun tests, re-reconcile control totals, confirm dashboards.
- Document: root cause, detection gap, prevention action (new test/threshold/runbook update).
Communicating data incidents without breaking trust
Trust is maintained by being fast, clear, and consistent—without overpromising. A good incident message includes:
- What happened: “Orders dashboard is showing undercounted revenue for 2026-01-15.”
- Impact: which metrics/reports are affected and which are not.
- Time window: when the issue started and the periods impacted.
- Workaround: use a source report, last known good snapshot, or exclude the affected date.
- Next update time: when stakeholders will hear back (even if not resolved).
Use careful language that distinguishes data delay from data incorrectness. For example:
- Delay: “Data for today is not yet complete; numbers will update after 08:00 UTC.”
- Incorrect: “A deduplication bug inflated orders by ~3% for the last 2 days; we are correcting and will backfill.”
After resolution, share a brief post-incident note focused on prevention: what guardrail will stop this from recurring (new uniqueness test, stronger freshness alert, schema contract, improved reconciliation tolerance rules).