What “BI architecture” means in practice
A BI architecture is the repeatable way your organization moves data from where it is created (systems of record) to where it is analyzed (reports, dashboards, ad hoc queries). In this chapter, you’ll see common patterns that organizations adopt as they grow. Each pattern answers the same questions:
- Data sources: Which systems produce the data?
- Integration method: How does data move (manual export, connectors, replication, pipelines)?
- Where transformations live: In spreadsheets, SQL views, ETL/ELT jobs, or a semantic/metrics layer?
- How dashboards connect: Directly to sources, to a replica, to a warehouse, or via a semantic layer?
- Trade-offs: Cost, complexity, performance, and governance.
Pattern 1: Spreadsheet-based reporting (the “starter” setup)
When this pattern fits
Early-stage teams with low data volume, a small number of stakeholders, and reporting needs that change frequently. Often used when data lives in SaaS tools and there is no central database.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | CRM, billing, ad platforms, support tool, CSV exports |
| Integration method | Manual exports, copy/paste, scheduled CSV email, lightweight connectors |
| Transformations live | Spreadsheet formulas, pivot tables, manual cleanup steps |
| Dashboards connect | Charts inside spreadsheets, or a simple BI tool connected to the sheet |
| Storage | Files (spreadsheets) as the “data store” |
Practical step-by-step: build a repeatable spreadsheet report
- Define a stable input area: Create a tab named `raw_import` where exports are pasted exactly as-is (no edits).
- Normalize columns: In a second tab (`staging`), standardize date formats, currency, and IDs using formulas. Keep transformations in columns, not manual edits.
- Create a "model" tab: Build pivot tables or summary tables (e.g., revenue by month, pipeline by stage).
- Lock the process: Document the export steps (which filters, which date range) in a `README` tab.
- Version control lightly: Duplicate the file monthly or store exports in a folder so you can reproduce numbers later.
Trade-offs
- Cost: Very low (tools you already have).
- Complexity: Low initially, but grows quickly as logic accumulates.
- Performance: Degrades with large sheets; refresh is manual or fragile.
- Governance: Weak. Hard to audit changes; multiple “versions of truth” appear.
Common failure signals
- Two people maintain two different spreadsheets for the same KPI.
- Monthly close requires hours of manual cleanup.
- Stakeholders ask for drill-down and you can’t trace a number back reliably.
Pattern 2: Direct queries to a single database (simple “live” BI)
When this pattern fits
You have a primary database (often the application database) and want dashboards quickly. Data volume is moderate and query load is predictable.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | One operational database (e.g., Postgres/MySQL) plus maybe one SaaS export |
| Integration method | BI tool connects directly via read-only credentials |
| Transformations live | SQL views in the database, or BI tool calculated fields |
| Dashboards connect | Live connection to the database |
| Refresh | Real-time or near real-time (depends on caching) |
Practical step-by-step: make direct-to-DB BI safer
- Create a read-only role: Grant access only to required schemas/tables. Avoid superuser credentials.
- Use a reporting schema: Put BI-facing views in a dedicated schema like
reportingto separate them from application tables. - Start with views, not ad hoc joins: Encapsulate common joins and filters in views so dashboards reuse logic.
- Add guardrails: Limit query timeouts, set concurrency limits, and enable caching in the BI tool if available.
- Monitor impact: Track slow queries and peak usage; ensure BI traffic doesn’t degrade the app.
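To make these guardrails concrete, here is a minimal sketch in Postgres syntax (the chapter mentions Postgres/MySQL; MySQL equivalents differ slightly). The role, schema, table, and column names (`bi_reader`, `reporting`, `app.orders`) are illustrative assumptions, not prescribed names:

```sql
-- A read-only BI role with a connection cap. Credentials are placeholders.
CREATE ROLE bi_reader WITH LOGIN PASSWORD 'change-me' CONNECTION LIMIT 10;

-- Keep BI-facing views in their own schema, separate from application tables.
CREATE SCHEMA reporting;
GRANT USAGE ON SCHEMA reporting TO bi_reader;

-- Encapsulate the common join/filter logic once so every dashboard reuses it.
CREATE VIEW reporting.monthly_revenue AS
SELECT date_trunc('month', o.created_at) AS month,
       sum(o.amount)                     AS revenue
FROM app.orders AS o
WHERE o.status = 'paid'
GROUP BY 1;

-- The role can read reporting views only; it never touches app tables directly.
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO bi_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO bi_reader;

-- Guardrail: cap how long any query from this role may run.
ALTER ROLE bi_reader SET statement_timeout = '30s';
```

The `statement_timeout` is one server-side guardrail; your BI tool may add its own query timeouts and caching on top.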
Trade-offs
- Cost: Low (no extra infrastructure beyond the BI tool).
- Complexity: Low to medium; logic can sprawl across many views and dashboards.
- Performance: Risky. Analytical queries can be heavy and compete with production workload.
- Governance: Medium. Views help, but definitions can still diverge across dashboards.
Common failure signals
- Production database CPU spikes during business hours due to dashboards.
- Teams avoid asking questions because “it might slow the app.”
- Schema changes break dashboards frequently.
Pattern 3: Dedicated reporting replica (protect production while staying “live-ish”)
When this pattern fits
You still want near-real-time dashboards, but you need to isolate analytics queries from production. This is a common next step when direct queries start causing performance incidents.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | Primary operational DB |
| Integration method | Database replication (read replica) or log-based replication |
| Transformations live | SQL views on the replica, or separate reporting tables built on the replica |
| Dashboards connect | BI tool connects to the replica (read-only) |
| Latency | Seconds to minutes (depends on replication) |
Practical step-by-step: implement a reporting replica pattern
- Provision a replica: Create a read replica with enough CPU/RAM for BI queries.
- Route BI traffic: Point the BI tool connection to the replica endpoint.
- Separate workloads: If possible, use a separate replica for BI vs. other read workloads (e.g., app read scaling).
- Define “reporting views”: Create stable views/tables for dashboards; avoid querying raw transactional tables directly.
- Plan for replication lag: Add a "data as of" timestamp tile on dashboards to set expectations (a freshness-view sketch follows this list).
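One way to back that "data as of" tile, assuming Postgres streaming replication, is a small view on the replica that exposes the replay timestamp; the view name is an illustrative assumption:

```sql
-- Sketch for a Postgres read replica: expose how fresh the data is so
-- dashboards can show a "data as of" tile. (Returns NULL on a primary.)
CREATE VIEW reporting.data_freshness AS
SELECT now()                                   AS queried_at,
       pg_last_xact_replay_timestamp()         AS data_as_of,
       now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

A dashboard tile that selects from this view sets expectations about lag without any pipeline changes.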
Trade-offs
- Cost: Medium (extra database instance).
- Complexity: Medium (replication, failover, lag monitoring).
- Performance: Better isolation; still limited by OLTP-friendly schema and indexing.
- Governance: Medium. Still easy for teams to create inconsistent definitions across dashboards.
Common failure signals
- Replica becomes a “second production” with many competing consumers.
- Complex analytics (cohorts, long time windows) remain slow because the schema is designed for transactions, not analysis.
- Need to combine multiple sources beyond the primary DB.
Pattern 4: Warehouse-centered approach with ETL/ELT pipelines (scalable and multi-source)
When this pattern fits
You have multiple data sources, growing data volume, and a need for consistent definitions across teams. This pattern is often the turning point from “reporting” to “analytics platform.”
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | Operational DB(s), SaaS apps, event data, files |
| Integration method | Scheduled ingestion via ETL/ELT pipelines (batch or micro-batch) |
| Transformations live | In the warehouse (ELT) or in ETL jobs before load; often layered (staging → curated) |
| Dashboards connect | BI tool connects to curated warehouse tables/views (and sometimes marts) |
| Refresh | Scheduled (e.g., hourly/daily) or near-real-time for key tables |
Practical step-by-step: design a warehouse-centered flow
- List sources and refresh needs: For each source, define latency requirements (e.g., finance daily, product usage hourly).
- Ingest raw data consistently: Land data into a `raw` or `staging` area with minimal changes. Keep metadata like load time and source (a warehouse SQL sketch follows this list).
- Build curated tables: Create cleaned, conformed tables for common analysis (customers, orders, subscriptions, campaigns).
- Publish BI-ready datasets: Expose only curated tables/views to the BI tool; restrict raw access for most users.
- Set data quality checks: Add checks for row counts, null rates, and key uniqueness before publishing.
- Document ownership: Assign owners for critical datasets and define change processes.
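Here is a hedged sketch of the staging-to-curated flow in plain SQL; schema and column names (`raw.orders`, `staging`, `curated`) are invented for illustration, and a real pipeline would wrap these statements in scheduled, monitored jobs:

```sql
-- Land raw data with load metadata, changing as little as possible.
CREATE TABLE staging.orders AS
SELECT o.*,
       current_timestamp AS _loaded_at,
       'app_db'          AS _source
FROM raw.orders AS o;

-- Publish a cleaned, conformed table for BI; only this layer is exposed.
CREATE TABLE curated.orders AS
SELECT order_id,
       customer_id,
       CAST(amount AS NUMERIC(12, 2)) AS amount,
       CAST(created_at AS DATE)       AS order_date
FROM staging.orders
WHERE order_id IS NOT NULL;

-- Quality checks before publishing: each query should return zero rows,
-- and the pipeline should halt if either does not.
SELECT order_id                 -- key uniqueness
FROM curated.orders
GROUP BY order_id
HAVING count(*) > 1;

SELECT order_id                 -- null guard on a required column
FROM curated.orders
WHERE customer_id IS NULL;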
Trade-offs
- Cost: Medium to high (warehouse + pipelines + storage).
- Complexity: Medium to high (pipelines, scheduling, backfills, monitoring).
- Performance: Strong for analytics; designed for large scans and aggregations.
- Governance: Stronger. Centralized datasets reduce metric drift; access control is clearer.
Common failure signals
- Transformation logic is duplicated across many dashboards because curated tables are missing.
- Backfills are painful and break reports.
- Teams can’t agree on KPI definitions because there is no single published dataset.
Pattern 5: Modern cloud BI stack (separate storage/compute, orchestration, semantic/metrics layers)
When this pattern fits
You need high concurrency, multiple workloads (BI, data science, reverse ETL), strict governance, and the ability to scale without re-platforming. This pattern is common for fast-growing companies and regulated environments.
Architecture overview
| Layer | What it does | Typical components |
|---|---|---|
| Sources | Generate data | Apps/DBs, SaaS, event streams, files |
| Ingestion | Move data into analytics storage | Connectors, CDC, batch loads |
| Storage | Durable, cheap storage for raw/curated data | Cloud object storage + warehouse storage |
| Compute | Run queries and transformations | Elastic query engines/warehouses; separate compute clusters |
| Orchestration | Schedule and monitor pipelines | Workflow scheduler, retries, alerts, lineage hooks |
| Transformations | Build curated models | SQL-based transformations, incremental models, tests |
| Semantic/metrics layer | Central definitions, governed metrics, consistent dimensions | Metrics definitions, access rules, reusable entities |
| Consumption | Dashboards and self-serve analysis | BI tools, notebooks, APIs |
How dashboards connect in this pattern
- Preferred: BI tool queries the warehouse through a semantic/metrics layer so "Revenue" and "Active Customer" are defined once and reused everywhere (sketched below).
- Also common: BI tool connects directly to curated tables, with the semantic layer used for critical KPIs only.
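Semantic/metrics layers vary by tool, and many define metrics in configuration files rather than SQL. As a rough approximation of the "define once, reuse everywhere" idea, a governed view can stand in; the names below (`metrics.revenue`, a `curated.orders` table with `status`, `currency`, and `created_at` columns) are assumptions for illustration:

```sql
-- Stand-in for a metrics-layer definition: "Revenue" is filtered and
-- grouped one way, in one place, rather than re-implemented per chart.
CREATE VIEW metrics.revenue AS
SELECT CAST(created_at AS DATE) AS order_date,
       currency,
       sum(amount)              AS revenue
FROM curated.orders
WHERE status = 'paid'            -- the agreed business rule lives here, once
GROUP BY CAST(created_at AS DATE), currency;
```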
Practical step-by-step: implement the modern stack incrementally
- Start by separating workloads: Ensure analytics queries run on dedicated compute (or a warehouse designed for concurrency), not on production systems.
- Add orchestration: Put ingestion and transformation jobs under a scheduler with retries, alerts, and run logs.
- Standardize transformation layers: Adopt a clear structure such as `raw` → `staging` → `curated` → `marts`. Enforce naming conventions.
- Introduce a semantic/metrics layer for top KPIs: Choose 5–15 business-critical metrics and define them centrally (filters, time logic, currency rules, access constraints).
- Implement governance controls: Role-based access, row-level security where needed, and audit logs for sensitive datasets (see the sketch after this list).
- Optimize for scale: Use incremental models, partitioning/clustering, and separate compute resources for heavy transformations vs. BI queries.
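For the row-level security step, here is a minimal Postgres-style sketch; cloud warehouses offer comparable row-access policies under different syntax. The table, `region` column, and setting name are illustrative assumptions:

```sql
-- Restrict each analyst session to its own region's rows.
ALTER TABLE curated.orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_isolation ON curated.orders
  FOR SELECT
  USING (region = current_setting('app.current_region', true));

-- Each BI connection sets its region before querying, for example:
-- SET app.current_region = 'emea';
```

The second argument to `current_setting` makes the lookup return NULL when the setting is absent, so unconfigured sessions see no rows rather than an error.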
Trade-offs
- Cost: Higher baseline, but can be cost-efficient at scale due to elastic compute and cheaper storage.
- Complexity: Higher. More components require operational discipline (monitoring, incident response, change management).
- Performance: Excellent when designed well: high concurrency, faster queries, predictable SLAs.
- Governance: Strongest. Centralized metrics, access controls, lineage, and auditability are easier to enforce.
Choosing the right architecture: a decision guide
Key selection factors
| Factor | What to measure | Why it matters |
|---|---|---|
| Data volume | Rows/day, total table sizes, event frequency | Large scans and long time windows push you toward a warehouse and curated models |
| Users & concurrency | Number of dashboard viewers, peak simultaneous usage | High concurrency needs isolation (replica/warehouse) and caching/elastic compute |
| Source count | How many systems feed reporting | Multiple sources increase the need for pipelines and a central analytics store |
| Compliance & sensitivity | PII/PHI presence, audit requirements, access controls | Stronger governance patterns reduce risk (central access, logs, row-level security) |
| Expected growth | Headcount, product usage growth, new markets | Frequent schema changes and scaling needs favor modular pipelines and semantic layers |
Rule-of-thumb mapping
- Spreadsheet-based: Good for <5 regular stakeholders, low compliance needs, and mostly manual monthly/weekly reporting.
- Direct-to-single-DB: Good for small teams needing fast dashboards, with careful guardrails and low risk to production performance.
- Reporting replica: Good when production impact becomes unacceptable but requirements are still mostly single-source and near-real-time.
- Warehouse-centered: Good when you have multiple sources, need consistent datasets, and want scalable performance for analytics queries.
- Modern cloud stack: Good when you need high concurrency, strong governance, multiple workloads, and predictable scaling.
Practical checklist: pick your next step (not your final state)
- If dashboards slow production: move from direct-to-DB to a replica immediately, then plan a warehouse.
- If you reconcile numbers across tools: move to a warehouse-centered approach with curated tables.
- If KPI definitions drift across teams: introduce a semantic/metrics layer for critical metrics first.
- If compliance requirements increase: centralize access in the warehouse, implement role-based controls, and restrict raw data exposure.
- If growth is expected: invest early in orchestration, testing, and incremental transformations to avoid brittle pipelines.