What “BI architecture” means in practice
A BI architecture is the repeatable way your organization moves data from where it is created (systems of record) to where it is analyzed (reports, dashboards, ad hoc queries). In this chapter, you’ll see common patterns that organizations adopt as they grow. Each pattern answers the same questions:
- Data sources: Which systems produce the data?
- Integration method: How does data move (manual export, connectors, replication, pipelines)?
- Where transformations live: In spreadsheets, SQL views, ETL/ELT jobs, or a semantic/metrics layer?
- How dashboards connect: Directly to sources, to a replica, to a warehouse, or via a semantic layer?
- Trade-offs: Cost, complexity, performance, and governance.
Pattern 1: Spreadsheet-based reporting (the “starter” setup)
When this pattern fits
Early-stage teams with low data volume, a small number of stakeholders, and reporting needs that change frequently. Often used when data lives in SaaS tools and there is no central database.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | CRM, billing, ad platforms, support tool, CSV exports |
| Integration method | Manual exports, copy/paste, scheduled CSV email, lightweight connectors |
| Transformations live | Spreadsheet formulas, pivot tables, manual cleanup steps |
| Dashboards connect | Charts inside spreadsheets, or a simple BI tool connected to the sheet |
| Storage | Files (spreadsheets) as the “data store” |
Practical step-by-step: build a repeatable spreadsheet report
- Define a stable input area: Create a tab named `raw_import` where exports are pasted exactly as-is (no edits).
- Normalize columns: In a second tab (`staging`), standardize date formats, currency, and IDs using formulas. Keep transformations in columns, not manual edits.
- Create a "model" tab: Build pivot tables or summary tables (e.g., revenue by month, pipeline by stage).
- Lock the process: Document the export steps (which filters, which date range) in a `README` tab.
- Version control lightly: Duplicate the file monthly or store exports in a folder so you can reproduce numbers later.
Trade-offs
- Cost: Very low (tools you already have).
- Complexity: Low initially, but grows quickly as logic accumulates.
- Performance: Degrades with large sheets; refresh is manual or fragile.
- Governance: Weak. Hard to audit changes; multiple “versions of truth” appear.
Common failure signals
- Two people maintain two different spreadsheets for the same KPI.
- Monthly close requires hours of manual cleanup.
- Stakeholders ask for drill-down and you can’t trace a number back reliably.
Pattern 2: Direct queries to a single database (simple “live” BI)
When this pattern fits
You have a primary database (often the application database) and want dashboards quickly. Data volume is moderate and query load is predictable.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | One operational database (e.g., Postgres/MySQL) plus maybe one SaaS export |
| Integration method | BI tool connects directly via read-only credentials |
| Transformations live | SQL views in the database, or BI tool calculated fields |
| Dashboards connect | Live connection to the database |
| Refresh | Real-time or near real-time (depends on caching) |
Practical step-by-step: make direct-to-DB BI safer
- Create a read-only role: Grant access only to required schemas/tables. Avoid superuser credentials.
- Use a reporting schema: Put BI-facing views in a dedicated schema like
reportingto separate them from application tables. - Start with views, not ad hoc joins: Encapsulate common joins and filters in views so dashboards reuse logic.
- Add guardrails: Limit query timeouts, set concurrency limits, and enable caching in the BI tool if available.
- Monitor impact: Track slow queries and peak usage; ensure BI traffic doesn’t degrade the app.
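To make these guardrails concrete, here is a minimal sketch in Postgres syntax (the chapter mentions Postgres/MySQL; MySQL equivalents differ slightly). The role, schema, table, and column names (`bi_reader`, `reporting`, `app.orders`) are illustrative assumptions, not prescribed names:

```sql
-- A read-only BI role with a connection cap. Credentials are placeholders.
CREATE ROLE bi_reader WITH LOGIN PASSWORD 'change-me' CONNECTION LIMIT 10;

-- Keep BI-facing views in their own schema, separate from application tables.
CREATE SCHEMA reporting;
GRANT USAGE ON SCHEMA reporting TO bi_reader;

-- Encapsulate the common join/filter logic once so every dashboard reuses it.
CREATE VIEW reporting.monthly_revenue AS
SELECT date_trunc('month', o.created_at) AS month,
       sum(o.amount)                     AS revenue
FROM app.orders AS o
WHERE o.status = 'paid'
GROUP BY 1;

-- The role can read reporting views only; it never touches app tables directly.
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO bi_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO bi_reader;

-- Guardrail: cap how long any query from this role may run.
ALTER ROLE bi_reader SET statement_timeout = '30s';
```

The `statement_timeout` is one server-side guardrail; your BI tool may add its own query timeouts and caching on top.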
Trade-offs
- Cost: Low (no extra infrastructure beyond the BI tool).
- Complexity: Low to medium; logic can sprawl across many views and dashboards.
- Performance: Risky. Analytical queries can be heavy and compete with production workload.
- Governance: Medium. Views help, but definitions can still diverge across dashboards.
Common failure signals
- Production database CPU spikes during business hours due to dashboards.
- Teams avoid asking questions because “it might slow the app.”
- Schema changes break dashboards frequently.
Pattern 3: Dedicated reporting replica (protect production while staying “live-ish”)
When this pattern fits
You still want near-real-time dashboards, but you need to isolate analytics queries from production. This is a common next step when direct queries start causing performance incidents.
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | Primary operational DB |
| Integration method | Database replication (read replica) or log-based replication |
| Transformations live | SQL views on the replica, or separate reporting tables built on the replica |
| Dashboards connect | BI tool connects to the replica (read-only) |
| Latency | Seconds to minutes (depends on replication) |
Practical step-by-step: implement a reporting replica pattern
- Provision a replica: Create a read replica with enough CPU/RAM for BI queries.
- Route BI traffic: Point the BI tool connection to the replica endpoint.
- Separate workloads: If possible, use a separate replica for BI vs. other read workloads (e.g., app read scaling).
- Define “reporting views”: Create stable views/tables for dashboards; avoid querying raw transactional tables directly.
- Plan for replication lag: Add a "data as of" timestamp tile on dashboards to set expectations (a freshness-view sketch follows this list).
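One way to back that "data as of" tile, assuming Postgres streaming replication, is a small view on the replica that exposes the replay timestamp; the view name is an illustrative assumption:

```sql
-- Sketch for a Postgres read replica: expose how fresh the data is so
-- dashboards can show a "data as of" tile. (Returns NULL on a primary.)
CREATE VIEW reporting.data_freshness AS
SELECT now()                                   AS queried_at,
       pg_last_xact_replay_timestamp()         AS data_as_of,
       now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

A dashboard tile that selects from this view sets expectations about lag without any pipeline changes.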
Trade-offs
- Cost: Medium (extra database instance).
- Complexity: Medium (replication, failover, lag monitoring).
- Performance: Better isolation; still limited by OLTP-friendly schema and indexing.
- Governance: Medium. Still easy for teams to create inconsistent definitions across dashboards.
Common failure signals
- Replica becomes a “second production” with many competing consumers.
- Complex analytics (cohorts, long time windows) remain slow because the schema is designed for transactions, not analysis.
- Need to combine multiple sources beyond the primary DB.
Pattern 4: Warehouse-centered approach with ETL/ELT pipelines (scalable and multi-source)
When this pattern fits
You have multiple data sources, growing data volume, and a need for consistent definitions across teams. This pattern is often the turning point from “reporting” to “analytics platform.”
Architecture overview
| Element | Typical choice |
|---|---|
| Data sources | Operational DB(s), SaaS apps, event data, files |
| Integration method | Scheduled ingestion via ETL/ELT pipelines (batch or micro-batch) |
| Transformations live | In the warehouse (ELT) or in ETL jobs before load; often layered (staging → curated) |
| Dashboards connect | BI tool connects to curated warehouse tables/views (and sometimes marts) |
| Refresh | Scheduled (e.g., hourly/daily) or near-real-time for key tables |
Practical step-by-step: design a warehouse-centered flow
- List sources and refresh needs: For each source, define latency requirements (e.g., finance daily, product usage hourly).
- Ingest raw data consistently: Land data into a `raw` or `staging` area with minimal changes. Keep metadata like load time and source (a warehouse SQL sketch follows this list).
- Build curated tables: Create cleaned, conformed tables for common analysis (customers, orders, subscriptions, campaigns).
- Publish BI-ready datasets: Expose only curated tables/views to the BI tool; restrict raw access for most users.
- Set data quality checks: Add checks for row counts, null rates, and key uniqueness before publishing.
- Document ownership: Assign owners for critical datasets and define change processes.
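Here is a hedged sketch of the staging-to-curated flow in plain SQL; schema and column names (`raw.orders`, `staging`, `curated`) are invented for illustration, and a real pipeline would wrap these statements in scheduled, monitored jobs:

```sql
-- Land raw data with load metadata, changing as little as possible.
CREATE TABLE staging.orders AS
SELECT o.*,
       current_timestamp AS _loaded_at,
       'app_db'          AS _source
FROM raw.orders AS o;

-- Publish a cleaned, conformed table for BI; only this layer is exposed.
CREATE TABLE curated.orders AS
SELECT order_id,
       customer_id,
       CAST(amount AS NUMERIC(12, 2)) AS amount,
       CAST(created_at AS DATE)       AS order_date
FROM staging.orders
WHERE order_id IS NOT NULL;

-- Quality checks before publishing: each query should return zero rows,
-- and the pipeline should halt if either does not.
SELECT order_id                 -- key uniqueness
FROM curated.orders
GROUP BY order_id
HAVING count(*) > 1;

SELECT order_id                 -- null guard on a required column
FROM curated.orders
WHERE customer_id IS NULL;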
Trade-offs
- Cost: Medium to high (warehouse + pipelines + storage).
- Complexity: Medium to high (pipelines, scheduling, backfills, monitoring).
- Performance: Strong for analytics; designed for large scans and aggregations.
- Governance: Stronger. Centralized datasets reduce metric drift; access control is clearer.
Common failure signals
- Transformation logic is duplicated across many dashboards because curated tables are missing.
- Backfills are painful and break reports.
- Teams can’t agree on KPI definitions because there is no single published dataset.
Pattern 5: Modern cloud BI stack (separate storage/compute, orchestration, semantic/metrics layers)
When this pattern fits
You need high concurrency, multiple workloads (BI, data science, reverse ETL), strict governance, and the ability to scale without re-platforming. This pattern is common for fast-growing companies and regulated environments.
Architecture overview
| Layer | What it does | Typical components |
|---|---|---|
| Sources | Generate data | Apps/DBs, SaaS, event streams, files |
| Ingestion | Move data into analytics storage | Connectors, CDC, batch loads |
| Storage | Durable, cheap storage for raw/curated data | Cloud object storage + warehouse storage |
| Compute | Run queries and transformations | Elastic query engines/warehouses; separate compute clusters |
| Orchestration | Schedule and monitor pipelines | Workflow scheduler, retries, alerts, lineage hooks |
| Transformations | Build curated models | SQL-based transformations, incremental models, tests |
| Semantic/metrics layer | Central definitions, governed metrics, consistent dimensions | Metrics definitions, access rules, reusable entities |
| Consumption | Dashboards and self-serve analysis | BI tools, notebooks, APIs |
How dashboards connect in this pattern
- Preferred: BI tool queries the warehouse through a semantic/metrics layer so "Revenue" and "Active Customer" are defined once and reused everywhere (sketched below).
- Also common: BI tool connects directly to curated tables, with the semantic layer used for critical KPIs only.
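Semantic/metrics layers vary by tool, and many define metrics in configuration files rather than SQL. As a rough approximation of the "define once, reuse everywhere" idea, a governed view can stand in; the names below (`metrics.revenue`, a `curated.orders` table with `status`, `currency`, and `created_at` columns) are assumptions for illustration:

```sql
-- Stand-in for a metrics-layer definition: "Revenue" is filtered and
-- grouped one way, in one place, rather than re-implemented per chart.
CREATE VIEW metrics.revenue AS
SELECT CAST(created_at AS DATE) AS order_date,
       currency,
       sum(amount)              AS revenue
FROM curated.orders
WHERE status = 'paid'            -- the agreed business rule lives here, once
GROUP BY CAST(created_at AS DATE), currency;
```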
Practical step-by-step: implement the modern stack incrementally
- Start by separating workloads: Ensure analytics queries run on dedicated compute (or a warehouse designed for concurrency), not on production systems.
- Add orchestration: Put ingestion and transformation jobs under a scheduler with retries, alerts, and run logs.
- Standardize transformation layers: Adopt a clear structure such as `raw` → `staging` → `curated` → `marts`. Enforce naming conventions.
- Introduce a semantic/metrics layer for top KPIs: Choose 5–15 business-critical metrics and define them centrally (filters, time logic, currency rules, access constraints).
- Implement governance controls: Role-based access, row-level security where needed, and audit logs for sensitive datasets (see the sketch after this list).
- Optimize for scale: Use incremental models, partitioning/clustering, and separate compute resources for heavy transformations vs. BI queries.
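For the row-level security step, here is a minimal Postgres-style sketch; cloud warehouses offer comparable row-access policies under different syntax. The table, `region` column, and setting name are illustrative assumptions:

```sql
-- Restrict each analyst session to its own region's rows.
ALTER TABLE curated.orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_isolation ON curated.orders
  FOR SELECT
  USING (region = current_setting('app.current_region', true));

-- Each BI connection sets its region before querying, for example:
-- SET app.current_region = 'emea';
```

The second argument to `current_setting` makes the lookup return NULL when the setting is absent, so unconfigured sessions see no rows rather than an error.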
Trade-offs
- Cost: Higher baseline, but can be cost-efficient at scale due to elastic compute and cheaper storage.
- Complexity: Higher. More components require operational discipline (monitoring, incident response, change management).
- Performance: Excellent when designed well: high concurrency, faster queries, predictable SLAs.
- Governance: Strongest. Centralized metrics, access controls, lineage, and auditability are easier to enforce.
Choosing the right architecture: a decision guide
Key selection factors
| Factor | What to measure | Why it matters |
|---|---|---|
| Data volume | Rows/day, total table sizes, event frequency | Large scans and long time windows push you toward a warehouse and curated models |
| Users & concurrency | Number of dashboard viewers, peak simultaneous usage | High concurrency needs isolation (replica/warehouse) and caching/elastic compute |
| Source count | How many systems feed reporting | Multiple sources increase the need for pipelines and a central analytics store |
| Compliance & sensitivity | PII/PHI presence, audit requirements, access controls | Stronger governance patterns reduce risk (central access, logs, row-level security) |
| Expected growth | Headcount, product usage growth, new markets | Frequent schema changes and scaling needs favor modular pipelines and semantic layers |
Rule-of-thumb mapping
- Spreadsheet-based: Good for <5 regular stakeholders, low compliance needs, and mostly manual monthly/weekly reporting.
- Direct-to-single-DB: Good for small teams needing fast dashboards, with careful guardrails and low risk to production performance.
- Reporting replica: Good when production impact becomes unacceptable but requirements are still mostly single-source and near-real-time.
- Warehouse-centered: Good when you have multiple sources, need consistent datasets, and want scalable performance for analytics queries.
- Modern cloud stack: Good when you need high concurrency, strong governance, multiple workloads, and predictable scaling.
Practical checklist: pick your next step (not your final state)
- If dashboards slow production: move from direct-to-DB to a replica immediately, then plan a warehouse.
- If you reconcile numbers across tools: move to a warehouse-centered approach with curated tables.
- If KPI definitions drift across teams: introduce a semantic/metrics layer for critical metrics first.
- If compliance requirements increase: centralize access in the warehouse, implement role-based controls, and restrict raw data exposure.
- If growth is expected: invest early in orchestration, testing, and incremental transformations to avoid brittle pipelines.