1) Validate the Draft Schema with Core Use Cases
A schema is “good” when it supports the work the system must do with minimal friction: queries are straightforward, constraints match expectations, and common reports don’t require fragile workarounds. Validation means translating requirements into concrete tasks (queries, screens, reports) and checking that each task maps cleanly to tables, keys, and relationships.
Derive core tasks from requirements
Start with the requirement statements and extract verbs and outputs. Turn them into a short list of “core use cases” that represent daily operations and decision-making reports. Keep them specific enough that you can write a query for each.
- Operational queries: “Create an order,” “List open tickets,” “Assign a case to an agent.”
- Lookups: “Find customer by email,” “Show product availability by warehouse.”
- Reports: “Revenue by month,” “Top 10 products,” “SLA breaches by team.”
- Audits: “Who changed the status and when?” (if required)
Confirm each use case has a clean path through the schema
For each use case, answer these validation questions:
- What tables are involved? Can you name them immediately?
- What are the join paths? Are joins direct and stable (via keys), or do you need to join on text fields?
- What filters are needed? Do you have the right columns to filter without parsing strings?
- What grouping is needed? Do you have clear facts to aggregate (amounts, counts) and clear dimensions (dates, categories)?
- What is the “grain”? Are you aggregating at the right level (order vs. order line)?
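The grain question is easy to check empirically. Below is a minimal sketch using Python's built-in SQLite module (table and column names are illustrative, matching the examples later in this section). It shows why the wrong grain matters: joining a payment to its order lines repeats the payment once per line, so the sum double-counts.

```python
import sqlite3

# Illustrative schema: one order with two lines and a single payment.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_lines (
    order_id INTEGER REFERENCES orders(order_id),
    quantity INTEGER, unit_price REAL);
CREATE TABLE payments (
    order_id INTEGER REFERENCES orders(order_id), amount REAL);
INSERT INTO orders VALUES (1);
INSERT INTO order_lines VALUES (1, 2, 10.0), (1, 1, 30.0);  -- order total 50
INSERT INTO payments VALUES (1, 50.0);                      -- one payment of 50
""")

# Wrong grain: the single payment row joins to both line rows,
# so it is counted once per line.
wrong = con.execute("""
    SELECT SUM(p.amount) FROM payments p
    JOIN order_lines ol ON ol.order_id = p.order_id
""").fetchone()[0]

# Right grain: payments are already at the order level; aggregate them directly.
right = con.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

print(wrong, right)  # 100.0 50.0 -- the wrong grain double-counted
```

If the two numbers differ, the fact you are summing lives at a different grain than the table you joined through.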
Practical step-by-step: validate with a query checklist
Use this repeatable workflow to validate quickly:
- Step 1: Write the query in plain English. Example: “For each customer, show total paid amount in the last 30 days.”
- Step 2: Identify the fact table. Where is the measurable event stored (e.g., payments, order_lines)?
- Step 3: Identify dimensions. Customer, date, status, product, etc.
- Step 4: Draft the SQL joins. Ensure every join uses keys, not names.
- Step 5: Check missing columns. If you need “paid_at” but only have “created_at,” the schema may not support the report.
- Step 6: Check ambiguity. If “customer email” exists in multiple tables, you risk inconsistency.
Example validation SQL (illustrative):
```sql
-- Revenue by month (paid orders only) with a clean join path
-- and clear grain (order_line as fact)
SELECT
    date_trunc('month', p.paid_at) AS month,
    SUM(ol.quantity * ol.unit_price) AS revenue
FROM payments p
JOIN orders o ON o.order_id = p.order_id
JOIN order_lines ol ON ol.order_id = o.order_id
WHERE p.status = 'PAID'
GROUP BY 1
ORDER BY 1;
```
If this query requires joining through a “notes” table, parsing a combined “address” string, or guessing which timestamp represents “paid,” that’s a sign the draft schema doesn’t match the use case cleanly.
2) Stress-Test Edge Cases and Constraint Behavior
Real systems break on edge cases: optional relationships, deletions, and duplicate records are where design decisions become visible. Validation here means predicting what should happen, then confirming the schema enforces it (or intentionally allows it).
Edge case A: optional relationships
Optional relationships often appear as nullable foreign keys. The validation question is: “When the relationship is missing, can the system still represent the record correctly without fake values?”
- Example: A support ticket may be created without an assigned agent.
- Expected behavior: ticket.agent_id is NULL until assignment.
- Validation: Ensure queries for “unassigned tickets” are simple and constraints don’t force placeholder agents.
```sql
-- Unassigned tickets should be easy to find
SELECT ticket_id, created_at
FROM tickets
WHERE agent_id IS NULL;
```
Edge case B: deletions (restrict, cascade, soft delete)
Deletions are not just a UI action; they are a referential integrity decision. Validate what should happen when a referenced row is removed.
- RESTRICT: Prevent deletion if dependent rows exist (common for master data like customers).
- CASCADE: Delete dependent rows automatically (common for purely dependent children like order_lines when an order is removed).
- SET NULL: Keep the child row but remove the link (useful when the relationship is optional and history should remain).
- Soft delete: Keep rows and mark as inactive when you must preserve history and references.
Practical step-by-step: deletion decision test
- Step 1: Pick a parent row that has children (e.g., an order with order_lines).
- Step 2: Attempt to delete the parent.
- Step 3: Confirm the database behavior matches the intended rule (error, cascade, nulling, or blocked by application policy).
- Step 4: Confirm reporting still works (e.g., historical revenue isn’t lost unexpectedly).
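The steps above can be run directly against a scratch database. Here is a sketch using SQLite (note that SQLite requires `PRAGMA foreign_keys = ON` per connection; the tables are illustrative). It demonstrates both CASCADE and RESTRICT behavior in one pass.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE order_lines (
    line_id  INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL
        REFERENCES orders(order_id) ON DELETE CASCADE);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customers(customer_id) ON DELETE RESTRICT);
INSERT INTO orders VALUES (1);
INSERT INTO order_lines VALUES (10, 1), (11, 1);
INSERT INTO customers VALUES (7);
INSERT INTO invoices VALUES (100, 7);
""")

# CASCADE: deleting the parent order silently removes its lines.
con.execute("DELETE FROM orders WHERE order_id = 1")
remaining = con.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]

# RESTRICT: deleting a referenced customer is blocked with an error.
try:
    con.execute("DELETE FROM customers WHERE customer_id = 7")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True

print(remaining, blocked)  # 0 True
```

Run this kind of test for every parent/child pair and confirm the observed behavior is the behavior you intended, not just the database default.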
Edge case C: merging duplicates
Duplicate entities happen: the same customer entered twice, the same supplier with two IDs, etc. A maintainable schema makes merges possible without breaking references.
Practical step-by-step: safe merge approach
- Step 1: Choose the “survivor” row (the ID you will keep).
- Step 2: Re-point foreign keys from the duplicate to the survivor (orders.customer_id, tickets.customer_id, etc.).
- Step 3: Resolve unique constraints (e.g., email uniqueness) before updating.
- Step 4: Delete or deactivate the duplicate according to your deletion policy.
- Step 5: Verify counts (no orphaned rows, no missing history).
```sql
-- Example merge pattern (simplified)
BEGIN;

-- Move references
UPDATE orders  SET customer_id = :survivor_id WHERE customer_id = :duplicate_id;
UPDATE tickets SET customer_id = :survivor_id WHERE customer_id = :duplicate_id;

-- Then remove duplicate (or mark inactive)
DELETE FROM customers WHERE customer_id = :duplicate_id;

COMMIT;
```
If the schema stores customer name redundantly in orders (instead of referencing customers), merges become partial and error-prone. This is exactly the kind of issue edge-case testing reveals.
3) Document the Schema: Definitions and Intentional Omissions
Maintainability depends on shared understanding. Documentation should explain meaning, not restate column names. It should also record what you intentionally do not store, so future changes don’t introduce inconsistent “shadow data.”
What to document for each table
- Purpose: What real-world concept or event the table represents.
- Row meaning (grain): What one row stands for (one customer, one order, one order line).
- Key fields: Primary key and any alternate natural identifiers (if applicable).
- Relationships: What it references and what references it.
- Lifecycle: Whether rows are ever deleted, and how history is preserved.
What to document for each column
- Definition: Business meaning in one sentence.
- Allowed values: Enumerations, ranges, formats.
- Nullability meaning: If NULL is allowed, what does it mean (unknown vs. not applicable)?
- Source of truth: If derived elsewhere, state whether it is stored or computed.
Include “intentionally not stored” decisions
Common examples of intentional omissions:
- Derived totals: Not storing order_total if it can be computed from order_lines (unless performance or audit requirements justify storing it).
- Duplicated descriptive fields: Not storing customer_name on orders if customers is the source of truth.
- Transient UI state: Not storing “currently selected tab” or temporary filters.
- Ambiguous timestamps: Not storing a single “date” when you actually need created_at, submitted_at, paid_at, shipped_at.
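A common way to honor the "derived totals" omission without losing convenience is a view: the total stays computable on demand, and there is no stored copy to drift out of sync. A minimal SQLite sketch (illustrative names):

```python
import sqlite3

# Instead of storing order_total on orders, expose it as a view
# computed from order_lines -- one source of truth, no drift.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE order_lines (
    order_id INTEGER, quantity INTEGER, unit_price REAL);
INSERT INTO order_lines VALUES (1, 2, 10.0), (1, 1, 5.0), (2, 3, 4.0);
CREATE VIEW order_totals AS
    SELECT order_id, SUM(quantity * unit_price) AS order_total
    FROM order_lines GROUP BY order_id;
""")

totals = dict(con.execute("SELECT order_id, order_total FROM order_totals"))
print(totals)  # {1: 25.0, 2: 12.0}
```

Document the view alongside the table so readers know the total is intentionally computed, not forgotten.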
Practical documentation template (copy/paste and fill):
```
Table: orders
Purpose: A customer purchase request. One row per order (header).
Primary key: order_id
Relationships:
  orders.customer_id -> customers.customer_id (required)
  orders has many order_lines
Deletion policy: orders are not hard-deleted; status indicates cancellation.
Columns:
  order_id: Surrogate identifier for the order.
  customer_id: Customer placing the order. Must exist in customers.
  created_at: When the order was created in the system.
  status: Current order state. Allowed: DRAFT, SUBMITTED, CANCELLED, FULFILLED.
Intentionally not stored:
  order_total (computed from order_lines); customer_name (from customers).
```
4) Light Normalization Thinking: Spot Symptoms, Apply Targeted Fixes
You don’t need heavy theory to improve a schema. Use “symptoms” that show up during validation: confusing updates, duplicated facts, and inconsistent values. When you see a symptom, apply a targeted structural improvement.
Symptom 1: update anomalies (same fact stored in multiple places)
Example symptom: customer_email exists in both customers and orders. When a customer changes email, old orders now show the wrong email depending on which column is used.
Targeted improvement: keep the email only in customers (source of truth) and reference customers from orders. If historical “email at time of purchase” is required, store it explicitly as a snapshot column with a clear name (e.g., billing_email_at_purchase) and document why it exists.
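The snapshot approach can be verified with a quick test: capture the value at write time, change the source of truth, and confirm the historical row is untouched. A SQLite sketch (the column `billing_email_at_purchase` is the hypothetical snapshot from the paragraph above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    billing_email_at_purchase TEXT);  -- deliberate, documented snapshot
INSERT INTO customers VALUES (1, 'old@example.com');
""")

# Capture the snapshot explicitly at order time, from the source of truth.
con.execute("""
    INSERT INTO orders (order_id, customer_id, billing_email_at_purchase)
    SELECT 100, customer_id, email FROM customers WHERE customer_id = 1
""")

# The customer later changes their email; the snapshot must not move.
con.execute(
    "UPDATE customers SET email = 'new@example.com' WHERE customer_id = 1")

current, snapshot = con.execute("""
    SELECT c.email, o.billing_email_at_purchase
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.order_id = 100
""").fetchone()
print(current, snapshot)  # new@example.com old@example.com
```

The two values diverging on purpose is the whole point: current email answers "how do I contact them," the snapshot answers "what was billed."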
Symptom 2: repeating groups (multiple similar columns)
Example symptom: phone1, phone2, phone3 columns. Queries and constraints become awkward, and you can’t easily enforce uniqueness or label types.
Targeted improvement: move phones to a child table (customer_phones) with one row per phone number and a type column.
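A sketch of the child-table fix in SQLite (names are illustrative): one row per phone number, a constrained type label, and a composite key that prevents the same number being entered twice for one customer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE customer_phones (
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    phone       TEXT NOT NULL,
    phone_type  TEXT NOT NULL CHECK (phone_type IN ('MOBILE','HOME','WORK')),
    PRIMARY KEY (customer_id, phone));  -- no duplicate numbers per customer
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO customer_phones VALUES
    (1, '555-0100', 'MOBILE'),
    (1, '555-0101', 'WORK');
""")

count = con.execute(
    "SELECT COUNT(*) FROM customer_phones WHERE customer_id = 1"
).fetchone()[0]
print(count)  # 2
```

Adding a fourth phone now requires no schema change, and "list all work numbers" is a simple filter instead of a three-column OR.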
Symptom 3: mixed concepts in one table (hard-to-explain rows)
Example symptom: a single “transactions” table stores payments, refunds, and chargebacks with many nullable columns. Reporting becomes a maze of conditional logic.
Targeted improvement: separate event types into clearer tables or introduce a shared header with type-specific detail tables, depending on how different the attributes and rules are.
Symptom 4: values that should be constrained but aren’t
Example symptom: status is free text, leading to “Shipped,” “shipped,” “SHIP,” etc.
Targeted improvement: enforce allowed values via a check constraint or a reference table, and ensure application code uses the same set.
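A CHECK constraint version of this fix, sketched in SQLite (the status list is illustrative). The comparison is case-sensitive, so the inconsistent spellings from the symptom are rejected at insert time:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        status TEXT NOT NULL
            CHECK (status IN ('DRAFT','SUBMITTED','SHIPPED','CANCELLED')))
""")

con.execute("INSERT INTO orders VALUES (1, 'SHIPPED')")  # allowed value

try:
    con.execute("INSERT INTO orders VALUES (2, 'shipped')")  # wrong case
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, rows)  # True 1
```

A reference table with a foreign key achieves the same guarantee and is the better choice when the value set changes at runtime or needs extra attributes (labels, sort order).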
Normalization thinking here is pragmatic: you’re not chasing perfection; you’re removing the specific structures that cause inconsistent data and painful updates.
Capstone Activity: Validate and Justify a Maintainable Schema
Use this small requirements set to practice end-to-end validation and maintainability decisions.
Requirements set (mini domain)
- A training company offers courses. Each course has a title and a base price.
- Students can enroll in courses. An enrollment has an enrolled date and a status (ENROLLED, CANCELLED, COMPLETED).
- A course can have multiple sessions (scheduled occurrences) with a start datetime and location.
- A student may enroll in a specific session (if sessions exist), but some courses allow “self-paced” enrollments without a session.
- Students can make payments toward an enrollment; multiple payments are allowed.
- Need reports: (a) revenue by month, (b) active enrollments by course, (c) session roster (students per session).
Deliverable A: ER diagram (conceptual)
Create an ER diagram that includes at minimum: Student, Course, Session, Enrollment, Payment. Show cardinalities and optionality clearly (especially the optional link from Enrollment to Session).
Deliverable B: Map to tables with keys and constraints
Write the table list and include primary keys, foreign keys, and key constraints. At minimum, decide:
- How Enrollment references Student and Course (required).
- How Enrollment references Session (optional).
- How Payment references Enrollment (required).
- How you prevent duplicate enrollments (e.g., one active enrollment per student per course, or allow multiple over time with rules).
Example table sketch (you will refine it):
```
students(student_id PK, full_name, email UNIQUE, created_at)
courses(course_id PK, title, base_price)
sessions(session_id PK, course_id FK, starts_at, location)
enrollments(enrollment_id PK, student_id FK, course_id FK, session_id FK NULL,
            enrolled_at, status)
payments(payment_id PK, enrollment_id FK, paid_at, amount, status)
```
Deliverable C: Validate with core queries
Write SQL (or SQL-like pseudocode) for these tasks and confirm joins are clean:
- Revenue by month: sum of successful payments grouped by month.
- Active enrollments by course: count enrollments where status = ENROLLED grouped by course.
- Session roster: list students for a given session ordered by name.
```sql
-- Active enrollments by course
SELECT c.title, COUNT(*) AS active_enrollments
FROM enrollments e
JOIN courses c ON c.course_id = e.course_id
WHERE e.status = 'ENROLLED'
GROUP BY c.title
ORDER BY active_enrollments DESC;
```
Deliverable D: Edge-case tests
- Optional relationship: Insert an enrollment with session_id NULL for a self-paced course; confirm it is allowed and roster queries still work for session-based enrollments.
- Deletion behavior: Decide what happens if a session is deleted. Should enrollments be blocked (RESTRICT), have session_id set to NULL (SET NULL), or be deleted (CASCADE)? Justify based on reporting needs.
- Merging duplicates: Two student records share the same email due to a data import. Decide how uniqueness is enforced and outline the merge steps to preserve enrollments and payments.
Deliverable E: Short justification of modeling choices
Write 6–10 bullet points explaining your main decisions. Include at least:
- Why you chose the grain of Enrollment and Payment.
- How you handled optional session enrollment and what NULL means.
- Your deletion policy for Course, Session, Enrollment, and Payment rows.
- Any intentional omissions (e.g., not storing course_title on enrollments; not storing enrollment_balance if computed).
- One targeted improvement you made after spotting a normalization symptom during validation.