Indexing 101: Choosing the Right Indexes from Real Query Workloads

Chapter 9

Estimated reading time: 10 minutes


A Repeatable Workflow for Choosing Indexes from Real Workloads

This chapter turns indexing into an operational workflow you can run repeatedly as your application evolves. The goal is not “add indexes until it’s fast,” but “use evidence from real query workloads to propose a small set of indexes, then validate improvements with measurements.”

Workflow Overview

  • Collect representative queries
  • Identify the most expensive and frequent operations
  • Map each query to needed predicates and ordering
  • Propose candidate indexes (single, composite, covering)
  • Evaluate tradeoffs on reads vs writes
  • Validate improvements with measured latency and resource usage

1) Collect Representative Queries

Index selection should be driven by what the system actually runs. Collect queries from production (preferred) or a realistic staging environment with production-like data volume and distribution.

What to capture

  • Query text (normalized if possible): the SQL shape matters.
  • Bind values distribution: not every value is equally common; skew changes the best index.
  • Frequency: queries that run thousands of times per minute can dominate cost even if each is “small.”
  • Latency and resource metrics: duration, CPU time, logical reads (buffer hits), physical reads, rows returned.

Practical collection checklist

  • Enable database query logging/telemetry that aggregates by query fingerprint (same structure, different parameters).
  • Collect at least one full business cycle (e.g., weekday + weekend) if workload varies.
  • Keep a short list of top queries by frequency and by total time (frequency × avg time).
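The checklist above can be sketched in code. The following is a minimal illustration of aggregating raw query logs by fingerprint; the regex-based normalization and the log-record shape are assumptions for the example, not a specific database's telemetry format.

```python
import re
from collections import defaultdict

def fingerprint(sql: str) -> str:
    """Replace literals with placeholders so queries with the same
    structure but different parameters share one fingerprint."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)          # string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)  # numeric literals
    s = re.sub(r"\s+", " ", s)              # collapse whitespace
    return s

def aggregate(log_records):
    """Aggregate (sql_text, duration_ms) records by fingerprint."""
    stats = defaultdict(lambda: {"executions": 0, "total_ms": 0.0})
    for sql, duration_ms in log_records:
        fp = fingerprint(sql)
        stats[fp]["executions"] += 1
        stats[fp]["total_ms"] += duration_ms
    return stats
```

Production systems usually get this for free (e.g. per-fingerprint views in the database's own statistics), but the principle is the same: same structure, different parameters, one bucket.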

2) Identify the Most Expensive and Frequent Operations

Prioritize queries that are either (a) slow and user-visible, or (b) collectively expensive due to high frequency. Also look for “hidden” costs: background jobs, reporting queries, and API endpoints that fan out into many database calls.

How to rank candidates

  • Total time = executions × avg latency
  • Tail latency (p95/p99): a query that is usually fast but occasionally slow can be a major incident driver.
  • Resource intensity: high buffer reads, high CPU, high temp usage, high physical I/O.

Turn a workload into a short target list

Create a table (even a spreadsheet) with columns: query fingerprint, executions/min, avg ms, p95 ms, rows returned, logical reads, physical reads. Select the top 5–20 queries to optimize first.
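Ranking by total time is simple arithmetic, but it is worth seeing why it matters: a "small" query can dominate the workload. The fingerprints and numbers below are hypothetical, purely for illustration.

```python
def top_queries(workload, n=5):
    """Rank query fingerprints by total time (executions x avg latency)."""
    return sorted(workload,
                  key=lambda q: q["executions"] * q["avg_ms"],
                  reverse=True)[:n]

# Hypothetical per-minute stats for three query fingerprints.
workload = [
    {"fingerprint": "q_recent_orders",  "executions": 50_000, "avg_ms": 4.0},
    {"fingerprint": "q_monthly_report", "executions": 1,      "avg_ms": 90_000.0},
    {"fingerprint": "q_ticket_search",  "executions": 2_000,  "avg_ms": 30.0},
]

# The "small but frequent" query dominates: 200,000 ms of total time
# beats a single 90-second report.
print([q["fingerprint"] for q in top_queries(workload, n=3)])
```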

3) Map Each Query to Predicates and Ordering

For each target query, write down what the database must do: filter rows (predicates), join rows, group/aggregate, and order/limit. Indexes primarily help by quickly locating candidate rows and by producing rows in a useful order.


Build a “query requirement card”

For each query, extract:

  • Equality predicates: col = ?, col IN (...)
  • Range predicates: col > ?, col BETWEEN ? AND ?, prefix matches
  • Join keys: columns used to match rows across tables
  • Ordering: ORDER BY columns and direction
  • Projection: columns returned (important for covering strategies)
  • Limit/Top: LIMIT changes the value of an index that supports early exit
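A requirement card is just structured data, so it can be captured as a small record type. This is one possible shape, not a standard format; the field names mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RequirementCard:
    """One query's index-relevant requirements."""
    fingerprint: str
    equality_predicates: list = field(default_factory=list)
    range_predicates: list = field(default_factory=list)
    join_keys: list = field(default_factory=list)
    order_by: list = field(default_factory=list)  # (column, direction) pairs
    projection: list = field(default_factory=list)
    limit: Optional[int] = None

# Q1 from the examples below, expressed as a card.
q1 = RequirementCard(
    fingerprint="recent_paid_orders",
    equality_predicates=["customer_id", "status"],
    order_by=[("created_at", "DESC")],
    projection=["order_id", "created_at", "total_amount"],
    limit=20,
)
```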

Example: requirement cards

Q1: Recent paid orders for a customer (API endpoint /orders/recent)
  • WHERE customer_id = ? AND status = 'PAID'
  • ORDER BY created_at DESC
  • LIMIT 20
  • SELECT order_id, created_at, total_amount
  • Frequency: very high | Rows returned: 20

Q2: Search tickets by assignee and open state (agent dashboard)
  • WHERE assignee_id = ? AND state IN ('OPEN', 'PENDING')
  • ORDER BY priority DESC, updated_at DESC
  • LIMIT 50
  • SELECT ticket_id, priority, updated_at, subject
  • Frequency: medium | Rows returned: 50

Q3: Monthly revenue report (batch)
  • WHERE created_at >= ? AND created_at < ? AND status = 'PAID'
  • GROUP BY day(created_at)
  • SELECT day(created_at), sum(total_amount)
  • Frequency: low (daily) | Reads: very high

4) Propose Candidate Indexes (Single, Composite, Covering)

Now translate each requirement card into one or more candidate indexes. The key is to propose options, not just one answer, because different indexes optimize different parts of the query (filtering vs ordering vs covering) and have different costs.

A repeatable candidate-generation method

  • Start with the most selective filtering conditions that appear consistently.
  • Add columns needed to satisfy ORDER BY when it avoids extra sorting and enables early exit with LIMIT.
  • Consider a covering variant if the query is extremely frequent and returns few rows but is sensitive to extra lookups.
  • Propose at least one minimal index (smallest footprint) and one max-performance index (more columns, possibly covering) when justified.

Guided Example A: Multiple valid indexes for the same query

Query (Q1):

SELECT order_id, created_at, total_amount
FROM orders
WHERE customer_id = ? AND status = 'PAID'
ORDER BY created_at DESC
LIMIT 20;

Option 1: Minimal composite index for filtering + ordering

CREATE INDEX idx_orders_cust_status_created
  ON orders (customer_id, status, created_at DESC);
  • Helps locate rows for one customer and one status, already ordered by created_at for fast “top 20.”
  • May still require lookups to fetch total_amount if not in the index.
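As a quick sanity check, an engine's plan output can confirm that a composite index serves both the filter and the ordering. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in (plan output differs across engines, and SQLite has no INCLUDE clause, so the covering variant is not shown); table and index names mirror the Q1 example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER,
    status TEXT, created_at TEXT, total_amount REAL)""")
conn.execute("""CREATE INDEX idx_orders_cust_status_created
    ON orders (customer_id, status, created_at DESC)""")

# The plan shows an index search and no separate sort step:
# rows already come out in created_at DESC order per (customer, status).
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT order_id, created_at, total_amount FROM orders
    WHERE customer_id = ? AND status = 'PAID'
    ORDER BY created_at DESC LIMIT 20""", (42,)).fetchall()
for row in plan:
    print(row[3])
```

The key signal is the absence of a sort operator (in SQLite, no "USE TEMP B-TREE FOR ORDER BY" line), which is what makes the early exit with LIMIT cheap.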

Option 2: Covering variant for maximum read performance

CREATE INDEX idx_orders_cust_status_created_cover
  ON orders (customer_id, status, created_at DESC)
  INCLUDE (order_id, total_amount);
  • Can avoid extra table lookups for this endpoint (depending on database support for included columns).
  • Higher write and storage cost than Option 1.

Option 3: Two single-column indexes (usually not ideal here)

CREATE INDEX idx_orders_customer_id ON orders (customer_id);
CREATE INDEX idx_orders_status ON orders (status);
  • May not support ordering; may lead to scanning many rows for a customer then filtering by status, or vice versa.
  • Even if the optimizer can combine indexes in some engines, it often won’t beat a purpose-built composite index for a top-N ordered query.
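The weakness of Option 3 is visible in plan output. Continuing the SQLite sketch from before (again only as a proxy for whatever engine you run): with two single-column indexes, neither index matches the ordering, so the planner must add an explicit sort step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER,
    status TEXT, created_at TEXT, total_amount REAL)""")
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")

plan_text = " ".join(row[3] for row in conn.execute("""EXPLAIN QUERY PLAN
    SELECT order_id, created_at, total_amount FROM orders
    WHERE customer_id = 42 AND status = 'PAID'
    ORDER BY created_at DESC LIMIT 20"""))
# SQLite reports an explicit sort: "USE TEMP B-TREE FOR ORDER BY".
# That sort must consume ALL matching rows before LIMIT 20 can apply,
# so the early-exit benefit of the top-N pattern is lost.
print(plan_text)
```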

Learner justification prompt: If status='PAID' is true for 95% of a customer’s orders, does it still belong in the index key? If the endpoint is extremely latency-sensitive, is the covering variant worth the extra write cost?

Guided Example B: When ordering drives the index choice

Query (Q2):

SELECT ticket_id, priority, updated_at, subject
FROM tickets
WHERE assignee_id = ? AND state IN ('OPEN', 'PENDING')
ORDER BY priority DESC, updated_at DESC
LIMIT 50;

Option 1: Composite index aligned to filter + order

CREATE INDEX idx_tickets_assignee_state_priority_updated
  ON tickets (assignee_id, state, priority DESC, updated_at DESC);
  • Supports filtering by assignee and state, then produces rows in the desired order for fast top 50.
  • Good when the dashboard is interactive and must be consistently fast.

Option 2: Composite index that prioritizes ordering after assignee

CREATE INDEX idx_tickets_assignee_priority_updated
  ON tickets (assignee_id, priority DESC, updated_at DESC);
  • Can be better if state is not selective (most tickets are open/pending) or if the UI sometimes omits the state filter.
  • But may scan more rows per assignee to filter out unwanted states.

Option 3: Covering variant

CREATE INDEX idx_tickets_assignee_state_priority_updated_cover
  ON tickets (assignee_id, state, priority DESC, updated_at DESC)
  INCLUDE (ticket_id, subject);
  • Useful if the table rows are wide (large text fields) and the dashboard is hot.

Learner justification prompt: If each assignee has 10,000 open/pending tickets, which option reduces work the most? If each assignee has only 30 open/pending tickets, which option is “good enough” with the smallest footprint?

Guided Example C: Reporting query vs OLTP query (choose differently)

Query (Q3):

SELECT date_trunc('day', created_at) AS day, sum(total_amount)
FROM orders
WHERE created_at >= ? AND created_at < ? AND status = 'PAID'
GROUP BY date_trunc('day', created_at);

Option 1: Index to support the date range scan

CREATE INDEX idx_orders_created_status
  ON orders (created_at, status);
  • Helps restrict to the time window quickly, then filter by status.
  • Often reasonable if many queries slice by date.

Option 2: Index to support status-first access

CREATE INDEX idx_orders_status_created
  ON orders (status, created_at);
  • Can be better if status='PAID' is rare and the report spans a large date range.

Option 3: Avoid adding an index if it’s a one-off batch

  • If this report runs once per day and can be scheduled off-peak, you may accept a slower scan rather than adding a high-maintenance index to the OLTP table.

Learner justification prompt: If 90% of orders are PAID, which index key order is more likely to reduce scanned rows for a one-month report? If only 5% are PAID, how does that change?
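One way to reason about this prompt is back-of-the-envelope arithmetic on rows examined. The numbers below are hypothetical (12M orders spread evenly over 12 months, a one-month report window), and the estimates are deliberately rough: they ignore index depth, caching, and correlation between status and time.

```python
# Hypothetical workload: 12M orders over 12 months, one-month report.
TOTAL_ROWS = 12_000_000
ROWS_IN_MONTH = TOTAL_ROWS // 12  # date predicate keeps 1/12 of rows

def rows_examined(paid_fraction, key_order):
    """Rough estimate of rows an index scan must touch for Q3."""
    if key_order == ("created_at", "status"):
        # Range-scan the month, filter status inside the range.
        return ROWS_IN_MONTH
    if key_order == ("status", "created_at"):
        # Jump to status='PAID', then range-scan that month's slice.
        return int(TOTAL_ROWS * paid_fraction / 12)
    raise ValueError(key_order)

# At 90% PAID, status-first barely helps (900k vs 1M rows);
# at 5% PAID, it cuts the scan by ~20x (50k rows).
print(rows_examined(0.90, ("created_at", "status")))  # 1000000
print(rows_examined(0.90, ("status", "created_at")))  # 900000
print(rows_examined(0.05, ("status", "created_at")))  # 50000
```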

5) Evaluate Tradeoffs: Reads vs Writes (and Operational Cost)

Every index is a product decision: you are buying faster reads with slower writes, more storage, and more maintenance overhead. Evaluate candidates using the workload, not in isolation.

Tradeoff checklist per candidate index

  • Read benefit: which queries get faster, and by how much (expected)?
  • Write cost: which tables are write-heavy, and how much additional index maintenance will occur?
  • Storage footprint: will the index fit in memory/cache more often, or will it increase cache pressure?
  • Concurrency impact: will extra index maintenance increase contention in hot tables?
  • Redundancy: does an existing index already cover the same leading columns and query shapes?
  • Stability: will the index remain useful as the product evolves, or is it narrowly tailored to a soon-to-change query?

Practical decision rule

Prefer the smallest index that achieves the target latency for the highest-impact queries. Use covering indexes selectively for extremely frequent, latency-sensitive endpoints where avoiding lookups measurably reduces time and I/O.

6) Validate Improvements with Measured Latency and Resource Usage

Index changes must be validated with measurements, not assumptions. Validation should include both the target queries and the broader workload to catch regressions.

Validation steps

  • Baseline: capture current latency (avg and p95/p99), CPU, logical reads, physical reads for the target queries.
  • Apply index change in staging with production-like data; then in production using safe rollout practices.
  • Compare execution metrics: did logical reads drop? Did sorting disappear? Did row counts scanned decrease?
  • Check side effects: write latency, lock waits, replication lag, storage growth.
  • Keep or revert: if the improvement is small or the cost is high, remove the index and document why.
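The keep-or-revert decision can be encoded as a simple gate over the before/after numbers. The metric names, structure, and the 30% threshold below are illustrative choices, not a standard; the point is that success requires both latency AND logical reads to drop, so a caching artifact alone does not pass.

```python
def validate(baseline, after, latency_target=0.30):
    """Keep the index only if p95 latency improves by at least the
    target AND logical reads also drop (not just cache warming)."""
    p95_gain = 1 - after["p95_ms"] / baseline["p95_ms"]
    reads_gain = 1 - after["logical_reads"] / baseline["logical_reads"]
    return p95_gain >= latency_target and reads_gain > 0

baseline = {"p95_ms": 120.0, "logical_reads": 4500}
after = {"p95_ms": 35.0, "logical_reads": 300}
print(validate(baseline, after))  # True: both latency and reads dropped
```

A fuller version would also gate on the side effects listed above (write latency, lock waits, replication lag) before declaring success.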

What “success” looks like

  • Target query p95 latency improves meaningfully (define a threshold, e.g., 30–50% reduction).
  • Logical reads and CPU drop proportionally (not just latency shifting due to caching).
  • No unacceptable regression in write throughput or background maintenance.

Putting It Together: A Worksheet You Can Reuse

Index selection worksheet (fill this out per query)

  • Query fingerprint:
  • Frequency:
  • Avg / p95 latency:
  • Rows returned:
  • Logical reads:
  • Physical reads:
  • Predicates (equality):
  • Predicates (range):
  • Join keys:
  • ORDER BY + LIMIT:
  • Columns selected:
  • Candidate index A (minimal):
  • Candidate index B (ordering):
  • Candidate index C (covering):
  • Expected benefit:
  • Expected write/storage cost:
  • Validation plan + success metric:

Compact Reference Table: Choosing Index Types (and When to Avoid Another Index)

| Need / Situation | Prefer | Why | Avoid when | Notes |
|---|---|---|---|---|
| Filter on one column, many queries share it | Single-column index | Small, reusable | Column has very low selectivity or query returns large % of table | Often a building block, but don’t stack many singles if queries need combined filtering |
| Filter on multiple columns together (common pattern) | Composite index | Matches real query shape | Workload rarely uses the combination | Choose columns based on consistent predicates and access pattern |
| Filter + ORDER BY + LIMIT (top-N) | Composite index that matches filter then ordering | Enables ordered retrieval and early exit | ORDER BY differs across endpoints; index would be too specific | Consider separate indexes for distinct orderings only if both are high-impact |
| Query is extremely frequent and reads few rows but is lookup-heavy | Covering index (include projected columns) | Avoids extra table lookups | Table is write-heavy and benefit is marginal | Use selectively; measure read I/O reduction |
| Many similar indexes exist already | Reuse/extend existing index | Minimizes write/storage overhead | Extending makes it too wide and harms other queries | Prefer one strong index over multiple near-duplicates |
| Adding an index helps one low-frequency batch/report | Often avoid adding | Operational cost outweighs benefit | The batch is business-critical and runs during peak | Consider scheduling, caching, or pre-aggregation before adding OLTP indexes |
| Considering “just one more index” | Add only with measured target improvement | Prevents over-indexing | No clear workload owner, no validation plan, no rollback plan | Every index should have a named query set and success metric |

Exercise

When optimizing a very frequent query that filters by multiple columns and must return the “top N” rows in a specific order, which indexing approach best supports fast retrieval and early exit?


Answer: A composite index aligned to both filtering and ordering can produce rows in the needed order, reducing sorting and enabling early exit with LIMIT. Single-column indexes often can’t support the ordered top-N pattern as well.
