Why Data Distribution Determines Whether an Index Helps
An index is most useful when it can quickly narrow the search to a small portion of a table. Whether it can do that depends less on “having an index” and more on how values are distributed in the indexed column(s). To reason about this in a measurable way, focus on three related ideas: cardinality (how many distinct values exist), selectivity (how many rows a predicate matches), and skew (whether some values are much more common than others).
Cardinality: How Many Distinct Values Exist
Cardinality is the number of distinct values in a column (or in a combination of columns). Higher cardinality usually increases the chance that a filter on that column will be selective.
Examples
- High cardinality: `email`, `order_id`, `uuid` (often near-unique).
- Medium cardinality: `country_code` (hundreds), `status` (a few), `category_id` (thousands).
- Low cardinality: `is_active` (2), `gender` (small set), `deleted_flag` (2).
Cardinality alone is not enough: a column can have many distinct values but still be skewed (some values appear far more than others), which affects selectivity for specific predicates.
Selectivity: Fraction of Rows Matched by a Predicate
Selectivity is commonly expressed as the fraction (or percentage) of rows that match a predicate. If a table has 10,000,000 rows and a predicate matches 10,000 rows, selectivity is 0.1%.
How to estimate selectivity quickly
Use this mental model for equality predicates when values are roughly uniform:
```
estimated_rows_for_value ≈ total_rows / distinct_values
```

Then:

```
selectivity ≈ estimated_rows_for_value / total_rows = 1 / distinct_values
```

For ranges (e.g., dates), estimate what fraction of the domain you’re selecting (e.g., “one day out of 365 days” ≈ 0.27%), then adjust for seasonality or known spikes.
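The estimates above can be sketched in a few lines of Python (a toy uniform-distribution model, not any engine's actual estimator; the function names are ours):

```python
def estimate_equality_selectivity(total_rows: int, distinct_values: int) -> float:
    """Uniform assumption: each value matches ~ total_rows / distinct_values rows."""
    estimated_rows_for_value = total_rows / distinct_values
    return estimated_rows_for_value / total_rows  # simplifies to 1 / distinct_values

def estimate_range_selectivity(selected_span: float, domain_span: float) -> float:
    """Fraction of the domain the range covers, assuming uniform spread."""
    return selected_span / domain_span

# Equality: 10,000,000 rows, 1,000,000 distinct values -> ~10 rows per value.
print(estimate_equality_selectivity(10_000_000, 1_000_000))
# Range: one day out of 365 days.
print(f"{estimate_range_selectivity(1, 365):.2%}")  # → 0.27%
```

Both functions encode the same uniformity assumption the text warns about: skew breaks them for specific values.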
Skew and “Hot Values”: When Uniform Assumptions Break
Skew means the distribution is not uniform. A small number of values may account for a large portion of rows (hot values). This matters because an index that is great for rare values may be poor for hot ones.
Common skew patterns
- Hot status: 90% of rows have `status = 'OPEN'`; the rest are spread across other statuses.
- Tenant skew: One customer/tenant owns 70% of rows.
- Time skew: Most queries hit “recent” data; most rows are old (or vice versa).
- Zipf-like distribution: A few categories are extremely popular; many categories are rare.
Skew is why “column has an index” is not the same as “predicate will use the index.” The specific value matters.
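The effect of skew on per-value selectivity can be illustrated with a small Python sketch (the status distribution here is made up):

```python
from collections import Counter

# Hypothetical skewed status column: one hot value dominates.
rows = ["OPEN"] * 9_000 + ["CLOSED"] * 700 + ["ESCALATED"] * 200 + ["ON_HOLD"] * 100
counts = Counter(rows)
total = len(rows)

for value, count in counts.most_common():
    print(f"{value:<10} selectivity = {count / total:.2%}")

# The same predicate shape, status = ?, is ~90% selective for OPEN
# but only ~1% for ON_HOLD: one index, very different best plans.
```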
Scenario 1: High-Selectivity Filters (Good Index Candidates)
High selectivity means the predicate matches a small fraction of rows. This is where indexes typically shine, because the engine can touch far fewer rows/pages than a full scan.
Practical examples
Lookup by unique or near-unique value
```sql
SELECT * FROM users WHERE email = 'a@b.com';
```

If `email` is unique, selectivity is ~1 row out of N. An index is almost always beneficial.

Highly selective range

```sql
SELECT * FROM orders WHERE created_at >= '2026-01-15' AND created_at < '2026-01-16';
```

If the table spans years and you’re selecting one day, the fraction may be small enough for an index to win, especially if queries often target recent days.
Step-by-step: deciding if a high-selectivity predicate is a good index candidate
- Step 1: Estimate selectivity (rows matched / total rows).
- Step 2: Check whether the query needs many columns. If it needs only indexed columns, the index can be “covering” and avoids extra row lookups.
- Step 3: Consider whether the query returns few rows (e.g., tens to thousands). Index access tends to be favorable.
- Step 4: Consider sort/group needs. If the index order matches `ORDER BY` or grouping, it may save extra work.
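The four steps above can be sketched as a rough heuristic in Python (the thresholds and scoring logic are assumptions for illustration only; real optimizers use cost models, and step 4's ordering concern is omitted here):

```python
def looks_like_good_index_candidate(
    matched_rows: int,
    total_rows: int,
    query_columns: set,
    index_columns: set,
    max_selectivity: float = 0.01,  # assumed cutoff: predicate matches <= 1% of rows
) -> bool:
    # Step 1: estimate selectivity of the predicate.
    if matched_rows / total_rows > max_selectivity:
        return False
    # Step 2: a covering index avoids a base-row lookup per match.
    covering = query_columns <= index_columns
    # Step 3: small result sets keep per-row lookup costs bounded.
    few_rows = matched_rows <= 10_000  # assumed bound, purely illustrative
    return covering or few_rows

# Unique email lookup on a 5M-row table: strongly favorable.
print(looks_like_good_index_candidate(1, 5_000_000, {"email", "name"}, {"email"}))  # True
```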
Scenario 2: Low-Selectivity Filters (Often Poor Index Candidates)
Low selectivity means the predicate matches a large fraction of the table. In that case, using an index can cause many random reads plus many row lookups, which can be slower than scanning the table (or scanning a smaller structure, depending on the engine).
Practical examples
Boolean flags
```sql
SELECT * FROM events WHERE is_active = true;
```

If 80% of rows are active, the index would point to most of the table. A scan is often cheaper.
Hot value on a skewed column
```sql
SELECT * FROM tickets WHERE status = 'OPEN';
```

If `OPEN` is the hot value (say 90%), an index on `status` may not help for that value, even if it helps for rare statuses.
Why low selectivity often leads to scans
- Too many matches: The engine ends up touching a large portion of the table anyway.
- Random I/O cost: Index-driven access can jump around the table, causing more expensive access patterns than a sequential scan.
- Lookup amplification: If the index does not contain all needed columns, each matching index entry may require an extra fetch of the base row.
Scenario 3: Mixed Conditions (Composite Selectivity and Index Design)
Real queries often have multiple predicates. The combined selectivity can be high even if one predicate alone is not. This is where multi-column (composite) indexes and predicate combinations matter.
Example: one low-selectivity predicate plus one high-selectivity predicate
```sql
SELECT * FROM orders WHERE status = 'OPEN' AND customer_id = 12345;
```

`status = 'OPEN'` might match a lot of rows, but `customer_id = 12345` might match very few. Together, the result set may be small.
How to reason about combined selectivity
A rough estimate (when predicates are independent) is:
```
combined_selectivity ≈ selectivity(predicate1) * selectivity(predicate2)
```

Example: if `status = 'OPEN'` matches 50% (0.5) and `customer_id = 12345` matches 0.01% (0.0001), the combined selectivity is ~0.00005 (0.005%), which is highly selective.
Independence is often false (e.g., some customers have more OPEN orders), but the estimate is still useful for intuition.
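The independence estimate is a one-line multiplication; a minimal Python sketch:

```python
def combined_selectivity(*selectivities: float) -> float:
    """Multiply individual selectivities (valid only if predicates are independent)."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

# status = 'OPEN' matches 50%; customer_id = 12345 matches 0.01%.
combined = combined_selectivity(0.5, 0.0001)
print(f"{combined:.6f}")  # 0.000050 -> about 0.005% of the table
```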
Composite index implications (without re-teaching index internals)
- Leading column matters: A composite index is typically most effective when the query can use the leftmost (leading) columns in the index definition.
- Put the most selective predicate first? Often helpful, but not a universal rule. If you also need to support common range predicates or ordering, the best column order can change.
- Covering reduces lookups: If the index includes (or stores) the columns needed by the query, it can avoid fetching the base row for each match.
Why an Optimizer Might Choose a Scan Even When an Index Exists
Query optimizers choose plans based on estimated cost, not on the existence of an index. A scan can be chosen when the optimizer estimates that scanning will touch fewer resources (I/O, CPU, memory) than using the index.
Common reasons scans win
- Low selectivity: Too many rows match, so index access plus lookups costs more than scanning.
- Many columns requested: `SELECT *` often forces many lookups if the index doesn’t cover the query, making index access expensive.
- Small table: Scanning a small table can be cheaper than navigating an index and performing lookups.
- Out-of-date or coarse statistics: If the optimizer misestimates selectivity (especially with skew), it may pick a scan or an index plan incorrectly.
- Parameter sensitivity: The same query shape with different parameter values can have very different selectivity (hot vs rare values). A plan chosen for one value may be reused for another.
- Parallelism and throughput: Some engines can parallelize scans efficiently; index lookups may be less parallel-friendly depending on access pattern.
Cost intuition: seek + lookups vs scan
Think of two competing costs:
- Scan cost: proportional to table size (or partition size) read sequentially.
- Index plan cost: cost to traverse the index + cost to fetch matching rows (often many random reads) + cost to process them.
As the number of matched rows grows, the index plan cost grows quickly because each additional match can trigger additional row fetch work. At some threshold, scanning becomes cheaper.
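A toy cost model makes the crossover concrete. The unit costs below are invented constants chosen only to show the shape of the trade-off, not measurements from any real engine:

```python
def scan_cost(total_rows: int, seq_read_cost: float = 1.0) -> float:
    """Sequential scan: pay a small cost once per row."""
    return total_rows * seq_read_cost

def index_plan_cost(matched_rows: int, traverse_cost: float = 100.0,
                    lookup_cost: float = 40.0) -> float:
    """Traverse the index once, then one (often random) base-row fetch per match."""
    return traverse_cost + matched_rows * lookup_cost

total = 10_000_000
for matched in (100, 10_000, 500_000, 5_000_000):
    winner = "index" if index_plan_cost(matched) < scan_cost(total) else "scan"
    print(f"{matched:>9,} matched rows -> {winner}")
# With these invented constants, the crossover sits near
# 250,000 matches, i.e. ~2.5% of the table.
```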
Practical Exercises: Predict the Better Access Path
These exercises train you to estimate selectivity and reason about whether an index seek plus lookups is likely better than a scan. Assume a row lookup is relatively expensive compared to sequential scanning, and assume the index is on the filtered column(s) but does not necessarily cover all selected columns.
Exercise 1: High cardinality equality
Table: users, 5,000,000 rows. email has 5,000,000 distinct values (unique). Query:
```sql
SELECT * FROM users WHERE email = ?;
```

- Step 1: Estimate matched rows: ~1.
- Step 2: Selectivity: 1 / 5,000,000 = 0.00002%.
- Prediction: Index seek is overwhelmingly likely to beat a scan, even if it needs one lookup to fetch the full row.
Exercise 2: Low cardinality flag
Table: events, 100,000,000 rows. is_active: 90% true, 10% false. Query:
```sql
SELECT * FROM events WHERE is_active = true;
```

- Step 1: Matched rows: ~90,000,000.
- Step 2: Selectivity: 90%.
- Prediction: A scan is likely cheaper than using an index, because the index would still lead to fetching most rows.
Exercise 3: Skewed “hot value” vs rare value
Table: tickets, 20,000,000 rows. status distribution: OPEN 70%, CLOSED 25%, ESCALATED 5%. Query A:
```sql
SELECT * FROM tickets WHERE status = 'OPEN';
```

Query B:

```sql
SELECT * FROM tickets WHERE status = 'ESCALATED';
```

- Step 1: Matched rows A: 14,000,000 (70%). Matched rows B: 1,000,000 (5%).
- Step 2: Predict plan A: scan likely. Predict plan B: index more plausible, but still depends on whether 1,000,000 lookups is too many and whether the query can be covered.
- Step 3: If Query B returns only a few columns that can be covered by the index, index usage becomes much more attractive.
Exercise 4: Mixed predicates with a selective column
Table: orders, 50,000,000 rows. status: OPEN 40%. customer_id has 10,000,000 distinct values, but distribution is uneven. Query:
```sql
SELECT order_id, created_at, total_amount
FROM orders
WHERE status = 'OPEN' AND customer_id = ?;
```

- Step 1: Estimate rows for a typical customer: 50,000,000 / 10,000,000 = 5 orders on average.
- Step 2: Combine with OPEN fraction: 5 * 0.4 = 2 rows (rough estimate).
- Step 3: Prediction: with an index that can efficiently use `customer_id` (and ideally also `status`), an index plan is likely better than scanning 50,000,000 rows.
- Step 4: If the customer is a “whale” with millions of orders, the estimate breaks; the optimizer may prefer a different plan for that parameter value.
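The two-step estimate in this exercise is easy to check with a few lines of Python:

```python
total_rows = 50_000_000
distinct_customers = 10_000_000
open_fraction = 0.4

# Step 1: uniform estimate of rows per customer.
avg_orders_per_customer = total_rows / distinct_customers  # 5.0
# Step 2: combine with the OPEN fraction (independence assumed).
expected_matches = avg_orders_per_customer * open_fraction  # 2.0
print(expected_matches)  # ~2 rows vs a 50,000,000-row scan
```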
Exercise 5: Range predicate on time with recent-data skew
Table: pageviews, 2,000,000,000 rows over 2 years. Most queries ask for the last hour; ingestion is steady. Query:
```sql
SELECT COUNT(*) FROM pageviews WHERE viewed_at >= NOW() - INTERVAL '1 hour';
```

- Step 1: One hour out of two years is a tiny fraction, so selectivity is high.
- Step 2: Prediction: an index on `viewed_at` is likely beneficial for this query pattern.
- Step 3: If the query asks for a large range (e.g., last 6 months), selectivity drops and a scan (or partition pruning, if applicable) may be better.
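A quick Python check of the range fractions involved (ignoring leap days and assuming steady ingestion, as the exercise states):

```python
hours_in_two_years = 2 * 365 * 24  # 17,520
one_hour = 1 / hours_in_two_years
six_months = (6 * 30 * 24) / hours_in_two_years

print(f"last hour:     {one_hour:.5%}")   # a tiny fraction -> index-friendly
print(f"last 6 months: {six_months:.1%}")  # roughly a quarter of the table
```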
Validating in Practice: Avoid Assumptions
Predictions based on cardinality and selectivity are a starting point, not a guarantee. Validate with real query behavior and real data distribution.
Practical validation checklist
- Measure actual matched rows: Compare your estimate to reality (e.g., how many rows does the predicate return?).
- Check plan choices for different values: Test hot values and rare values; skew can flip the best plan.
- Observe whether lookups dominate: If the query needs many columns and returns many rows, lookups can overwhelm index benefits.
- Watch for misestimates: If the plan seems wrong, the optimizer may be working with inaccurate selectivity estimates.
- Use repeatable tests: Run the query under similar conditions and compare elapsed time and logical/physical reads where your database exposes them.