All courses > Technology and Programming > Databases ::

Indexing 101: How Databases Find Data Fast — Data Access Paths and Why Indexes Matter

Capítulo 1

Estimated reading time: 8 minutes

The core problem: finding a few rows inside a lot of rows

Databases are good at storing data; the hard part is retrieving the right subset quickly as tables grow. If a table has 1,000 rows, many approaches feel “fast enough.” At 10 million rows, the difference between reading everything and reading only what you need becomes the difference between milliseconds and seconds (or worse).

When you run a query like “find the order with id = 12345,” the database must choose a data access path: how it will locate the matching rows. Two common paths are a full scan (read many/all rows) and an indexed lookup (jump directly to the relevant rows).

Two access paths: full scan vs indexed lookup

Example table

orders(order_id, customer_id, status, created_at, total_cents)

Assume the table is stored on disk/SSD in pages (blocks). Each page contains many rows. Reading pages is usually the dominant cost when data doesn’t fit in memory.

Access path 1: full scan (table scan)

A full scan reads the table pages and checks each row against your filter.

SELECT * FROM orders WHERE order_id = 12345;

Step-by-step mental execution for a scan:

Continue in our app.

Listen to the audio with the screen off.
Earn a certificate upon completion.
Over 5000 courses for you to explore!

Or continue reading below...

Download the app

Read page 1 of the table.
For each row on the page, evaluate order_id = 12345.
Keep going page by page until the end (or until the engine can safely stop, which is uncommon unless there’s a special guarantee).
Return matches (likely 0 or 1 row here).

When scans are reasonable:

The table is small (few pages).
The filter matches a large fraction of rows (low selectivity), so you’d end up reading most pages anyway.
You need to touch most rows for an aggregate (e.g., COUNT(*) over the whole table).

Access path 2: indexed lookup

An index is a separate data structure that lets the database find rows by key without scanning the whole table.

CREATE INDEX idx_orders_order_id ON orders(order_id);

Now the same query can use the index:

SELECT * FROM orders WHERE order_id = 12345;

Step-by-step mental execution for an indexed lookup:

Navigate the index to find the entry for key 12345 (typically a small number of index page reads).
Follow the pointer from the index entry to the row location in the table (or directly to the row, depending on the storage engine).
Fetch the row from the table page and return it.

The key difference: instead of reading many table pages, the engine reads a few index pages plus (often) one table page.

What an index really is: a mapping from key values to row locations

Conceptually, an index stores pairs like:

(key_value) -> (row_location)

For example:

(12345) -> (page 812, slot 17)

Most general-purpose databases implement indexes as balanced trees (commonly B-tree variants) for equality and range lookups. The important idea is not the exact structure, but the behavior:

Fast equality lookup: find rows where order_id = ?.
Fast range lookup: find rows where created_at BETWEEN ... or order_id > ?.
Ordered traversal: read keys in sorted order without sorting the whole table.

Because the index is separate from the table, it has its own storage cost and must be maintained when rows are inserted, updated, or deleted.

What “faster” usually means in practice

When people say an index makes a query faster, they typically mean one or more of these improvements:

Fewer page reads (I/O): fewer table pages need to be loaded from disk/SSD into memory.
Less CPU: fewer rows must be examined and fewer comparisons performed.
Less sorting: the index order can satisfy ORDER BY without an explicit sort step.
Less work during joins: indexes can reduce the number of rows considered and speed up matching.

Note that “faster” is not guaranteed. If the filter matches most rows, the engine may still prefer a scan because scanning sequential pages can be cheaper than many random lookups.

A practical mental model: common query execution steps and where indexes help

Many queries can be understood as a pipeline of operations. Thinking in these steps helps you predict whether an index can help and where.

1) Filtering (WHERE)

Filtering reduces rows based on conditions.

SELECT * FROM orders WHERE status = 'PAID';

How an index can help:

If there is an index on status, the engine can jump directly to the entries for 'PAID' instead of scanning all rows.
If 'PAID' is very common (e.g., 90% of rows), the index may not help much; the engine might still read most table pages.

Step-by-step with an index on status:

Find the key 'PAID' in the index.
Read the matching row pointers (possibly many).
Fetch corresponding table rows (or avoid fetching if the index contains all needed columns; see “covering” below).

2) Joining (JOIN)

Joins combine rows from two tables based on a match condition.

SELECT o.order_id, c.name, o.total_cents FROM orders o JOIN customers c ON c.customer_id = o.customer_id WHERE o.created_at >= '2026-01-01';

Where indexes help:

Join key lookups: an index on customers(customer_id) helps find the matching customer for each order.
Reducing the driving set: an index on orders(created_at) can quickly find only recent orders before joining.

Mental execution sketch (one common approach):

Use an index on orders(created_at) to get recent orders.
For each order, use an index on customers(customer_id) to find the matching customer row.

If the join condition lacks an index on the lookup side, the database may be forced into more expensive strategies (e.g., scanning the other table repeatedly or building a hash table), depending on the optimizer’s choice.

3) Sorting (ORDER BY)

Sorting can be expensive because it may require reading many rows and performing an in-memory or disk-based sort.

SELECT order_id, created_at FROM orders WHERE customer_id = 42 ORDER BY created_at DESC;

Where indexes help:

An index on (customer_id, created_at) can both filter by customer_id and produce rows already ordered by created_at, avoiding a separate sort.

Step-by-step with a composite index:

Seek to customer_id = 42 in the index.
Read index entries in descending created_at order.
Return rows (and fetch table data if needed).

If you only have an index on created_at (without customer_id), the engine may still need to scan many entries and filter, or it may sort after filtering.

4) Aggregating (GROUP BY, aggregates)

Aggregations summarize data, often after filtering.

SELECT status, COUNT(*) FROM orders WHERE created_at >= '2026-01-01' GROUP BY status;

Where indexes help (and where they don’t):

An index on created_at can reduce the rows considered by quickly finding the date range.
An index can sometimes help with grouping if it provides rows in an order that matches the grouping key (e.g., an index on (created_at, status) or (status, created_at) depending on the query), but aggregation still must process all qualifying rows.
If the query must count a huge fraction of the table, an index may not reduce work much; you still need to touch many entries/rows.

Simple table examples: seeing the difference in work

Equality lookup on a unique key

SELECT * FROM orders WHERE order_id = 12345;

Full scan: potentially reads all table pages until it finds the row (and often continues logically).
Index on order_id: reads a few index pages + one table page.

Range query

SELECT order_id, total_cents FROM orders WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31';

Full scan: checks every row’s created_at.
Index on created_at: seeks to the start date and scans forward in the index until the end date.

Covering (index-only) access when you select few columns

If your query only needs columns that are present in the index, the database may avoid fetching the table rows at all (often called an index-only scan or covering index behavior).

-- If an index exists on (customer_id, created_at, order_id) or includes order_id appropriately:

SELECT order_id, created_at FROM orders WHERE customer_id = 42 ORDER BY created_at DESC;

Potential benefit: fewer table page reads because results can be produced directly from index pages.

Checklist: questions to ask when considering an index

What query pattern am I optimizing? Equality lookup, range filter, join key lookup, ORDER BY, GROUP BY?
Which columns appear in WHERE/JOIN/ORDER BY? Index the columns used to locate rows and/or produce the required order.
How selective is the filter? If a condition matches a large fraction of rows, an index may not reduce I/O much.
Do I need a composite index? If you filter by one column and sort by another, consider an index that matches that combination and order.
Can it be covering? Can the index contain (or include) the columns needed to return results without extra table lookups?
How often is the table written? Inserts/updates/deletes must maintain the index; more indexes usually mean slower writes.
What is the expected result size? Returning many rows can dominate time even with an index; the index mainly helps you find them.
Is the index supporting multiple queries? Prefer indexes that benefit common, critical queries rather than one-off cases.

Now answer the exercise about the content:

In which situation is a database more likely to choose a full table scan instead of an indexed lookup?

You are right! Congratulations, now go to the next page

You missed! Try again.

If a condition is low-selectivity (matches many rows), the database may read most pages anyway. In that case, a sequential scan can be cheaper than many index lookups plus table fetches.