Anatomy of a B-tree Index
A B-tree (more precisely, a B+tree in many database engines) is a balanced search tree designed for storage systems where reading data happens in fixed-size pages (blocks). The index is organized into levels, and all lookups follow the same pattern: start at the top, make decisions using sorted keys, and descend until reaching the leaf level.
Root node
The root is the entry point. It contains a small set of sorted separator keys and pointers to child pages. The root is typically cached in memory because it is accessed for most index operations.
Internal nodes (branch levels)
Internal nodes also contain sorted separator keys plus child pointers. Their job is routing: they narrow the search space by choosing which child page could contain the target key (or key range). They do not usually store row pointers for every key; instead, they store enough information to guide you to the correct leaf page.
Leaf level
The leaf level contains the actual index entries. Leaf pages are kept in sorted key order, and in many implementations leaf pages are linked (a “next leaf” pointer), which makes range scans efficient.
Each leaf entry typically stores:
- Listen to the audio with the screen off.
- Earn a certificate upon completion.
- Over 5000 courses for you to explore!
Download the app
- Index key (e.g.,
last_name, or(last_name, first_name)) - Row locator: either a physical pointer, a row identifier (RID), or the table’s primary key value (common when the table is clustered by primary key).
This is a key idea: an index usually does not store the full row. It stores enough to find the row quickly. Storing full rows in every secondary index would multiply storage and update costs.
Leaf entries: pointing to rows vs storing full rows
Consider a table:
customers(id PK, last_name, first_name, email, city)A secondary B-tree index on last_name might have leaf entries like:
(last_name='Nguyen') -> row_id=8421When you query by last_name, the index can find matching leaf entries quickly, but it may still need to fetch the base table row(s) to return columns like email or city. That extra step is often called a “table lookup” (or “bookmark lookup”).
Some engines can store extra non-key columns in the leaf (often called “included columns”) to avoid fetching the base row for certain queries, but the default mental model is: index leaf entry points to the row, it doesn’t duplicate the whole row.
How Search Navigates the Tree
B-tree search is a repeated “choose the right child page” process. At each node, keys are sorted, so the engine can determine which pointer to follow by comparing the search key to the separator keys.
Worked example: equality predicate
Suppose we have an index on last_name with these leaf keys (simplified):
Leaf pages (sorted order overall): [Anderson, Chen, Diaz, Gupta, Kim, Lopez, Nguyen, Patel, Smith, Zhang]Imagine the root has separator keys that split the space:
Root separators: [Gupta, Nguyen]Conceptually, that means:
- Keys < Gupta go to child A
- Gupta ≤ keys < Nguyen go to child B
- Keys ≥ Nguyen go to child C
Now run:
SELECT * FROM customers WHERE last_name = 'Patel';Step-by-step navigation:
- Step 1 (root): Compare 'Patel' to [Gupta, Nguyen]. 'Patel' ≥ Nguyen, so follow child C.
- Step 2 (internal node): Child C is an internal page with more separators, for example [Patel, Smith]. Compare 'Patel' to these. You choose the child that could contain 'Patel'.
- Step 3 (leaf): Arrive at the leaf page that covers the key range containing 'Patel'. Within the leaf page, find the exact key 'Patel' and read its row locator(s).
- Step 4 (row fetch if needed): Use the row locator(s) to fetch the full row(s) from the table if the query needs columns not present in the index leaf.
If multiple customers share the same last name, the leaf will contain multiple entries for 'Patel' (or one entry pointing to multiple row locators, depending on implementation). The key point is that equality search finds the correct leaf page by descending the tree, then locates the matching key(s) within that leaf.
Worked example: range predicate
Now run a range query:
SELECT id, last_name FROM customers WHERE last_name BETWEEN 'Kim' AND 'Patel' ORDER BY last_name;Step-by-step navigation:
- Step 1 (find start): Descend the tree as if searching for equality on 'Kim'. This lands you on the leaf page where 'Kim' would appear (either exactly, or the next greater key).
- Step 2 (scan forward in leaf order): From that leaf position, read entries in sorted order: Kim, Lopez, Nguyen, Patel…
- Step 3 (follow leaf links): If the range spans multiple leaf pages, follow the leaf-level “next” pointer to continue scanning without going back up the tree.
- Step 4 (stop condition): Stop once the key becomes greater than 'Patel'.
This is why B-trees support both equality and range predicates: equality uses the tree to find a single key location; range uses the tree to find the start, then uses the leaf ordering to read sequentially through the range.
Why B-trees Scale
Logarithmic depth from high fan-out
B-trees are shallow because each node (page) can hold many keys and pointers. If a page can store hundreds of separator keys, then each internal node can route to hundreds of children. That branching factor (fan-out) makes the height grow slowly as the index grows.
In simplified terms, if each internal page points to ~200 child pages, then:
- Height 1 (root only) can cover ~200 leaf pages
- Height 2 can cover ~200×200 = 40,000 leaf pages
- Height 3 can cover ~8,000,000 leaf pages
Real numbers depend on page size, key size, and pointer size, but the practical takeaway is: even large indexes often require only a few page reads to reach the correct leaf.
Page-oriented storage matches disk and memory behavior
Databases read and write in pages (for example, 8 KB or 16 KB). B-trees are built around that reality:
- Nodes are pages: each node fits in a page, so traversing the tree is “read a page, decide, read the next page”.
- Internal pages are reused heavily: many queries touch the same upper-level pages, so caching them in memory is effective.
- Leaf scans are sequential: range queries can read consecutive leaf pages, which is much more efficient than random I/O.
This page-friendly design is a major reason B-trees remain the default general-purpose index structure for OLTP-style workloads.
How Ordering Enables Ordered Retrieval (and Avoids Sorts)
A B-tree index stores keys in sorted order at the leaf level. That ordering can be used directly to satisfy ORDER BY and to produce grouped/ordered results with less work.
Ordered retrieval with an index scan
If you have:
CREATE INDEX idx_customers_last_name ON customers(last_name);Then this query can often be answered by scanning the index in order:
SELECT id, last_name FROM customers ORDER BY last_name;Instead of reading all rows and sorting them, the engine can:
- Start at the leftmost leaf page
- Read leaf entries in order
- Fetch rows only if needed (or avoid fetching if the index covers the selected columns)
Composite indexes and “prefix” ordering
Ordering benefits become more powerful with composite keys. With:
CREATE INDEX idx_customers_name ON customers(last_name, first_name);The leaf entries are ordered first by last_name, and within the same last_name, by first_name. This can support queries like:
SELECT last_name, first_name FROM customers ORDER BY last_name, first_name;and also range queries such as:
SELECT * FROM customers WHERE last_name = 'Kim' ORDER BY first_name;Because all 'Kim' entries are contiguous in the leaf level, and already sorted by first_name, the engine can read them in order without an explicit sort step.
Why avoiding sorts matters
Sorting typically requires reading many rows, comparing keys, and possibly spilling to disk if memory is insufficient. When an index can deliver rows in the needed order, the database can replace a “read then sort” plan with an “ordered index scan” plan, reducing CPU, memory pressure, and I/O.