Lesson Advanced

Index Intuition

An index is a secondary data structure that lets the database find rows matching a predicate without scanning the whole table. Traditional OLTP databases (Postgres, MySQL, SQL Server) use B-tree indexes: sorted lookup structures that turn "find user_id = 42" from O(N) into O(log N).
The mental model that matters: indexes are a trade. You pay in write speed and storage; you earn in read speed, but only for the specific access patterns the index supports. Design is about picking the right ones.

Left: without an index, a sequential scan checks every row to find id 42 (O(N)). Right: a B-tree index follows the root node to the matching child node straight to id 42 (O(log N)) — Without an index the engine scans every row (O(N)); a B-tree index follows a few nodes straight to the match (O(log N)).

Selectivity: The One Rule #

Selectivity = (matching rows) / (total rows). An index on status in a table where 90% of orders are 'paid' is nearly useless, because the engine still has to visit most of the table. An index on user_id across 10M orders from 1M users is highly selective (10 rows per user on average). Rule of thumb: indexes pay off when the predicate filters to roughly <10% of the table.

sql

-- Selectivity report for orders.status
SELECT
  status,
  COUNT(*) AS row_count,
  ROUND(COUNT(*) * 1.0 / (SELECT COUNT(*) FROM orders), 3) AS selectivity
FROM orders
GROUP BY status
ORDER BY row_count DESC;

If the selectivity of status = 'paid' is 0.70, a B-tree index on status is worse than a scan. If the selectivity of user_id = 42 is 0.001, an index saves 99.9% of the work.

Composite Indexes: Left-Most Prefix Rule #

A composite index on (user_id, status, created_at) efficiently supports these query shapes:

WHERE user_id = 42
WHERE user_id = 42 AND status = 'paid'
WHERE user_id = 42 AND status = 'paid' AND created_at > '2024-01-01'

But not:

WHERE status = 'paid' alone. The index is sorted by user_id first, so without that prefix the engine can't skip anything.

Column order in a composite index isn't cosmetic. Put equality columns before range columns. Once the index hits a range predicate, columns after it can't be used to narrow the scan. Among the equality columns, order by what your query mix actually filters on (so more queries can use the left-most prefix), and prefer the more selective column earlier when it's a toss-up.

Warning

DuckDB (and other columnar OLAP engines) don't use B-tree indexes the same way.
DuckDB stores data in column chunks with automatic zone maps: min/max statistics per chunk that let the engine skip whole blocks during a scan. For analytics workloads, zone maps + vectorized scans are usually faster than point-lookup indexes. DuckDB does support CREATE INDEX (ART indexes), but the payoff is mostly for highly-selective point lookups, not range scans or aggregations. The intuition below (selectivity, left-most prefix, write cost) still applies when you're designing for Postgres, MySQL, or any OLTP engine. In a warehouse, the equivalent levers are partitioning and clustering keys.

When NOT to Add an Index #

Indexes aren't free. Every insert/update/delete has to update every index on the table. Skip the index when:

The table is small enough that a scan is already fast (a few thousand rows).
The predicate is low-selectivity (returns >20% of the table).
The column changes constantly (write amplification dominates read gain).
The query runs rarely and isn't latency-sensitive.

A table with 12 indexes to serve every possible filter is usually a sign of over-indexing. Start with one index per frequent, selective access pattern, and add more only with evidence from slow-query logs.

Try It #

Inspect selectivity across orders.status. Which columns would make good index candidates in an OLTP version of this schema?

Loading SQL editor...

Practice #

Return a selectivity report for orders.user_id. Output: user_id, row_count, selectivity (fraction of all orders, rounded to 3 decimals). Order by row_count DESC, then user_id ASC as tie-breaker (ordering is not graded).

Loading SQL editor...

Mistakes to Watch For #

Indexing every foreign key "just in case." Each one slows writes and bloats storage without a matching query pattern to justify it.
Putting the low-cardinality column first in a composite index: (status, user_id) is strictly worse than (user_id, status) for most filters.
Expecting a B-tree index to help range scans over a huge fraction of the table. At low selectivity, a sequential scan wins.
Forgetting that OR across two columns usually can't use a composite index. Rewrite as UNION of two indexed lookups when it matters.
Applying OLTP indexing instincts to a columnar warehouse. BigQuery, Snowflake, Redshift, DuckDB all have different internal structures. "What's the equivalent of an index here?" is often partitioning, clustering, or zone maps, not CREATE INDEX.

Knowledge check #

4 questions

0 / 4 answered

An index on a column is most useful when:
A B-tree index on email is not used by which query?
For a foreign key column on a child table (e.g., orders.user_id), should you index it?
In columnar warehouses (BigQuery, Snowflake), the equivalent of "indexing" is usually:

Next Step #

Continue to Normal forms: why OLTP schemas split data across many tables, and why analytics warehouses deliberately flatten it back.