Chapter 8: Performance Tuning & Indexes

Indexes are specialized data structures that optimize data retrieval by providing an alternative, sorted path to records. In SQL engineering, the "Correct Index" is the primary lever for transforming a query from an $O(N)$ full table scan to an $O(\log N)$ index seek.

I. Index Structures: B+ Trees & Hash Indexes

1. The B+ Tree (Default Standard)

Most relational databases use B+ Trees for indexing. Unlike standard binary trees, B+ trees have a high fan-out (hundreds of keys per node), keeping the tree depth low (typically 3-4 levels for billions of rows).

Internal Nodes: Store only keys for navigation.
Leaf Nodes: Store the actual values and a pointer to the row (or the row itself in clustered indexes). Leaf nodes are doubly-linked to enable fast range scans.

2. Hash Indexes

Hash indexes use a hash table to map keys to row pointers.

Performance: $O(1)$ for exact equality matches (WHERE id = 5).
Limitation: Cannot be used for range queries (WHERE id > 5) or sorting, as hash values are not ordered.

II. Index Strategies: Clustered vs. Non-Clustered

1. Clustered Index (Primary)

The table is the index. Rows are physically stored on disk in the order of the clustered index key (usually the Primary Key).

Constraint: Only one clustered index per table.
Benefit: Extremely fast for range scans on the PK.

2. Non-Clustered Index (Secondary)

A separate B+ tree structure that stores the index key and a pointer (Bookmark) to the actual data row in the clustered index.

Index-Only Scan (Covering Index): If the index contains all columns requested by the SELECT clause, the engine retrieves data directly from the index tree, bypassing the expensive table lookup.

III. Production Anti-Patterns

Over-Indexing: Adding indexes to every column. Every index adds a synchronous write penalty to every INSERT/UPDATE/DELETE. In high-write systems, this can drop throughput by 80%.
Indexing Low-Cardinality Columns: Creating an index on a boolean column (e.g., is_active). The optimizer will likely ignore the index because a "Table Scan" is more efficient than jumping between the index and table for 50% of the rows.
Unused Indexes: Retaining indexes that are never used by queries. These waste disk space and slow down DML. Use system views (e.g., pg_stat_user_indexes) to identify candidates for deletion.

IV. Performance Bottlenecks

Index Fragmentation: High-frequency random inserts (e.g., random UUIDs) cause B+ tree pages to split, leading to sparse pages and increased tree height. This requires REINDEX to fix.
Bookmark Lookup (RID Lookup): When a secondary index is used but the query requires a column not in the index. The engine must perform a random disk seek for every matching row to fetch the missing column.
Stats Skew: Outdated table statistics causing the Cost-Based Optimizer to choose an inefficient plan (e.g., a Nested Loop Join when a Hash Join would be better). Always run ANALYZE after bulk loads.