Chapter 8: Performance Tuning & Indexes
Indexes are specialized data structures that optimize data retrieval by providing an alternative, sorted path to records. In SQL engineering, the "Correct Index" is the primary lever for transforming a query from an full table scan to an index seek.
I. Index Structures: B+ Trees & Hash Indexes
1. The B+ Tree (Default Standard)
Most relational databases use B+ Trees for indexing. Unlike standard binary trees, B+ trees have a high fan-out (hundreds of keys per node), keeping the tree depth low (typically 3-4 levels for billions of rows).
- Internal Nodes: Store only keys for navigation.
- Leaf Nodes: Store the actual values and a pointer to the row (or the row itself in clustered indexes). Leaf nodes are doubly-linked to enable fast range scans.
2. Hash Indexes
Hash indexes use a hash table to map keys to row pointers.
- Performance: for exact equality matches (
WHERE id = 5). - Limitation: Cannot be used for range queries (
WHERE id > 5) or sorting, as hash values are not ordered.
II. Index Strategies: Clustered vs. Non-Clustered
1. Clustered Index (Primary)
The table is the index. Rows are physically stored on disk in the order of the clustered index key (usually the Primary Key).
- Constraint: Only one clustered index per table.
- Benefit: Extremely fast for range scans on the PK.
2. Non-Clustered Index (Secondary)
A separate B+ tree structure that stores the index key and a pointer (Bookmark) to the actual data row in the clustered index.
- Index-Only Scan (Covering Index): If the index contains all columns requested by the
SELECTclause, the engine retrieves data directly from the index tree, bypassing the expensive table lookup.
III. Production Anti-Patterns
- Over-Indexing: Adding indexes to every column. Every index adds a synchronous write penalty to every
INSERT/UPDATE/DELETE. In high-write systems, this can drop throughput by 80%. - Indexing Low-Cardinality Columns: Creating an index on a boolean column (e.g.,
is_active). The optimizer will likely ignore the index because a "Table Scan" is more efficient than jumping between the index and table for 50% of the rows. - Unused Indexes: Retaining indexes that are never used by queries. These waste disk space and slow down DML. Use system views (e.g.,
pg_stat_user_indexes) to identify candidates for deletion.
IV. Performance Bottlenecks
- Index Fragmentation: High-frequency random inserts (e.g., random UUIDs) cause B+ tree pages to split, leading to sparse pages and increased tree height. This requires
REINDEXto fix. - Bookmark Lookup (RID Lookup): When a secondary index is used but the query requires a column not in the index. The engine must perform a random disk seek for every matching row to fetch the missing column.
- Stats Skew: Outdated table statistics causing the Cost-Based Optimizer to choose an inefficient plan (e.g., a Nested Loop Join when a Hash Join would be better). Always run
ANALYZEafter bulk loads.