
Chapter 4: MongoDB Indexing & Query Optimization

Indexes are specialized data structures that store a small portion of the collection's data set in a form that is easy to traverse. In MongoDB, indexes are critical for avoiding Collection Scans (COLLSCAN), which degrade performance linearly as data grows. By providing a sorted representation of specific fields, indexes enable the Query Optimizer to perform binary searches and range seeks with O(log N) complexity.
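The difference between the two access patterns can be sketched in plain Python. This is a conceptual illustration, not MongoDB internals: the "index" is just a sorted list of keys searched with `bisect`, standing in for a B-Tree.

```python
import bisect

# A toy "collection" of keys in insertion order (the order COLLSCAN visits),
# and a sorted "index" over the same keys (as a B-Tree leaf would hold them).
collection = [917, 4, 230, 88, 512, 61]
index = sorted(collection)

def collscan(target):
    """Linear scan: O(N) comparisons in the worst case."""
    for i, value in enumerate(collection):
        if value == target:
            return i
    return -1

def ixscan(target):
    """Binary search over sorted keys: O(log N) comparisons."""
    pos = bisect.bisect_left(index, target)
    if pos < len(index) and index[pos] == target:
        return pos
    return -1

print(collscan(512))   # found after walking most of the collection
print(ixscan(512))     # found in ~log2(6) comparisons
print(ixscan(999))     # a miss is also O(log N), not O(N)
```

On six keys the difference is invisible; at a billion keys it is the difference between ~30 comparisons and a billion.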

I. Index Data Structures: B-Trees & WiredTiger

MongoDB utilizes B-Tree (balanced tree) indexes to maintain sorted data and provide efficient lookups. In the WiredTiger storage engine, an index is stored as a hierarchy of fixed-size pages: the root page points to internal pages, which eventually point to leaf pages containing the actual field values and their corresponding RecordId (the pointer to the document in the collection's data file). B-Trees are designed for disk I/O efficiency; the high fan-out (many children per node) keeps the tree depth low even for billions of records, typically requiring only 3-5 page reads to find any document.
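The "3-5 page reads" claim follows directly from the fan-out arithmetic. The sketch below assumes an illustrative fan-out of ~200 keys per page (a plausible figure, not a WiredTiger constant):

```python
import math

def btree_depth(num_entries, fanout):
    """Approximate B-Tree depth: each level multiplies the number
    of reachable entries by `fanout`, so depth ~ log_fanout(N)."""
    return math.ceil(math.log(num_entries) / math.log(fanout))

# One billion entries at fan-out 200 need only ~4 levels,
# i.e. ~4 page reads from root to leaf:
print(btree_depth(1_000_000_000, 200))
```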

[Figure: IXSCAN descends from the B-Tree root via binary seek, O(log N), while COLLSCAN walks the collection linearly, O(N); the matched index entry holds a direct pointer to the target document.]


II. Compound Indexes & The ESR Rule

A compound index is an index on multiple fields. The order of fields is the single most important factor in its effectiveness. MongoDB follows the ESR Rule (Equality, Sort, Range) for optimal index design.

  1. Equality: Fields queried with exact values (e.g., { status: "ACTIVE" }) must come first. This allows the index to prune the search space to the smallest possible subset immediately.
  2. Sort: The field used for sorting should come next. If the matching index entries are already ordered on this field within each equality group, MongoDB can return documents in index order (a non-blocking sort), avoiding an expensive in-memory sort.
  3. Range: Fields queried with range operators (e.g., $gt, $lt) must come last. Once a range is scanned, the index's internal sort order is "broken" for any subsequent fields.

Technical Insight: If you place a Range field before a Sort field, the index will filter the data correctly, but the matching entries will no longer be in sort order. The query engine must then buffer all matching documents in memory to sort them; this blocking sort is capped at 100MB of RAM, so large result sets make the query fail unless allowDiskUse is enabled.
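Both the ESR ordering and the range-before-sort failure mode can be simulated with sorted tuples standing in for index entries. The field names (status, score, qty) and documents below are hypothetical:

```python
# Toy documents: status (equality), score (sort), qty (range).
docs = [
    {"status": "A", "score": 1, "qty": 10},
    {"status": "A", "score": 3, "qty": 7},
    {"status": "A", "score": 2, "qty": 2},
    {"status": "B", "score": 2, "qty": 1},
]

# ESR-compliant index { status: 1, score: 1, qty: 1 }:
# entries matching status == "A" and qty > 5 come out already ordered by score.
esr_index = sorted((d["status"], d["score"], d["qty"]) for d in docs)
esr_hits = [e for e in esr_index if e[0] == "A" and e[2] > 5]
print([e[1] for e in esr_hits])   # scores in ascending order

# Range-before-sort index { status: 1, qty: 1, score: 1 }:
# the same query matches, but score order is broken by the qty range scan.
bad_index = sorted((d["status"], d["qty"], d["score"]) for d in docs)
bad_hits = [e for e in bad_index if e[0] == "A" and e[1] > 5]
print([e[2] for e in bad_hits])   # scores NOT in ascending order
```

With the ESR index the engine can stream results in order; with the range-first index it must buffer and re-sort them, which is exactly the in-memory sort described above.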


III. Query Execution Analysis (Explain Plan)

To verify whether your indexes are being used effectively, always use the .explain("executionStats") method. The output provides a detailed breakdown of the query execution stages:

  • IXSCAN: The query successfully used an index to locate documents.
  • FETCH: The engine had to retrieve the full document from disk because the index did not "cover" all requested fields.
  • PROJECTION_COVERED: The highest performance state, where all requested data was retrieved directly from the index (no document read required).
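Explain output nests these stages as a tree under winningPlan, with each node pointing to its child via inputStage. A minimal walker, run against a trimmed-down sample of that shape (real output carries many more fields, such as keyPattern and indexBounds, and merge stages use inputStages, plural, which this sketch ignores):

```python
def plan_stages(winning_plan):
    """Walk an explain() winningPlan tree and collect stage names, root first.
    Simplification: follows only the single-child `inputStage` chain."""
    stages = []
    node = winning_plan
    while node:
        stages.append(node["stage"])
        node = node.get("inputStage")
    return stages

# A trimmed-down winningPlan: a FETCH fed by an IXSCAN means the index
# located the documents but did not cover the projection.
sample = {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "status_1"},
}
print(plan_stages(sample))
```

Reading root-first, FETCH over IXSCAN is the common healthy case; IXSCAN alone (no FETCH) indicates a covered query.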

IV. Production Anti-Patterns

  • Indexing Low-Cardinality Fields: Creating a standalone index on a field like is_verified (boolean). The index overhead usually outweighs the benefit, as the query engine will still have to scan a huge percentage of the index pages.
  • Index Prefix Redundancy: Creating an index on { a: 1, b: 1 } and another on { a: 1 }. The second index is redundant because the first one can handle all queries that only filter on a.
  • Uncovered Sorts on Large Datasets: Sorting on a field that the index cannot deliver in order. Beyond the 100MB in-memory limit, MongoDB aborts the query with a "Sort exceeded memory limit" error unless allowDiskUse is enabled.
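The prefix-redundancy rule is mechanical enough to check in code. A small sketch, treating key patterns as ordered field-to-direction mappings (the function name is illustrative):

```python
def is_prefix_redundant(candidate, existing):
    """True if `candidate`'s key pattern is a strict prefix of `existing`'s,
    meaning the existing compound index already serves the candidate's queries.
    Directions must match too: { a: -1 } is not a prefix of { a: 1, b: 1 }."""
    c, e = list(candidate.items()), list(existing.items())
    return len(c) < len(e) and e[:len(c)] == c

print(is_prefix_redundant({"a": 1}, {"a": 1, "b": 1}))            # redundant
print(is_prefix_redundant({"b": 1}, {"a": 1, "b": 1}))            # not a prefix
print(is_prefix_redundant({"a": -1}, {"a": 1, "b": 1}))           # direction differs
```

(Python dicts preserve insertion order, so `{"a": 1, "b": 1}` faithfully models the ordered key pattern.)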

V. Performance Bottlenecks

  • Index RAM Resident Set: If the total size of your indexes exceeds the WiredTiger cache, the system will start "thrashing"—constantly reading index pages from disk and evicting others. This leads to catastrophic latency spikes.
  • B-Tree Page Splits: High-frequency random inserts (e.g., random UUIDs) cause index pages to split, leading to fragmentation and "holes" in the index structure that waste RAM and disk space.
  • Cumulative Write Cost: Every index adds a write penalty. A collection with 10 indexes must update 10 B-Trees on every insert or delete (and on any update that modifies an indexed field), which can saturate disk IOPS.
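The cumulative cost is simple multiplication, but it is worth doing explicitly when sizing hardware. The workload figures below are hypothetical:

```python
def btree_writes_per_insert(num_indexes):
    """Each insert touches the collection's own B-Tree plus one per index."""
    return 1 + num_indexes

# Hypothetical workload: 5,000 inserts/sec against 10 secondary indexes
# must sustain 55,000 B-Tree updates per second before any reads happen.
inserts_per_sec = 5_000
print(inserts_per_sec * btree_writes_per_insert(10))
```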