Chapter 5: Specialized Indexing
Beyond standard field indexes, MongoDB offers specialized index types for specific use cases such as document expiration, full-text search, and location-based queries. These indexes use the same underlying B-Tree infrastructure but transform the data before storing it.
I. TTL (Time-To-Live) Internals
TTL indexes automatically remove documents from a collection after a specified duration. A background thread in the mongod process (the TTL monitor) wakes every 60 seconds, scans the TTL index for entries older than the expiration threshold, and deletes the matching documents using standard write operations.
Technical Constraint: Because the TTL monitor runs as a background task, there is no guarantee that a document will be deleted exactly when it expires. On high-volume clusters, a "Deletion Backlog" can occur when documents expire faster than the TTL thread can delete them, leading to temporary storage bloat.
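The sweep-and-delete cycle described above can be sketched in plain Python. This is a simplified model under stated assumptions, not MongoDB's implementation: a sorted list stands in for the B-Tree, and `max_deletes` is a hypothetical knob that models limited deletion throughput. In a real deployment, the index itself would be created with a call such as pymongo's `create_index("createdAt", expireAfterSeconds=3600)`.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone


def build_ttl_index(docs, field):
    """Return sorted (date, _id) pairs, standing in for the TTL B-Tree.

    Assumption: only datetime values are indexed here, because non-date
    values never make a document eligible for expiration.
    """
    return sorted(
        (doc[field], doc_id)
        for doc_id, doc in docs.items()
        if isinstance(doc.get(field), datetime)
    )


def ttl_sweep(docs, index, expire_after_seconds, now, max_deletes):
    """Run one pass of the simplified monitor.

    Returns (deleted, backlog): how many documents were removed and how
    many expired documents were left for the next pass.
    """
    cutoff = now - timedelta(seconds=expire_after_seconds)
    dates = [date for date, _ in index]
    end = bisect_right(dates, cutoff)        # range scan: entries <= cutoff
    victims = index[:min(end, max_deletes)]  # hypothetical throughput cap
    for _, doc_id in victims:
        del docs[doc_id]                     # an ordinary delete
    del index[:len(victims)]
    return len(victims), end - len(victims)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
docs = {
    1: {"createdAt": now - timedelta(hours=2)},
    2: {"createdAt": now - timedelta(minutes=90)},
    3: {"createdAt": now - timedelta(minutes=61)},
    4: {"createdAt": now - timedelta(minutes=10)},
    5: {"createdAt": 1704110400000},  # numeric timestamp, not a date
}
index = build_ttl_index(docs, "createdAt")

# Three documents are past the one-hour threshold, but only two fit in a pass.
first_pass = ttl_sweep(docs, index, 3600, now, max_deletes=2)
# The leftover document is removed on the next 60-second wake-up.
second_pass = ttl_sweep(docs, index, 3600, now + timedelta(seconds=60), max_deletes=2)
```

In this model the first pass leaves one expired document behind, which is the backlog effect in miniature: expired data stays on disk until a later pass catches up.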
II. Text Search & Inverted Indexes
Text indexes support searching string content within a collection. Unlike a standard B-Tree index that stores a direct mapping of string -> RecordID, a text index is an inverted index: it maps each term to the documents that contain it. To build it, the server tokenizes the input string, removes "stop words" (like "the", "a"), and performs stemming (e.g., "running" becomes "run").
- Write Penalty: Text indexes are computationally expensive. Every write to a text-indexed field requires the server to tokenize the new value and maintain one index entry per unique stemmed term, which can mean dozens of entries for a single document.
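- Storage Bloat: Text indexes can be significantly larger than standard indexes because they hold one entry per unique stemmed term per document. On text-heavy collections, the index can rival or exceed the size of the original text data.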
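To make the tokenize, filter, and stem pipeline concrete, here is a minimal sketch of an inverted index in Python. The stop-word list and the `naive_stem` function are illustrative assumptions; MongoDB uses language-specific stop-word lists and a proper stemming algorithm, so real results will differ.

```python
import re

# Assumption: a tiny illustrative stop-word list, not MongoDB's actual list.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "and"}


def naive_stem(word):
    """Very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def terms(text):
    """Tokenize, drop stop words, and stem; returns the unique terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {naive_stem(tok) for tok in tokens if tok not in STOP_WORDS}


def index_document(index, doc_id, text):
    """Add one posting per unique term; returns how many entries were written."""
    doc_terms = terms(text)
    for term in doc_terms:
        index.setdefault(term, set()).add(doc_id)
    return len(doc_terms)


def search(index, query):
    """Return ids of documents matching ANY query term (logical OR)."""
    hits = set()
    for term in terms(query):
        hits |= index.get(term, set())
    return hits


index = {}
entries_written = index_document(index, 1, "The dog is running in the park")
index_document(index, 2, "A runner runs daily")
index_document(index, 3, "Parks are quiet")

running_hits = search(index, "running")  # matches via the shared stem "run"
```

Note that a single short sentence already produces several index entries, one per unique term, which is the source of the write penalty and storage cost listed above.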
III. Geospatial & S2 Geometry Foundations
MongoDB supports complex geospatial queries using 2dsphere indexes. Instead of indexing raw coordinates, MongoDB uses the Google S2 Geometry Library to project the Earth's surface onto the six faces of a cube and then recursively subdivide each face into a hierarchical grid of "cells."
Every coordinate is mapped to a 64-bit S2 Cell ID, and these IDs are stored in a standard B-Tree. S2 numbers its cells along a space-filling (Hilbert) curve, so cells with nearby IDs are also close together geographically. A "nearby" search is therefore transformed into a small number of B-Tree range scans over the cells that cover the search area, rather than a scan of every document.
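The way cell IDs turn a spatial query into range scans can be illustrated with a much smaller model. The sketch below is not the S2 library: it assumes a flat 8 x 8 grid instead of a cube projected from a sphere, uses the classic 2D Hilbert curve mapping as a stand-in for S2 Cell IDs, and uses a sorted list with `bisect` in place of the B-Tree. The place names are made up for the example.

```python
from bisect import bisect_left, bisect_right


def hilbert_id(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on a 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def covering_ranges(order, x0, y0, x1, y1):
    """Cell ids covering an inclusive box, merged into contiguous (lo, hi) ranges."""
    ids = sorted(
        hilbert_id(order, x, y)
        for x in range(x0, x1 + 1)
        for y in range(y0, y1 + 1)
    )
    ranges = []
    for cid in ids:
        if ranges and cid == ranges[-1][1] + 1:
            ranges[-1][1] = cid
        else:
            ranges.append([cid, cid])
    return [tuple(r) for r in ranges]


ORDER = 3  # 8 x 8 grid
points = {"cafe": (1, 1), "park": (1, 2), "museum": (6, 6), "pier": (7, 0)}

# The "index": (cell id, name) pairs kept in sorted order, like B-Tree keys.
index = sorted((hilbert_id(ORDER, x, y), name) for name, (x, y) in points.items())
keys = [cell_id for cell_id, _ in index]


def near(x0, y0, x1, y1):
    """Find points inside the box using one range scan per covering range."""
    hits = []
    for lo, hi in covering_ranges(ORDER, x0, y0, x1, y1):
        start, end = bisect_left(keys, lo), bisect_right(keys, hi)
        hits.extend(name for _, name in index[start:end])
    return sorted(hits)


nearby = near(0, 0, 2, 2)
```

Because consecutive Hilbert positions are always neighboring cells, a compact search box usually collapses into only a few ID ranges, which is what keeps the number of range scans small.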
IV. Production Anti-Patterns
- TTL on Non-Date Fields: Forgetting that TTL indexes only expire documents whose indexed field holds a BSON Date (or an array of Dates). If the field holds a numeric timestamp (milliseconds), the index is still built, but the TTL monitor never expires those documents.
- Regex as a Text Search Replacement: Using $regex for fuzzy searching on large string fields. A standard index can only bound a case-sensitive prefix regex (such as /^abc/); a mid-string regex forces the server to examine every index key or fall back to a COLLSCAN. Use a Text index or a dedicated search engine (like Atlas Search).
- Unbounded Wildcard Indexes: Creating a wildcard index { "$**": 1 } on a collection with thousands of unique keys. This causes "Index Explosion," where the index grows so large that it crowds the working set out of the WiredTiger cache.
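The regex anti-pattern is easier to see with a toy model of an ordered index. The sketch below assumes a sorted Python list in place of the B-Tree and hypothetical helper names (`prefix_scan`, `substring_scan`). It shows why a prefix pattern can be answered by seeking to a bounded key range, while a mid-string pattern has to inspect every key.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Answer a /^prefix/ style query with a bounded range over sorted keys.

    Returns (matches, keys_examined). Assumes a non-empty prefix.
    """
    # Every key starting with "mon" sorts in the half-open range ["mon", "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi], hi - lo


def substring_scan(sorted_keys, needle):
    """Answer a /needle/ style query: sort order does not help, so check every key."""
    matches = [key for key in sorted_keys if needle in key]
    return matches, len(sorted_keys)


keys = sorted(["mongo", "mongoose", "monitor", "postgres", "redis", "diamond", "lemon"])

prefix_matches, prefix_examined = prefix_scan(keys, "mon")
substring_matches, substring_examined = substring_scan(keys, "mon")
```

With seven keys the difference is trivial, but the prefix scan's cost grows with the number of matches while the substring scan's cost grows with the size of the whole index.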
V. Performance Bottlenecks
- TTL Monitor Deletion Lag: In high-write environments, the background delete operations from the TTL monitor can compete for the same WiredTiger write tickets as the application, causing latency spikes for users.
- Stemming CPU Overhead: Text index creation and updates are CPU-intensive. High-frequency updates to large text fields can saturate the server's CPU cores during the tokenization phase.
- S2 Cell Computation Latency: Indexing a complex polygon requires computing a covering of many S2 cells. This involves significant floating-point math and produces multiple index keys per document, adding latency to the write path.