Chapter 10: Replication & High Availability
Replication in MongoDB provides redundancy and high availability through Replica Sets. A replica set is a group of mongod instances that maintain the same data set. This distributed architecture ensures that the system can survive the failure of individual nodes and provides a mechanism for scaling read operations across the cluster.
I. The Oplog: The Distributed State Ledger
The Oplog (Operations Log) is a capped collection (local.oplog.rs) that stores a rolling record of all data-modifying operations. It is the foundational ledger that drives replication.
- Idempotency: Every entry in the Oplog is idempotent. This ensures that if a secondary applies the same Oplog entry twice (e.g., after a network interruption), the database state remains consistent.
- Size & Performance: The Oplog size should be configured based on the cluster's write throughput. If the Oplog is too small, a secondary that goes offline for maintenance may "fall off" the Oplog (the oldest entries are overwritten), requiring an expensive Full Initial Sync.
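The idempotency guarantee can be illustrated with a short sketch. The apply_op helper below is hypothetical, but it models the real behavior: MongoDB rewrites relative operations such as $inc into absolute $set form before logging them, so replaying an entry is harmless.

```python
# Sketch: why idempotent oplog entries can be replayed safely.
# apply_op is a hypothetical stand-in for a secondary's applier.

def apply_op(doc: dict, op: dict) -> dict:
    """Apply a $set-style oplog entry to a document."""
    updated = dict(doc)
    updated.update(op["$set"])
    return updated

doc = {"_id": 1, "count": 5}
# The primary ran {$inc: {count: 1}}, but the oplog records the result:
entry = {"$set": {"count": 6}}

once = apply_op(doc, entry)
twice = apply_op(once, entry)  # replayed after a network interruption
assert once == twice == {"_id": 1, "count": 6}
```

Had the oplog stored the $inc itself, the second application would have produced count: 7 and diverged from the Primary.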
II. Election Mechanics & Raft Consensus
MongoDB uses a Raft-based consensus protocol (replication protocol version 1, pv1) to manage elections and maintain a single Primary node.
- Heartbeats & Failover: Nodes exchange heartbeats every 2 seconds. If the secondaries do not hear from the Primary within the election timeout (10 seconds by default), they initiate an election.
- Priority & Votes: Nodes with a higher priority are more likely to be elected Primary. Only nodes with votes: 1 can participate in the quorum.
- Terms: Elections are divided into logical "Terms." A Primary only remains valid as long as its term is current. This prevents "Stale Primaries" from accepting writes after a network partition.
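The term mechanism above can be sketched as a small simulation. This is a simplified model, not MongoDB's implementation: the Node class and its methods are hypothetical, and real pv1 stamps terms onto heartbeats and oplog entries.

```python
# Sketch: how Raft-style terms fence off a stale primary after a partition.
# Simplified, illustrative model of the term rule only.

class Node:
    def __init__(self):
        self.term = 0

    def start_election(self) -> int:
        """A candidate increments the term before requesting votes."""
        self.term += 1
        return self.term

    def accept_from_primary(self, primary_term: int) -> bool:
        """Reject any message stamped with an older term."""
        if primary_term < self.term:
            return False
        self.term = primary_term
        return True

node = Node()
node.start_election()                        # partition heals after an election; term is now 1
assert node.accept_from_primary(1) is True   # the current primary is accepted
assert node.accept_from_primary(0) is False  # the stale primary's writes are rejected
```

Because every node tracks the highest term it has seen, a partitioned ex-Primary that reconnects with an old term is refused and steps down.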
III. Oplog Tailing: Fetcher vs. Applier
Secondaries perform replication using two internal components:
- Oplog Fetcher: A single-threaded process that "tails" the Primary's Oplog using a getMore query. It retrieves batches of operations and places them in an in-memory buffer.
- Oplog Applier: A multi-threaded pool that takes batches from the buffer and applies them to the data files. To maintain consistency, operations on the same document are always applied in order, while operations on different documents are parallelized.
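The applier's ordering rule can be sketched as follows. This is a simplification under assumed rules (grouping by _id, one worker per group); MongoDB's actual batching logic is more involved.

```python
# Sketch: parallel oplog application that preserves per-document order.
# Operations are grouped by _id; each group is applied sequentially,
# while distinct groups run on different worker threads.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def apply_batch(store: dict, batch: list, workers: int = 4) -> None:
    groups = defaultdict(list)
    for op in batch:
        groups[op["_id"]].append(op)   # order within a group is preserved

    def apply_group(ops):
        for op in ops:                 # same-document ops applied in order
            store.setdefault(op["_id"], {}).update(op["set"])

    # Different documents are applied in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(apply_group, groups.values()))

store = {}
apply_batch(store, [
    {"_id": "a", "set": {"v": 1}},
    {"_id": "b", "set": {"v": 1}},
    {"_id": "a", "set": {"v": 2}},     # must land after {"v": 1} on doc "a"
])
assert store["a"]["v"] == 2 and store["b"]["v"] == 1
```

The key invariant is that reordering is only allowed across documents, never within one, so the secondary converges to the same state the Primary reached.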
IV. Production Anti-Patterns
- The Lone Primary (w:1): Relying on w:1 for critical data. If the Primary crashes before replication, the data is lost. Always use w:majority.
- Chained Replication Disabled: Disabling chained replication in cross-region clusters. This forces every secondary to fetch from the Primary, saturating its network.
- Static Oplog Allocation: Using a fixed, small Oplog in a high-growth cluster. As write volume increases, the "Oplog Window" (time until entries are overwritten) shrinks, risking replication failure.
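The "Oplog Window" in the last anti-pattern is straightforward to estimate. The numbers below are illustrative; on a live cluster the actual figures come from rs.printReplicationInfo() in the shell.

```python
# Sketch: estimating the oplog window from oplog size and write throughput.
# Illustrative numbers only.

def oplog_window_hours(oplog_size_gb: float, write_rate_mb_per_hour: float) -> float:
    """Hours of history the oplog retains before old entries are overwritten."""
    return (oplog_size_gb * 1024) / write_rate_mb_per_hour

# A 50 GB oplog absorbing 2 GB/hour of oplog writes:
window = oplog_window_hours(50, 2 * 1024)
assert window == 25.0
```

Any secondary offline longer than this window falls off the Oplog and must perform a full initial sync, which is why the window should comfortably exceed the longest planned maintenance interval.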
V. Performance Bottlenecks
- Secondary CPU Saturation: If the Oplog Applier threads cannot keep up with the Primary's write volume, the secondary will fall behind, causing Replication Lag.
- Network Jitter in Quorum: High network latency between nodes can cause heartbeats to fail, triggering "Election Flapping" and cluster instability.
- IOPS Saturation during Initial Sync: Adding a new node to a multi-terabyte cluster triggers an initial sync that can saturate disk I/O and network bandwidth for hours.
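Replication Lag, the symptom of the first bottleneck above, is simply the gap between the Primary's and a secondary's last-applied oplog timestamps. A minimal sketch, assuming the optimes have already been read from rs.status() (the datetimes below are made up):

```python
# Sketch: replication lag as the difference between the last-applied
# oplog timestamps of the primary and a secondary. Illustrative values.
from datetime import datetime, timedelta

def replication_lag(primary_optime: datetime, secondary_optime: datetime) -> timedelta:
    """How far a secondary trails the primary."""
    return primary_optime - secondary_optime

primary = datetime(2024, 1, 1, 12, 0, 30)
secondary = datetime(2024, 1, 1, 12, 0, 12)   # applier fell behind
assert replication_lag(primary, secondary) == timedelta(seconds=18)
```

Monitoring this value over time distinguishes transient lag (a burst of writes) from a structural problem such as sustained applier CPU saturation.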