Chapter 10: Replication & High Availability

Replication in MongoDB provides redundancy and high availability through Replica Sets. A replica set is a group of mongod instances that maintain the same data set. This distributed architecture ensures that the system can survive the failure of individual nodes and provides a mechanism for scaling read operations across the cluster.

I. The Oplog: The Distributed State Ledger

The Oplog (Operations Log) is a capped collection (local.oplog.rs) that stores a rolling record of all data-modifying operations. It is the foundational ledger that drives replication.

  • Idempotency: Every entry in the Oplog is idempotent. This ensures that if a secondary applies the same Oplog entry twice (e.g., after a network interruption), the database state remains consistent.
  • Size & Performance: The Oplog size should be configured based on the cluster's write throughput. If the Oplog is too small, a secondary that goes offline for maintenance may "fall off" the Oplog (the oldest entries are overwritten), requiring an expensive Full Initial Sync.
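The sizing trade-off above can be made concrete with a back-of-envelope "Oplog window" estimate. This is an illustrative calculation with hypothetical numbers, not a MongoDB API:

```python
# Illustrative oplog-window estimate (hypothetical figures, not a MongoDB call).
# The "oplog window" is roughly: oplog size / oplog generation rate.

def oplog_window_hours(oplog_size_gb: float, oplog_gb_per_hour: float) -> float:
    """Hours of history the oplog retains before the oldest entries are overwritten."""
    return oplog_size_gb / oplog_gb_per_hour

# A 50 GB oplog on a cluster generating 2 GB of oplog per hour
# keeps roughly 25 hours of history:
window = oplog_window_hours(50, 2)
assert window == 25.0

# If write volume quadruples, the same oplog covers only ~6.25 hours;
# a secondary offline longer than that needs a full initial sync.
assert oplog_window_hours(50, 8) == 6.25
```

The practical takeaway: size the Oplog for your *peak* sustained write rate plus your longest planned maintenance window, not the average.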

II. Election Mechanics & Raft Consensus

MongoDB uses a Raft-based consensus protocol (Replication Protocol Version 1) to manage elections and ensure there is a single Primary node at any time.

  • Heartbeats & Failover: Nodes exchange heartbeats every 2 seconds. If the Primary fails to respond within the election timeout (10 seconds by default), the secondaries initiate an election.
  • Priority & Votes: Nodes with a higher priority value are preferred in elections, and a node with priority: 0 can never become Primary. Only members with votes: 1 count toward the election quorum.
  • Terms: Elections are divided into logical "Terms." A Primary only remains valid as long as its term is current. This prevents "Stale Primaries" from accepting writes after a network partition.
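The term-fencing behavior can be modeled in a few lines. This is a deliberately simplified sketch of the idea, not MongoDB's actual implementation (the Node class and accept_write function here are illustrative inventions):

```python
# Simplified model of how terms fence off a stale primary.
# Not MongoDB internals -- just the Raft-style rule in miniature.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    term: int

def accept_write(primary: Node, cluster_term: int) -> bool:
    """A primary's writes are only valid while its term is current;
    a higher cluster term means an election has happened since."""
    return primary.term >= cluster_term

old_primary = Node("A", term=5)

# Before any new election, term 5 is current and writes are accepted.
assert accept_write(old_primary, cluster_term=5)

# After a partition, the majority side elects a new primary in term 6.
# The isolated node still believes it is primary in term 5, but its
# writes are rejected once the higher term is observed.
assert not accept_write(old_primary, cluster_term=6)
```

In the real system, any write accepted by the stale Primary during the partition (and never replicated to a majority) is rolled back when it rejoins.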

[Figure: replication topology — the Primary (Term 5, writes enabled) streams its Oplog to Secondary A (fetcher thread) and Secondary B (multi-threaded apply).]


III. Oplog Tailing: Fetcher vs. Applier

Secondaries perform replication using two internal components:

  1. Oplog Fetcher: A single-threaded process that "tails" the sync source's Oplog with a tailable cursor (repeated getMore requests). It retrieves batches of operations and places them in an in-memory buffer.
  2. Oplog Applier: A multi-threaded pool that takes batches from the buffer and applies them to the data files. To maintain consistency, operations on the same document are always applied in order, while operations on different documents are parallelized.
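The applier's ordering rule can be sketched by partitioning a batch by document key. This is a simplified model of the idea (MongoDB's real applier partitions work internally; the partition_batch function and the flat "doc_id" field here are illustrative):

```python
# Sketch of the applier's parallelism rule (simplified): operations in a
# batch are grouped by the document they touch, so same-document ops keep
# their oplog order while different documents can be applied concurrently.

from collections import defaultdict

def partition_batch(batch):
    """Group oplog entries by document id, preserving per-document order."""
    buckets = defaultdict(list)
    for op in batch:
        buckets[op["doc_id"]].append(op)
    return buckets

batch = [
    {"doc_id": 1, "op": "set x=1"},
    {"doc_id": 2, "op": "set y=9"},
    {"doc_id": 1, "op": "set x=2"},
]

buckets = partition_batch(batch)

# Document 1's two updates remain in oplog order within one bucket...
assert [op["op"] for op in buckets[1]] == ["set x=1", "set x=2"]
# ...while document 2's update sits in a separate bucket that a
# different worker thread could apply in parallel.
assert len(buckets) == 2
```

This is why a single "hot" document limits apply parallelism: all of its operations land in one bucket and must be applied serially.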

IV. Production Anti-Patterns

  • The Lone Primary (w:1): Relying on w:1 for critical data. The write is acknowledged by the Primary alone, so if the Primary crashes before replication, the write is rolled back after failover. Use w:majority for data you cannot afford to lose.
  • Chained Replication Disabled: Disabling chained replication in cross-region clusters. This forces every secondary to fetch from the Primary, saturating its network.
  • Static Oplog Allocation: Using a fixed, small Oplog in a high-growth cluster. As write volume increases, the "Oplog Window" (time until entries are overwritten) shrinks, risking replication failure.
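The w:1 risk above comes down to simple arithmetic on acknowledgements. A counting sketch (not a MongoDB client call; the majority helper is an illustration of the standard quorum formula):

```python
# Why w:majority survives a primary crash while w:1 may not.
# Pure counting model -- no driver or server involved.

def majority(n_voting_members: int) -> int:
    """Smallest number of nodes forming a strict majority."""
    return n_voting_members // 2 + 1

# In a 3-node replica set, w:majority requires 2 acknowledgements,
# so at least one surviving node always holds the write after failover.
assert majority(3) == 2
assert majority(5) == 3

# w:1 is acknowledged by the Primary alone: if it crashes before either
# secondary replicates, no surviving node has the data.
acks_w1 = 1
assert acks_w1 < majority(3)
```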

V. Performance Bottlenecks

  • Secondary CPU Saturation: If the Oplog Applier threads cannot keep up with the Primary's write volume, the secondary will fall behind, causing Replication Lag.
  • Network Jitter in Quorum: High network latency between nodes can cause heartbeats to fail, triggering "Election Flapping" and cluster instability.
  • IOPS Saturation during Initial Sync: Adding a new node to a multi-terabyte cluster triggers an initial sync that can consume 100% of the disk and network bandwidth for hours.
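The initial-sync cost is easy to underestimate. A back-of-envelope duration estimate, using hypothetical figures and ignoring index builds (which add further time on top of the copy):

```python
# Rough initial-sync copy-time estimate (hypothetical figures, data copy
# only -- index builds and oplog catch-up add more time on top).

def initial_sync_hours(data_tb: float, effective_gbps: float) -> float:
    """Hours to copy data_tb terabytes at effective_gbps gigabits/second."""
    data_gbits = data_tb * 1000 * 8  # TB -> GB -> gigabits
    return data_gbits / effective_gbps / 3600

# Copying 4 TB over an effective 1 Gbps link keeps it saturated for ~8.9 hours.
assert round(initial_sync_hours(4, 1.0), 1) == 8.9
```

This is why large clusters often seed new nodes from a filesystem snapshot or restrict initial syncs to off-peak windows instead of syncing over the wire from the Primary.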