MongoDB in production: replication, failover and backups that hold up

MongoDB's defaults are optimized for developer velocity, not operational survival. The flexible schema and fast iteration cycle are genuine advantages — but those same qualities make it trivially easy to ship a single-node instance with w:1 write concern, no index coverage on your hottest query, and a daily mongodump that has never been restored, straight into production. The first real incident — a node crash, a primary election at peak traffic, a developer dropping the wrong collection — exposes the distance between a working demo and a system that holds up. That distance is almost entirely operational. This post covers the specific decisions that determine which side of that line you end up on.

MongoDB production baselines

~12 s

Typical failover window

default electionTimeoutMillis

w:majority

Default write concern

MongoDB 5.0+ on replica sets

~1 min

Minimum RPO achievable

continuous oplog backup

Source: MongoDB Docs, default replica set configuration; MongoDB 5.0 release notes

A replica set is the minimum production unit

A standalone mongod instance has no automatic failover, no replication lag to signal data pressure, and no safe backup target that avoids interrupting writes on the primary. It is not a production deployment; it is a development convenience that happens to be listening on a public port.

A replica set needs three voting members — one primary and two secondaries — to maintain quorum after a single node failure. Two members sounds like it offers redundancy, but it does not: if one goes down, the survivor cannot see a majority of the set and steps down to read-only. You lose write availability at exactly the moment when write availability matters most.

The two standard topologies are PSS (Primary-Secondary-Secondary, all nodes hold data) and PSA (Primary-Secondary-Arbiter). PSA is cheaper and the worse choice for anything important. An arbiter votes but holds no data. On a PSA set, a primary failure leaves you with one full copy of your data until the former primary recovers. PSS maintains two complete, replicated copies at all times.

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017", priority: 2 },
    { _id: 1, host: "mongo2:27017", priority: 1 },
    { _id: 2, host: "mongo3:27017", priority: 0, hidden: true,
      secondaryDelaySecs: 3600 }
  ]
})

The third member here is a hidden delayed secondary — invisible to read preference routing, one hour behind the primary, ineligible for election (priority: 0). Its purpose is a rolling undo window: when a developer drops a collection or an application bug corrupts data, you have one hour to promote this node before the oplog overwrites the point of divergence. It costs one node's worth of storage and replication bandwidth. It is worth it.

Spread your three members across three separate availability zones. A two-AZ deployment can lose quorum on a single AZ failure if the AZ hosting the primary goes down — a single zone outage should never take your writes offline.

Write concern and read preference: choose your durability contract

Write concern controls when MongoDB returns a success response to your application. It is a direct trade-off between durability and added latency and should be set deliberately per workload — it is not a one-size global switch.

| Write concern | Meaning | Durability | Typical latency impact | |---|---|---|---| | w:0 | Fire-and-forget, no acknowledgement | None — data can be lost before disk write | Near zero | | w:1 | Primary acknowledged write | Lost if primary crashes before replication | Baseline, effectively 0 ms overhead | | w:majority | Majority of nodes have written | Survives any single node failure | +1 to 10 ms depending on replication lag | | j:true | Primary journal flushed to disk | Survives mongod process crash | Small added fsync cost |

MongoDB changed the default write concern for replica sets and sharded clusters from w:1 to w:majority in version 5.0. Teams upgrading from 4.4 noticed the latency increase because a write that previously returned after a single journal flush now waits for a replication round trip to at least one secondary. That is the correct default. The w:1 default silently loses acknowledged writes when a primary steps down before replication completes — a class of data loss that is hard to detect and painful to reconcile.

For financial transactions, order state changes, and any write the user will notice, set w:majority with j:true and always include wtimeoutMS:

const client = new MongoClient(uri, {
  writeConcern: {
    w: "majority",
    j: true,
    wtimeoutMS: 5000
  }
});

Setting wtimeoutMS is mandatory in production. Without it, a write with w:majority waits indefinitely if secondaries fall behind during maintenance or a network hiccup. Five thousand milliseconds is a reasonable starting point; tune it against your 99th-percentile replication lag plus headroom. If the timeout fires, your application receives a WriteConcernError — that write may or may not have reached the primary, so your retry logic must handle it correctly.

For high-throughput event ingestion where losing a small number of events is acceptable — telemetry streams, non-critical audit logs — w:1 is a defensible choice. Make it explicit, document the trade-off, and never let it creep into write paths where users care about the outcome.

Read preference is the other half of the durability question. primary is consistent but concentrates all read load. secondary is eventually consistent (reads may lag the primary by seconds) but offloads significant traffic for analytics queries and reporting dashboards. primaryPreferred falls back to secondaries if the primary is temporarily unreachable. For any query that tolerates a few seconds of staleness, secondaryPreferred is a meaningful load-reduction tool.

Ten independent w:majority writes each wait for a replication round trip; the same ten writes inside a transaction wait once at commit time — ten times less replication overhead.

— MongoDB Performance Best Practices

What actually happens during an election

Understanding the election sequence tells you how to design your application connection handling and how to set realistic availability expectations.

MongoDB replica set election sequence

01
Heartbeat timeout
Replica set members send heartbeats every 2 seconds. If a secondary receives no response from the primary within the electionTimeoutMillis window — 10 seconds by default — it marks the primary as inaccessible and calls for an election.
02
Candidate eligibility check
A secondary calls for an election only if it holds an optime at least as recent as any voting member it has contacted. A stale secondary cannot win, preventing a node that missed recent writes from being promoted.
03
Majority vote
On a 3-node set the candidate needs 2 votes. Each voting member checks the candidate's optime against its own and refuses to vote for a candidate that is behind. The most up-to-date eligible candidate wins.
04
Primary promotion
The winning candidate steps up and writes a no-op entry to the oplog. This re-anchors all secondaries' replication cursors to the new primary's oplog position, preventing them from diverging.
05
Application driver reconnect
Modern MongoDB drivers — Node.js 4.x, Java 4.x, Python 4.x and later — detect topology changes via the hello command and automatically re-route writes to the new primary, typically within one to two server monitoring cycles.

Source: MongoDB Docs — replica set elections

The practical failover window under default configuration is 10 to 12 seconds. During this window the replica set cannot process writes. If retryable writes are enabled in the driver (the default in all drivers released after 2018), the driver holds the write and retries it against the new primary after the election completes — the error never surfaces to the application.

// Retryable writes are on by default in modern drivers.
// Explicit form for older client configurations:
const client = new MongoClient(uri, { retryWrites: true });

Test failover before production does it for you:

// On the current primary — steps down for 30 seconds
rs.stepDown(30)

Run this in a low-traffic window with monitoring open. Verify: a secondary is elected within 15 seconds, application-level write errors are absent or transient and self-resolve, and no manual intervention is required. If errors leak to users, your driver version, connection handling, or retry logic is incomplete.

Backup strategies: match your RPO, then test the restore

Backup strategy begins with two numbers. RPO (Recovery Point Objective) is how much data you can afford to lose. RTO (Recovery Time Objective) is how long the system can be unavailable. Every backup decision — tool, frequency, storage tier — is derived from those two constraints.

Worst-case RPO by backup strategy

Daily snapshot only1440 min (24 h)

4-hourly snapshot only240 min (4 h)

1-hourly snapshot only60 min (1 h)

Snapshot + continuous oplog~5 min

Source: MongoDB Architecture Center; Atlas Backup documentation

mongodump is the most commonly used backup tool and the riskiest to rely on for production recovery at scale. It is a logical export: it reads documents one by one from the running instance and serializes them to BSON. At meaningful data volumes, restoring a multi-terabyte mongodump archive from object storage takes several hours — MongoDB's own architecture documentation explicitly notes this. More critically, mongodump without --oplog is not point-in-time consistent: different collections are captured at different moments during the dump, so a restore reflects a mixed state across the dataset.

The --oplog flag mitigates this by capturing the oplog during the dump run and replaying it on restore to converge all collections to a single consistent timestamp. It is better, but it adds complexity and extends dump duration:

# Consistent dump using oplog capture — run against a secondary
mongodump \
  --uri="mongodb://mongo2:27017" \
  --oplog \
  --gzip \
  --archive="/backups/$(date +%Y%m%d-%H%M%S).archive"

For RTO measured in minutes rather than hours, use filesystem-level snapshots. Capturing a block device snapshot takes seconds; restoring means attaching the snapshot volume to a replacement instance and starting mongod, which takes minutes. On AWS, freeze the WiredTiger cache briefly before triggering the EBS snapshot to guarantee consistency:

// Hold this lock as briefly as possible — it blocks all writes
db.adminCommand({ fsync: 1, lock: true })

# Trigger the EBS snapshot — it completes asynchronously
aws ec2 create-snapshot \
  --volume-id vol-0abc123456 \
  --description "mongo-rs0-data-$(date +%Y%m%d-%H%M%S)"

// Release immediately after the snapshot request is accepted
db.adminCommand({ fsyncUnlock: 1 })

Combine EBS snapshots (or equivalent block storage snapshots on GCP/Azure) with continuous oplog streaming to an off-cluster store and you achieve sub-10-minute RPO without Atlas-specific tooling.

common

mongodump alone

Logical export — slow at terabyte scale
Not point-in-time consistent without --oplog
Restore measured in hours for large datasets
Adequate for development, small datasets, or supplementary logical exports

production

Filesystem snapshot + oplog

Physical block-level copy — seconds to capture
Point-in-time consistent via oplog replay on restore
Restore measured in minutes: attach volume, start mongod
Atlas Continuous Cloud Backup adds RPO down to approximately 1 minute via oplog retention

Backup approach comparisonSource: MongoDB Architecture Center; Bacula Systems MongoDB Backup and Restore Guide

The single most important backup discipline is a scheduled restore test. A backup that has never been restored is an untested assumption. The test should: restore to an isolated environment that mirrors production, verify document counts and a sample of business-critical records against known-good values, measure actual RTO against your target, and produce a written runbook. Do this quarterly at minimum. The first restore test almost always surfaces a gap.

Index design pitfalls that surface under load

Missing or incorrect indexes are the most common cause of MongoDB performance degradation that appears in production but was not visible during development. The mechanism is simple: WiredTiger holds its working set in an in-memory cache (default: 50% of available RAM minus 1 GB). A collection scan that pulls documents not present in cache triggers disk I/O. At development data volumes that is fast. At production volumes — hundreds of millions of documents — it degrades under concurrent load in ways that look like infrastructure failure but are query design problems.

Before adding or trusting an index, validate the actual query plan:

db.orders.find({
  status: "pending",
  createdAt: { $gt: new Date("2026-01-01") }
}).explain("executionStats")

The fields that matter: executionStats.totalDocsExamined versus executionStats.totalDocsReturned. A ratio above 10:1 means the query is scanning far more documents than it returns — the index is wrong or missing. COLLSCAN in the winning plan means no index was used at all. Fix the index before this query reaches production scale.

For compound indexes, apply the ESR rule: Equality predicates first, then Sort fields, then Range predicates. This order determines whether MongoDB traverses a single contiguous B-tree range or must sort in memory:

// Query: pending orders sorted by createdAt, within a date range
// ESR: status (equality), createdAt (serves both sort and range)
db.orders.createIndex(
  { status: 1, createdAt: 1 },
  { name: "idx_orders_status_createdAt" }
)

Three specific pitfalls that routinely appear in production:

Multikey index bloat on unbounded arrays. A multikey index stores one entry per array element per document. A document with a 100-element array contributes 100 index entries. This is correct behaviour, but if array fields grow unbounded over time — event logs, message threads, tag lists — index size grows proportionally and cache pressure increases. Monitor db.collection.stats().indexSizes over time, not just at initial deployment.

Full indexes on sparse fields. A standard index on a field that only exists on 20% of documents still indexes null for the other 80%. A partial index with partialFilterExpression indexes only the subset that matches:

// Index only pending orders — not the millions of completed/cancelled ones
db.orders.createIndex(
  { createdAt: 1 },
  { partialFilterExpression: { status: "pending" } }
)

The resulting index is smaller, faster to scan, and uses less cache. The query planner will use it when the query filter includes the expression. This pattern is especially valuable for status-gated queues and active-record patterns.

TTL index write contention. TTL indexes delete expired documents asynchronously via the TTLMonitor background task, which runs on a configurable sleep cycle (default 60 seconds). On collections with high deletion rates, this generates a burst of write I/O every minute, contending with application writes. If you observe periodic p99 write latency spikes correlated with 60-second intervals, inspect TTL activity:

// Check for running TTL deletes
db.currentOp({ "command.q._id": { $exists: false }, ns: /yourdb/ })

Reduce contention by adjusting ttlMonitorSleepSecs at the mongod level to spread deletions more evenly, or by switching to a soft-delete pattern and running bulk cleanup during off-peak hours.

Sharding: earn the complexity before reaching for it

Sharding distributes data and write load across multiple shard servers through a routing layer (mongos). It is the right tool for write volumes and data sizes that genuinely exceed what a single optimised replica set can serve. It is the wrong tool for teams that have not yet exhausted vertical scaling, indexing improvements, and read distribution to secondaries.

The operational cost is real and non-trivial. Shard key selection is effectively irreversible without a full collection migration. The balancer adds background I/O. Mongos adds a network hop to every operation. Cross-shard aggregations require scatter-gather across all shards. A sharded cluster is a distributed system with all the failure modes and operational surface that implies — you are trading simplicity for horizontal scale.

Before sharding, confirm you have worked through:

Vertical instance right-sizing for the primary workload (more RAM reduces cache pressure; faster NVMe reduces I/O latency)
Read distribution: analytics and reporting queries running against secondaries via readPreference: secondaryPreferred
Index coverage confirmed via explain() for all production query shapes
Bottleneck verified as write throughput or raw storage capacity, not query design

When the data or write volume genuinely demands sharding, shard key selection is the single most consequential decision:

| Shard key pattern | Good for | Avoid when | |---|---|---| | Hashed _id | Even write distribution, no hotspots | Range queries on the key force scatter-gather across all shards | | Ranged time field | Time-range queries, time-series data | Monotonically increasing — all inserts land on the newest shard (write hotspot) | | Compound ranged | Balanced range locality and cardinality | Complex; validate distribution under real write patterns before enabling the balancer | | Hashed user ID | Per-user datasets with even fan-out | Users with disproportionately large document sets create shard imbalance |

A monotonically increasing shard key — ObjectId-based _id, a timestamp field, an auto-incremented integer — routes all inserts to the newest chunk on the newest shard. The balancer migrates chunks away, but under sustained insert load it cannot keep pace. The result is a persistent write hotspot: one shard takes all writes while others are underutilised. Hash the key when even insert distribution matters more than range query locality:

sh.shardCollection("mydb.events", { userId: "hashed" })

Operational signals worth monitoring

A MongoDB deployment that is not monitored is not operated. The following metrics are the minimum baseline for a production replica set.

Replication lag. The time difference between the primary's latest oplog entry timestamp and the secondary's latest applied entry. Lag above a few seconds can indicate a secondary under I/O pressure or a network degradation. Sustained lag means your delayed secondary's undo window is shrinking toward zero:

rs.printSecondaryReplicationInfo()
// Example output:
// source: mongo2:27017
//   syncedTo: Tue Feb 10 2026 14:22:01 GMT+0000
//   0 secs (0 hrs) behind the freshest member

Oplog window. The oplog is a capped collection — it retains only the most recent N megabytes of operations. If a secondary's replication cursor falls off the end of the oplog (because it was down longer than the window covers), it enters RECOVERING state and requires a full initial sync, which is expensive on large datasets. The oplog window must exceed any realistic secondary outage duration:

rs.printReplicationInfo()
// Output includes:
//   configured oplog size: 10240 MB
//   log length start to end: 86403 secs (24 hrs)

Set oplog size at deployment time based on your write rate. Oplog size can be resized live via replSetResizeOplog, but sizing it correctly initially avoids the operational disruption. For write-heavy clusters, target a minimum of 72 hours of oplog retention. A secondary doing a rolling restart or maintenance window needs that buffer.

WiredTiger cache eviction pressure. When the working set exceeds the in-memory cache, WiredTiger evicts pages to disk. Evictions driven by application threads — rather than background eviction threads — are the signal that the cache is overwhelmed, and they manifest directly as write and read latency spikes:

const wt = db.serverStatus().wiredTiger.cache;
const usedPct = (
  wt["bytes currently in the cache"] /
  wt["maximum bytes configured"]
) * 100;
 
// Alert threshold: above 80%
// Critical threshold: above 95%
// "pages evicted by application threads" > 0 means immediate pressure

Slow query log. Enable profiling at the database level with a threshold appropriate for your workload:

db.setProfilingLevel(1, { slowms: 100 })

Review system.profile regularly. A sudden appearance of COLLSCAN on a query that previously used an index is not random — it usually means a plan cache invalidation after a data distribution change, a new query shape bypassing the compound index, or a schema migration that dropped the indexed field. The slow query log is almost always the first signal of an upcoming incident.

Connection count. Application-side connection pools that are misconfigured create thousands of idle connections, each consuming server memory and file descriptors. Monitor db.serverStatus().connections.current and set explicit pool sizes in your driver configuration — do not rely on driver defaults at production scale:

const client = new MongoClient(uri, {
  maxPoolSize: 50,
  minPoolSize: 5,
  serverSelectionTimeoutMS: 5000
});

What to remember

Run a 3-node PSS replica set as your minimum production unit — spread across three availability zones; a standalone instance is not a database strategy.
Set w:majority with j:true and wtimeoutMS on all write paths where data loss is unacceptable; document any deliberate w:1 exceptions explicitly.
Test failover before production does: rs.stepDown() in a low-traffic window, confirm retryable writes absorb the election window, verify no user-visible errors surface.
Match backup strategy to your RPO: mongodump is adequate only for small datasets or logical exports; filesystem snapshots plus oplog streaming for everything else; Atlas Continuous Cloud Backup for RPO near 1 minute.
Restore your backup on a schedule — quarterly minimum. A backup that has never been restored is an untested assumption, not a recovery plan.
Validate compound indexes with explain(executionStats); apply ESR ordering; use partial indexes for sparse query predicates to keep index size proportional to actual selectivity.
Do not shard until vertical scaling, read distribution to secondaries, and index coverage are all exhausted — the shard key choice is effectively irreversible without a full migration.
Monitor replication lag, oplog window size, WiredTiger cache eviction pressure, and connection count — these four signals cover the majority of production failure modes before they become incidents.