We have set up a three-member replica set, all members running MongoDB 3.4, consisting of the following (a trimmed rs.conf() sketch follows the list):
- Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
- Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
- Arbiter. Tiny cloud-based Linux instance.
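For reference, the members array of rs.conf() looks roughly like this (trimmed; hostnames anonymized, and the arbiter hostname is a placeholder):
"members" : [
    { "_id" : 0, "host" : "primary.XXX.com:27017" },
    { "_id" : 1, "host" : "secondary.XXX.com:27017" },
    { "_id" : 2, "host" : "arbiter.XXX.com:27017", "arbiterOnly" : true }
]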
The problem we are seeing is that the secondary is unable to keep up with the primary. After we seed it with data (copied over from the primary) and add it to the replica set, it typically manages to get in sync, but an hour later it might lag behind by 10 minutes; a few hours later it's an hour behind, and so on, until a day or two later it goes stale.
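For context, we track the lag with the standard shell helpers, run against the primary, roughly like this:
// Per-member "N secs (H hrs) behind the primary" summary.
rs.printSlaveReplicationInfo()
// Member state and last applied optime, as a cross-check.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optime: " + m.optimeDate);
})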
We are trying to figure out why. The primary consistently sits at 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU; this seems to be the only potential resource constraint, since disk and network load do not appear to be an issue. There also seems to be some locking going on on the secondary: operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works against it (it just says i/o timeout). Here is the output from mongostat during a rare instance when it reported stats for the secondary:
host insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
localhost:27017 *0 33 743 *0 0 166|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|1 2.33m 337k 739 rs PRI Mar 27 14:41:54.578
primary.XXX.com:27017 *0 36 825 *0 0 131|0 1.0% 78.7% 0 27.9G 27.0G 0|0 0|0 1.73m 322k 739 rs PRI Mar 27 14:41:53.614
secondary.XXX.com:27017 *0 *0 *0 *0 0 109|0 4.3% 80.0% 0 8.69G 7.54G 0|0 0|10 6.69k 134k 592 rs SEC Mar 27 14:41:53.673
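(For what it's worth, those stats were captured roughly like the following, run on the primary itself, which is why it shows up both as localhost and under its hostname; the polling interval may have differed:)
mongostat --host localhost:27017 --discover 5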
I ran db.serverStatus() on the secondary and compared it to the primary, and one number that stood out was the following:
"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")
That is roughly 21,188 seconds spent waiting to acquire the global read lock, while the secondary's uptime at the time was only about 14,000 seconds.
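For reference, this is roughly how we pulled those two numbers out of the shell on the secondary (field names as they appear in our 3.4 db.serverStatus() output):
var s = db.serverStatus();
// Global read-lock acquisition wait is reported in microseconds; uptime is in seconds.
printjson({ globalLockWaitMicros_r: s.locks.Global.timeAcquiringMicros.r, uptimeSeconds: s.uptime });
// With the wait time exceeding the uptime, more than one thread was, on average,
// queued on the global lock at any given moment.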
Would appreciate any ideas on what this could be, or how to debug this issue! We could upgrade the Amazon instance to something beefier, but we've done that three times already, and at this point we figure that something else must be wrong.
I'll include output from db.currentOp() on the secondary below, in case it helps. (That command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)
"desc":"conn605", "connectionId":605,"client":"127.0.0.1:61098", "appName":"MongoDB Shell", "secs_running":0, "microsecs_running":NumberLong(16), "op":"command", "ns":"admin.$cmd", "query":{"currentOp":1}, "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"repl writer worker 10", "secs_running":0, "microsecs_running":NumberLong(14046), "op":"none", "ns":"CustomerDB.ed2112ec779f", "locks":{"Global":"W","Database":"W"}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}} "desc":"ApplyBatchFinalizerForJournal", "op":"none", "ns":"", "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"ReplBatcher", "secs_running":11545, "microsecs_running":NumberLong("11545663961"), "op":"none", "ns":"local.oplog.rs", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}} "desc":"rsBackgroundSync", "secs_running":11545, "microsecs_running":NumberLong("11545281690"), "op":"none", "ns":"local.replset.minvalid", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}} "desc":"TTLMonitor", "op":"none", "ns":"", "locks":{"Global":"r"}, "waitingForLock":true, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}} "desc":"SyncSourceFeedback", "op":"none", "ns":"", "locks":{}, "waitingForLock":false, "lockStats":{} "desc":"WT RecordStoreThread: local.oplog.rs", "secs_running":1163, "microsecs_running":NumberLong(1163137036), "op":"none", "ns":"local.oplog.rs", "locks":{}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}} "desc":"rsSync", "secs_running":11545, "microsecs_running":NumberLong("11545663926"), "op":"none", "ns":"local.replset.minvalid", "locks":{"Global":"W"}, "waitingForLock":false, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}} "desc":"clientcursormon", "secs_running":0, "microsecs_running":NumberLong(15649), "op":"none", "ns":"CustomerDB.b72ac80177ef", "locks":{"Global":"r"}, "waitingForLock":true, "lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}