Channel: StackExchange Replication Questions

MongoDB secondary completely unable to keep up

We have set up a three-member replica set consisting of the following, all on MongoDB version 3.4:

  • Primary. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
  • Secondary. Amazon EC2, Windows Server 2016, m4.2xlarge, 32 GB RAM, 8 vCPUs. Hosted in Germany.
  • Arbiter. Tiny cloud based Linux instance.

The problem we are seeing is that the secondary is unable to keep up with the primary. After we seed it with data (copied over from the primary) and add it to the replica set, it typically manages to get in sync, but an hour later it might lag behind by 10 minutes; a few hours later it is an hour behind, and so on, until a day or two later it goes stale.
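For anyone who wants to track the lag described above over time, here is a minimal sketch that derives it from a document shaped like the output of rs.status(). The helper name replicationLagSeconds and the sample timestamps are made up for illustration; only the stateStr, name and optimeDate fields of rs.status() members are assumed.

```javascript
// Hypothetical helper: given a document shaped like rs.status() output,
// return how many seconds each SECONDARY is behind the PRIMARY.
function replicationLagSeconds(status) {
  var primary = null;
  status.members.forEach(function (m) {
    if (m.stateStr === "PRIMARY") {
      primary = m;
    }
  });
  if (primary === null) {
    throw new Error("no PRIMARY found in replica set status");
  }
  return status.members
    .filter(function (m) { return m.stateStr === "SECONDARY"; })
    .map(function (m) {
      return {
        name: m.name,
        // optimeDate is the wall-clock time of the last applied oplog entry
        lagSeconds: (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000
      };
    });
}

// Illustrative sample only: the timestamps are invented so that the
// secondary is exactly 10 minutes behind. The arbiter is ignored.
var sampleStatus = {
  members: [
    { name: "primary.XXX.com:27017", stateStr: "PRIMARY",
      optimeDate: new Date("2017-03-27T14:41:54Z") },
    { name: "secondary.XXX.com:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2017-03-27T14:31:54Z") },
    { name: "arbiter", stateStr: "ARBITER" }
  ]
};

var sampleLag = replicationLagSeconds(sampleStatus);
// sampleLag[0].lagSeconds is 600 for the sample above
```

In the mongo shell the same function could be called as replicationLagSeconds(rs.status()) and logged periodically to see how fast the lag grows.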

We are trying to figure out why this is. The primary is consistently using 0-1% CPU, while the secondary is under constant heavy load at 20-80% CPU. This seems to be the only potential resource constraint; disk and network load do not seem to be an issue. There also seems to be lock contention on the secondary, as operations in the mongo shell (such as db.getReplicationInfo()) frequently take 5 minutes or more to complete, and mongostat rarely works (it just says i/o timeout). Here is output from mongostat during a rare instance when it reported stats for the secondary:

                   host insert query update delete getmore command dirty  used flushes vsize   res qrw  arw net_in net_out conn set repl                time
        localhost:27017     *0    33    743     *0       0   166|0  1.0% 78.7%       0 27.9G 27.0G 0|0  0|1  2.33m    337k  739  rs  PRI Mar 27 14:41:54.578
  primary.XXX.com:27017     *0    36    825     *0       0   131|0  1.0% 78.7%       0 27.9G 27.0G 0|0  0|0  1.73m    322k  739  rs  PRI Mar 27 14:41:53.614
secondary.XXX.com:27017     *0    *0     *0     *0       0   109|0  4.3% 80.0%       0 8.69G 7.54G 0|0 0|10  6.69k    134k  592  rs  SEC Mar 27 14:41:53.673

I ran db.serverStatus() on the secondary and compared the output with the primary's. One number that stood out was the following:

"locks" : {"Global" : {"timeAcquiringMicros" : {"r" : NumberLong("21188001783")

The secondary had an uptime of 14000 seconds at the time.
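To put those two numbers side by side, here is a quick back-of-the-envelope calculation. The function name lockWaitRatio is made up, and the interpretation assumes the counter is cumulative across all threads since startup (which is how I understand the serverStatus lock counters to work).

```javascript
// Hypothetical helper: compare the cumulative time spent waiting for a lock
// (in microseconds) with the server uptime (in seconds).
function lockWaitRatio(timeAcquiringMicros, uptimeSeconds) {
  var waitSeconds = timeAcquiringMicros / 1e6;
  return {
    waitSeconds: waitSeconds,
    // A ratio above 1 means more than one thread, on average,
    // was waiting for the lock at any given moment.
    ratio: waitSeconds / uptimeSeconds
  };
}

// Values from the serverStatus() output and uptime quoted above,
// passed as plain numbers rather than NumberLong.
var globalReadWait = lockWaitRatio(21188001783, 14000);
// globalReadWait.waitSeconds is about 21188 and globalReadWait.ratio is about 1.51
```

So under that assumption, readers spent roughly 21,188 seconds in total waiting for the Global lock during 14,000 seconds of uptime, about 1.5 waiting threads on average.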

Would appreciate any ideas on what this could be, or how to debug this issue! We could upgrade the Amazon instance to something beefier, but we've done that three times already, and at this point we figure that something else must be wrong.

I'll include output from db.currentOp() on the secondary below, in case it helps. (That command took 5 minutes to run, after which the following was logged: Restarting oplog query due to error: CursorNotFound: Cursor not found, cursor id: 15728290121. Last fetched optime (with hash): { ts: Timestamp 1490613628000|756, t: 48 }[-5363878314895774690]. Restarts remaining: 3)

"desc":"conn605",
"connectionId":605,"client":"127.0.0.1:61098",
"appName":"MongoDB Shell",
"secs_running":0,
"microsecs_running":NumberLong(16),
"op":"command",
"ns":"admin.$cmd",
"query":{"currentOp":1},
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"repl writer worker 10",
"secs_running":0,
"microsecs_running":NumberLong(14046),
"op":"none",
"ns":"CustomerDB.ed2112ec779f",
"locks":{"Global":"W","Database":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"w":NumberLong(1),"W":NumberLong(1)}},"Database":{"acquireCount":{"W":NumberLong(1)}}}

"desc":"ApplyBatchFinalizerForJournal",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"ReplBatcher",
"secs_running":11545,
"microsecs_running":NumberLong("11545663961"),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(2)}},"Database":{"acquireCount":{"r":NumberLong(1)}},"oplog":{"acquireCount":{"r":NumberLong(1)}}}

"desc":"rsBackgroundSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545281690"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(5),"w":NumberLong(1)}},"Database":{"acquireCount":{"r":NumberLong(2),"W":NumberLong(1)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}

"desc":"TTLMonitor",
"op":"none",
"ns":"",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(35)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(341534123)}},"Database":{"acquireCount":{"r":NumberLong(17)}},"Collection":{"acquireCount":{"r":NumberLong(17)}}}

"desc":"SyncSourceFeedback",
"op":"none",
"ns":"",
"locks":{},
"waitingForLock":false,
"lockStats":{}

"desc":"WT RecordStoreThread: local.oplog.rs",
"secs_running":1163,
"microsecs_running":NumberLong(1163137036),
"op":"none",
"ns":"local.oplog.rs",
"locks":{},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(1),"w":NumberLong(1)}},"Database":{"acquireCount":{"w":NumberLong(1)}},"oplog":{"acquireCount":{"w":NumberLong(1)}}}

"desc":"rsSync",
"secs_running":11545,
"microsecs_running":NumberLong("11545663926"),
"op":"none",
"ns":"local.replset.minvalid",
"locks":{"Global":"W"},
"waitingForLock":false,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(272095),"w":NumberLong(298255),"R":NumberLong(1),"W":NumberLong(74564)},"acquireWaitCount":{"W":NumberLong(3293)},"timeAcquiringMicros":{"W":NumberLong(17685)}},"Database":{"acquireCount":{"r":NumberLong(197529),"W":NumberLong(298255)},"acquireWaitCount":{"W":NumberLong(146)},"timeAcquiringMicros":{"W":NumberLong(651947)}},"Collection":{"acquireCount":{"r":NumberLong(2)}}}

"desc":"clientcursormon",
"secs_running":0,
"microsecs_running":NumberLong(15649),
"op":"none",
"ns":"CustomerDB.b72ac80177ef",
"locks":{"Global":"r"},
"waitingForLock":true,
"lockStats":{"Global":{"acquireCount":{"r":NumberLong(387)},"acquireWaitCount":{"r":NumberLong(2)},"timeAcquiringMicros":{"r":NumberLong(397538606)}},"Database":{"acquireCount":{"r":NumberLong(193)}},"Collection":{"acquireCount":{"r":NumberLong(193)}}}}],"ok":1}
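Since the full listing is hard to scan, here is a rough sketch for boiling it down to who holds the Global lock exclusively and who is waiting. The function name summarizeGlobalLock is made up, and the sample array is a hand-trimmed copy of a few entries from the output above, keeping only the desc, locks and waitingForLock fields.

```javascript
// Hypothetical helper: given the inprog array from db.currentOp(),
// list threads holding the Global lock in exclusive ("W") mode and
// threads that are waiting for any mode of the Global lock.
function summarizeGlobalLock(inprog) {
  var exclusiveHolders = [];
  var waiters = [];
  inprog.forEach(function (op) {
    var mode = op.locks && op.locks.Global;
    if (!mode) {
      return; // this thread is not involved with the Global lock
    }
    if (op.waitingForLock) {
      waiters.push(op.desc + " (" + mode + ")");
    } else if (mode === "W") {
      exclusiveHolders.push(op.desc);
    }
  });
  return { exclusiveHolders: exclusiveHolders, waiters: waiters };
}

// Trimmed copy of entries from the currentOp output above.
var sampleInprog = [
  { desc: "repl writer worker 10", locks: { Global: "W", Database: "W" }, waitingForLock: false },
  { desc: "ReplBatcher", locks: {}, waitingForLock: false },
  { desc: "TTLMonitor", locks: { Global: "r" }, waitingForLock: true },
  { desc: "rsSync", locks: { Global: "W" }, waitingForLock: false },
  { desc: "clientcursormon", locks: { Global: "r" }, waitingForLock: true }
];

var lockSummary = summarizeGlobalLock(sampleInprog);
// lockSummary.exclusiveHolders: repl writer worker 10, rsSync
// lockSummary.waiters: TTLMonitor (r), clientcursormon (r)
```

In the mongo shell this could be run as summarizeGlobalLock(db.currentOp(true).inprog). On the sample, the replication threads hold the Global lock in W mode while the readers (TTLMonitor, clientcursormon) wait, which matches the slow shell commands described earlier.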
