I'm trying to add a new 3.4 (RocksDB engine) member to a ~2 TB replica set currently consisting of 3.2 (WiredTiger engine) members and running into problems with the initial sync. Because of the different storage engine, the common suggestion of using a filesystem snapshot to bootstrap the new member won't work.
Replication starts fine, but eventually runs into InvalidSyncSource errors. Strangely enough, the log entries appear to claim that the OpTime of the primary server is more than 30 seconds behind that of the secondary, which sounds impossible to me. I have confirmed that the primary server remained in state PRIMARY throughout the entire operation, and that the clocks on all hosts are in sync.
First indication that something's not right (initial sync source is db-primary.xxxxxxxxxxxxxxx:27017):
2017-04-28T10:58:52.183+0200 I REPL [replication-5] re-evaluating sync source because our current sync source's most recent OpTime is { ts: Timestamp 1493369879000|2, t: -1 } which is more than 30s behind member db-secondary.xxxxxxxxxxxxxxx:27017 whose most recent OpTime is { ts: Timestamp 1493369931000|2, t: -1 }
2017-04-28T10:58:52.183+0200 I REPL [replication-5] Finished fetching oplog during initial sync: InvalidSyncSource: sync source db-primary.xxxxxxxxxxxxxxx:27017 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid. Last fetched optime and hash: { ts: Timestamp 1493369932000|1, t: -1 }[3606797002863800866]
2017-04-28T10:58:52.183+0200 E REPL [replication-5] Error fetching oplog during initial sync: InvalidSyncSource: sync source db-primary.xxxxxxxxxxxxxxx:27017 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
I'm not sure if the "Timestamp 0|0" in the 2nd & 3rd lines here are meaningful but they also look strange to me. Mongo happily continues to sync until it has copied all databases, after which this happens:
2017-04-29T01:16:52.317+0200 E REPL [initial sync-0] Initial sync attempt failed -- attempts left: 9 cause: InvalidSyncSource: sync source db-primary.xxxxxxxxxxxxxxx:27017 (last optime: { ts: Timestamp 0|0, t: -1 }; sync source index: -1; primary index: -1) is no longer valid
2017-04-29T01:16:53.379+0200 I REPL [initial sync-0] sync source candidate: db-primary.xxxxxxxxxxxxxxx:27017
What could be causing the initial sync to fail in this way? The only thing I can come up with is that an attempt to read the OpTime from the primary hangs for quite a while and returns late enough to be compared to a much more recent reading from the secondary, but I would think/hope that Mongo would account for that. In addition, there is a reliable 1 Gbit network link between all hosts involved, with no indication that anything went wrong on the network level.
Regardless of possible explanations, can anyone suggest an approach that might prevent this issue from occurring?
Note that I have succesfully performed this operation on a much smaller (~300 GB) replica set, but that might have just been lucky if the problem occurs randomly.