One member of my MongoDB replica set refuses to restart, failing with the following error (reformatted for readability):
Starting rollback due to OplogStartMissing:
our last op time fetched: (term: 30, timestamp: Jul 28 07:45:11:6)
source's GTE: (term: 31, timestamp: Jul 28 07:45:11:7)
Fatal assertion 18750 UnrecoverableRollbackError
(term: 31, timestamp: Jul 28 07:45:12:2) > our last optime:
(term: 30, timestamp: Jul 28 07:45:11:6)
Let's call the instance where this happens M1, and the source it's trying to sync from M2. M1 used to be primary; then the primary switched to M2, and M1 restarted.
The naive interpretation of these log messages is that the first operation offered from M2's oplog (07:45:11:7) is exactly the next operation after the last one we have on M1 (07:45:11:6). So M1 should just happily apply operations from M2. Instead, MongoDB tries to roll back some operations, finds an operation that is in the future relative to both what M1 has applied and what comes next on M2, and dies.
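To make the three positions from the log concrete, here is a small Python sketch. It assumes an optime can be modelled as a (term, timestamp) pair that orders by term first and timestamp second. The `OpTime` class, its field names, and the encoding of the timestamp as seconds past 07:45:00 are my own illustration, not MongoDB's API.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Illustrative stand-in for an oplog position; not a MongoDB type."""
    term: int       # election term
    second: int     # seconds past Jul 28 07:45:00 (assumed encoding)
    increment: int  # counter within that second


ours = OpTime(term=30, second=11, increment=6)        # M1: last op time fetched
source_gte = OpTime(term=31, second=11, increment=7)  # M2: "source's GTE"
fatal = OpTime(term=31, second=12, increment=2)       # op named in the fatal assertion

# Named tuples compare field by field, so this orders by term, then timestamp.
ordered = sorted([fatal, source_gte, ours])
```

Under this model the order is `ours`, then `source_gte`, then `fatal`, which is what makes the rollback surprising to me: nothing on M1 looks newer than what M2 offers, yet the first two positions carry different terms.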
I have two questions: first, why is MongoDB trying to roll back in the first place, and second, where is the operation with timestamp Jul 28 07:45:12:2 coming from?