So I encountered the following issue recently:
I have a 5-member replica set with the following priorities (a sketch of the corresponding configuration follows the list):
- 1 x primary (2)
- 2 x secondary (0.5)
- 1 x hidden backup (0)
- 1 x arbiter (0)
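For reference, here is a sketch of what the configuration looks like (hostnames are placeholders, not the real ones):

```
// Sketch of the replica set configuration; priorities match the list above
{
  _id: "rs0",
  version: 1,
  members: [
    { _id: 0, host: "hostA:27017", priority: 2 },               // A: primary
    { _id: 1, host: "hostB:27017", priority: 0.5 },             // B: secondary
    { _id: 2, host: "hostC:27017", priority: 0.5 },             // secondary
    { _id: 3, host: "hostD:27017", priority: 0, hidden: true }, // hidden backup
    { _id: 4, host: "hostE:27017", arbiterOnly: true }          // arbiter
  ]
}
```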
One of the 0.5-priority secondaries (let's call it B) ran into a network issue and had only intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
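For anyone digging in, this is roughly how I compared each member's term and optime while diagnosing, using the standard rs.status() / replSetGetStatus fields (MongoDB 3.2+ with protocolVersion 1, which the terms in the logs imply):

```
// Print the set's current term and each member's state and optime
var s = rs.status();
print("current term: " + s.term);
s.members.forEach(function (m) {
  print(m.name + "  state=" + m.stateStr + "  optime=" + tojson(m.optime));
});
```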
Meanwhile on A, which had no connectivity issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So Question 1 is: how was this election possible under these circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This caused data that had already been written to be removed. So Question 2 is: how does an oplog start point go missing (i.e., what triggers the OplogStartMissing above)?
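My understanding (please correct me if this is wrong) is that only writes which never replicated to a majority are eligible for rollback, so anything acknowledged at the default w: 1 is at risk. For comparison, a sketch of a majority-acknowledged write (collection and document are made up):

```
// Sketch: a write that is only acknowledged once a majority of
// voting members have replicated it, so it cannot be rolled back
db.mycollection.insert(
  { _id: 1, payload: "important" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);
```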
Last but not least, Question 3: how can this be prevented?
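The only knob I have found so far is settings.electionTimeoutMillis, which should make brief network blips less likely to trigger an election; I am not sure it is the right fix, but for reference, a sketch (run against the primary):

```
// Sketch: raise the election timeout (default 10000 ms) so short
// connectivity blips are less likely to start a new election
cfg = rs.conf();
cfg.settings = cfg.settings || {};
cfg.settings.electionTimeoutMillis = 30000;
rs.reconfig(cfg);
```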
Thank you in advance!