I have a MongoDB 3.0.11 replica set with two data-bearing members and one arbiter. Recently the primary had networking issues, but the secondary did not become primary because of "member is more than 10 seconds behind the most up-to-date member (mask 0xA)". However, MMS monitoring shows replication lag never exceeding 3 seconds. The database handles about 100 updates per second and around 400 connections, with no page faults. What is wrong here?
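To cross-check the MMS figure, the lag can also be computed by hand from the output of rs.status(). The following is only a minimal sketch: the helper name replicationLagSeconds is made up for illustration, and it assumes each data-bearing member entry carries the name, stateStr and optimeDate fields that rs.status() prints on this version.

```javascript
// Sketch: per-member replication lag in seconds, computed from an
// rs.status()-shaped document. replicationLagSeconds is a hypothetical
// helper name, not part of the mongo shell.
function replicationLagSeconds(status) {
  // Arbiters hold no data, so skip them and any member without an optimeDate.
  var dataMembers = status.members.filter(function (m) {
    return m.stateStr !== "ARBITER" && m.optimeDate;
  });
  // Timestamp (ms) of the most recent operation applied by any data member.
  var newest = Math.max.apply(null, dataMembers.map(function (m) {
    return m.optimeDate.getTime();
  }));
  var lag = {};
  dataMembers.forEach(function (m) {
    lag[m.name] = (newest - m.optimeDate.getTime()) / 1000;
  });
  return lag;
}
```

In the mongo shell this could be called as printjson(replicationLagSeconds(rs.status())). The values for the other members come from heartbeat data as seen by the member the shell is connected to, so db-prod-1 and db-prod-2 may report different numbers at the same moment.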
The replica set configuration:
mssp:PRIMARY> rs.config()
{
    "_id" : "mssp",
    "version" : 44459,
    "members" : [
        {
            "_id" : 0,
            "host" : "db-prod-1:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 10,
            "tags" : {
            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "db-prod-2:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
            },
            "slaveDelay" : 0,
            "votes" : 1
        },
        {
            "_id" : 2,
            "host" : "db-prod-arbiter-1:27017",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
            },
            "slaveDelay" : 0,
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatTimeoutSecs" : 10,
        "getLastErrorModes" : {
        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "wtimeout" : 0
        }
    }
}
The logs of each member:
db-prod-1
2016-07-22T14:35:53.841+0000 I NETWORK [ReplExecNetThread-10] getaddrinfo("db-prod-arbiter-1") failed: Temporary failure in name resolution
2016-07-22T14:35:53.841+0000 I REPL [ReplicationExecutor] Error in heartbeat request to db-prod-arbiter-1:27017; Location18915 Failed attempt to connect to db-prod-arbiter-1:27017; couldn't initialize connection to host db-prod-arbiter-1, address is invalid
2016-07-22T14:35:53.942+0000 I NETWORK [ReplExecNetThread-11] getaddrinfo("db-prod-2") failed: Temporary failure in name resolution
2016-07-22T14:35:53.942+0000 I REPL [ReplicationExecutor] Error in heartbeat request to db-prod-2:27017; Location18915 Failed attempt to connect to db-prod-2:27017; couldn't initialize connection to host db-prod-2, address is invalid
2016-07-22T14:35:53.942+0000 I REPL [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2016-07-22T14:35:53.942+0000 I REPL [ReplicationExecutor] Stepping down from primary in response to heartbeat
2016-07-22T14:35:53.943+0000 I REPL [replCallbackWithGlobalLock-0] transition to SECONDARY
db-prod-2
2016-07-22T14:35:53.952+0000 E REPL [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: db-prod-1:27017
2016-07-22T14:35:53.952+0000 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for db-prod-1:27017 (10.240.0.2) failed) in destructor (kill)
2016-07-22T14:35:53.952+0000 I REPL [ReplicationExecutor] could not find member to sync from
2016-07-22T14:35:54.047+0000 I REPL [ReplicationExecutor] Member db-prod-1:27017 is now in state SECONDARY
2016-07-22T14:35:54.047+0000 I REPL [ReplicationExecutor] Standing for election
2016-07-22T14:35:54.048+0000 I REPL [ReplicationExecutor] not electing self, db-prod-1:27017 would veto with 'I don't think db-prod-2:27017 is electable because the member is not currently a secondary; member is more than 10 seconds behind the most up-to-date member (mask 0xA)'
2016-07-22T14:35:54.048+0000 I REPL [ReplicationExecutor] not electing self, we are not freshest
db-prod-arbiter-1
2016-07-22T14:35:55.639+0000 I REPL [ReplicationExecutor] Member db-prod-1:27017 is now in state SECONDARY
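The "(mask 0xA)" in the veto message on db-prod-2 looks like a bit field with one bit per reason. That is an assumption on my part, based only on the message listing two reasons and 0xA having two bits set; the log does not say which bit corresponds to which reason. A small sketch that lists the set bits (setBits is a hypothetical helper name):

```javascript
// Sketch: return the positions of the bits that are set in a mask.
// Assumes the mask fits in 32 bits; setBits is a made-up helper name.
function setBits(mask) {
  var bits = [];
  for (var i = 0; i < 32; i++) {
    // Unsigned shift so the top bit is handled like any other.
    if ((mask >>> i) & 1) {
      bits.push(i);
    }
  }
  return bits;
}
```

Calling setBits(0xA) returns [1, 3], that is 2 + 8, which matches the two reasons in the message: "not currently a secondary" and "more than 10 seconds behind the most up-to-date member".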