
Mongo replicated shard member not able to recover, stuck in STARTUP2 mode


I have the following setup for a sharded replica set in an Amazon VPC:

mongo1: 8 GB RAM, dual-core (Primary)

mongo2: 8 GB RAM, dual-core (Secondary)

mongo3: 4 GB RAM (Arbiter)

Mongo1 runs the primary member of each shard's replica set, in a two-shard setup:

 mongod --port 27000 --dbpath /mongo/config --configsvr

 mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1

 mongod --port 27002 --dbpath /mongo/shard2 --shardsvr --replSet rssh2

Mongo2 runs the secondary members and mirrors mongo1 exactly:

 mongod --port 27000 --dbpath /mongo/config --configsvr

 mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1   # Faulty process

 mongod --port 27002 --dbpath /mongo/shard2 --shardsvr --replSet rssh2
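
For context, the two shards would have been added to the cluster through a mongos roughly like this (a sketch; the exact invocations weren't recorded, and the hostnames/set names are the ones from the setup above):

    mongos> sh.addShard("rssh1/mongo1:27001,mongo2:27001")
    mongos> sh.addShard("rssh2/mongo1:27002,mongo2:27002")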

Then, for some reason, the 27001 process on mongo2 crashed last week with an out-of-memory error (cause unknown). By the time I discovered the issue (the application kept working, reading from the primary) and restarted the 27001 process, it had fallen too far behind to catch up with shard1 on mongo1. So I followed 10gen's recommended resync procedure:

  • emptied the directory /mongo/shard1
  • restarted the 27001 process with the command

    mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1
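
In full, the sequence I ran on mongo2 looked roughly like this (a sketch; the kill/mv commands are my own way of carrying out the documented steps, and moving the old directory aside rather than deleting it is just a precaution):

    # on mongo2: stop the stale shard1 member, clear its data directory,
    # then restart it so it performs a fresh initial sync from the primary
    kill $(pgrep -f "mongod --port 27001")   # stop the faulty member
    mv /mongo/shard1 /mongo/shard1.old       # set the old files aside (precaution)
    mkdir /mongo/shard1
    mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1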
    

However, it has now been more than 24 hours and the node is still in STARTUP2 state. There is about 200 GB of data in shard1, and roughly 160 GB of it has made it over to /mongo/shard1 on mongo2. Below is the replica set status output (run on mongo2):

rssh1:STARTUP2> rs.status()
{
     "set" : "rssh1",
     "date" : ISODate("2012-10-29T19:28:49Z"),
     "myState" : 5,
     "syncingTo" : "mongo1:27001",
     "members" : [
          {
               "_id" : 1,
               "name" : "mongo1:27001",
               "health" : 1,
               "state" : 1,
               "stateStr" : "PRIMARY",
               "uptime" : 99508,
               "optime" : Timestamp(1351538896000, 3),
               "optimeDate" : ISODate("2012-10-29T19:28:16Z"),
               "lastHeartbeat" : ISODate("2012-10-29T19:28:48Z"),
               "pingMs" : 0
          },
          {
               "_id" : 2,
               "name" : "mongo2:27001",
               "health" : 1,
               "state" : 5,
               "stateStr" : "STARTUP2",
               "uptime" : 99598,
               "optime" : Timestamp(1351442134000, 1),
               "optimeDate" : ISODate("2012-10-28T16:35:34Z"),
               "self" : true
          },
          {
               "_id" : 3,  
               "name" : "mongoa:27901",
               "health" : 1,
               "state" : 7,
               "stateStr" : "ARBITER",
               "uptime" : 99508,
               "lastHeartbeat" : ISODate("2012-10-29T19:28:48Z"),
               "pingMs" : 0
          }
     ],
     "ok" : 1
}

rssh1:STARTUP2> 
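
The optime gap between the primary and this member works out to roughly 27 hours. A quick way to compute it in the mongo shell (a sketch built only on rs.status(); the stock helper rs.printSlaveReplicationInfo() prints a similar summary):

    // run on mongo2; computes the replication lag from rs.status()
    var s  = rs.status();
    var p  = s.members.filter(function (m) { return m.stateStr == "PRIMARY"; })[0];
    var me = s.members.filter(function (m) { return m.self; })[0];
    print("lag in hours: " + (p.optimeDate - me.optimeDate) / 3600000);
    // with the optimes above this prints roughly 26.9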

It would appear that most of the data from the primary was replicated, but not all of it. The logs show some errors, but I don't know if they are related:

Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:config.system.indexes query:{ expireAfterSeconds: { $exists: true } }

Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over config.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }

Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: config 10065 invalid parameter: expected an object ()

Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:gf2.system.indexes query:{ expireAfterSeconds: { $exists: true } }

Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over gf2.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }

Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: gf2 10065 invalid parameter: expected an object ()

Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:kombu_default.system.indexes query:{ expireAfterSeconds: { $exists: true } }

Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over kombu_default.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }

Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: kombu_default 10065 invalid parameter: expected an object ()
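
The 13436 assertion itself only says that the node is neither primary nor secondary, which matches the STARTUP2 state; that state can be confirmed directly from the shell:

    // run on mongo2; myState 5 == STARTUP2, matching the rs.status() output above
    db.adminCommand({ replSetGetStatus: 1 }).myState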

Everything on the primary appeared to be fine, with no errors in its log.

I have tried these steps twice, once with the mongo config server running and once with it down; the result was the same both times.

This is a production setup and I really need to get the replica set back up and working. Any help is much appreciated.

