I have the following setup for a sharded replica set in Amazon VPC:
mongo1: 8G RAM, dual core (Primary)
mongo2: 8G RAM, dual core (Secondary)
mongo3: 4G RAM (Arbiter)
Mongo1 runs the primary member of each shard's replica set, in a two-shard setup:
mongod --port 27000 --dbpath /mongo/config --configsvr
mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1
mongod --port 27002 --dbpath /mongo/shard2 --shardsvr --replSet rssh2
Mongo2 runs the secondary members and mirrors mongo1 exactly:
mongod --port 27000 --dbpath /mongo/config --configsvr
mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1 # Faulty process
mongod --port 27002 --dbpath /mongo/shard2 --shardsvr --replSet rssh2
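For reference, each shard's replica set has the two data-bearing members plus the arbiter; rssh1 was initiated with a config roughly like this (just a sketch matching the member list in the rs.status() output further down; rssh2 is analogous on port 27002):

rs.initiate({
    _id: "rssh1",
    members: [
        { _id: 1, host: "mongo1:27001" },                    // primary
        { _id: 2, host: "mongo2:27001" },                    // secondary (the member being resynced)
        { _id: 3, host: "mongoa:27901", arbiterOnly: true }  // arbiter
    ]
})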
Then, for some reason, the 27001 process on mongo2 crashed last week due to running out of memory (cause unknown). By the time I discovered the issue (the application kept working, reading data from the primary) and restarted the 27001 process, it had fallen too far behind to catch up with shard1 on mongo1. So I followed 10gen's recommendation to resync the member (the full sequence is sketched below):
- emptied the directory /mongo/shard1
- restarted the 27001 process using the command:
mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1
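In other words, the full resync sequence on mongo2 was along these lines (the shutdown step is a sketch using mongod's --shutdown option; the actual stop method depends on how the process is managed):

# stop the stale shard1 member on mongo2
mongod --dbpath /mongo/shard1 --shutdown
# wipe the old data files so the member performs a full initial sync
rm -rf /mongo/shard1/*
# restart the member with the same options as before
mongod --port 27001 --dbpath /mongo/shard1 --shardsvr --replSet rssh1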
However, it's been 24+ hours now and the node is still in STARTUP2 state. I have about 200G of data in shard1, and it appears that about 160G of it has made it over to /mongo/shard1 on mongo2 (how I estimated that is shown after the status output). Below is the replica set status output, run on mongo2:
rssh1:STARTUP2> rs.status()
{
    "set" : "rssh1",
    "date" : ISODate("2012-10-29T19:28:49Z"),
    "myState" : 5,
    "syncingTo" : "mongo1:27001",
    "members" : [
        {
            "_id" : 1,
            "name" : "mongo1:27001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 99508,
            "optime" : Timestamp(1351538896000, 3),
            "optimeDate" : ISODate("2012-10-29T19:28:16Z"),
            "lastHeartbeat" : ISODate("2012-10-29T19:28:48Z"),
            "pingMs" : 0
        },
        {
            "_id" : 2,
            "name" : "mongo2:27001",
            "health" : 1,
            "state" : 5,
            "stateStr" : "STARTUP2",
            "uptime" : 99598,
            "optime" : Timestamp(1351442134000, 1),
            "optimeDate" : ISODate("2012-10-28T16:35:34Z"),
            "self" : true
        },
        {
            "_id" : 3,
            "name" : "mongoa:27901",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 99508,
            "lastHeartbeat" : ISODate("2012-10-29T19:28:48Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}
rssh1:STARTUP2>
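The 160G figure above is just the on-disk size of the shard1 data directory, compared across the two hosts with something like:

du -sh /mongo/shard1    # roughly 200G on mongo1, roughly 160G on mongo2 so far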
It would appear most of the data from the primary was replicated, but not all. The logs on mongo2 show some errors, but I don't know if they're related:
Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:config.system.indexes query:{ expireAfterSeconds: { $exists: true } }
Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over config.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }
Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: config 10065 invalid parameter: expected an object ()
Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:gf2.system.indexes query:{ expireAfterSeconds: { $exists: true } }
Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over gf2.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }
Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: gf2 10065 invalid parameter: expected an object ()
Mon Oct 29 19:39:59 [TTLMonitor] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:kombu_default.system.indexes query:{ expireAfterSeconds: { $exists: true } }
Mon Oct 29 19:39:59 [TTLMonitor] problem detected during query over kombu_default.system.indexes : { $err: "not master or secondary; cannot currently read from this replSet member", code: 13436 }
Mon Oct 29 19:39:59 [TTLMonitor] ERROR: error processing ttl for db: kombu_default 10065 invalid parameter: expected an object ()
Everything on the primary appeared to be fine, and there are no errors in its log.
I tried the resync steps twice, once with the mongo config server running and once with it down; both attempts ended with the same result.
This is a production setup and I really need to get the replica set back up and working. Any help is much appreciated.