I have a replica set and I added a new member to the set. The initialSync begins for the new member and rs.status (on primary) shows STARTUP2
as status. However, after a long enough time, there's a cryptic fassert
error coming on the new instance.
Log dump is as following:
2014-11-02T22:53:23.995+0000 [clientcursormon] mem (MB) res:330 virt:45842
2014-11-02T22:53:23.995+0000 [clientcursormon] mapped (incl journal view):45038
2014-11-02T22:53:23.995+0000 [clientcursormon] connections:27
2014-11-02T22:53:23.995+0000 [clientcursormon] replication threads:32
2014-11-02T22:53:25.427+0000 [conn2012] end connection xx.xx.xx.xx:1201 (26 connections now open)
2014-11-02T22:53:25.433+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1200 #2014 (27 connections now open)
2014-11-02T22:53:25.436+0000 [conn2014] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:26.775+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1058 #2015 (28 connections now open)
2014-11-02T22:53:26.864+0000 [conn1993] end connection xx.xx.xx.xx:1059 (27 connections now open)
2014-11-02T22:53:29.090+0000 [rsSync] Socket recv() errno:110 Connection timed out xx.xx.xx.xx:27017
2014-11-02T22:53:29.096+0000 [rsSync] SocketException: remote: xx.xx.xx.xx:27017 error: 9001 socket exception [RECV_ERROR] server [168.63.252.61:27017]
2014-11-02T22:53:29.099+0000 [rsSync] DBClientCursor::init call() failed
2014-11-02T22:53:29.307+0000 [rsSync] replSet initial sync exception: 13386 socket error for mapping query 0 attempts remaining
2014-11-02T22:53:36.113+0000 [conn2013] end connection xx.xx.xx.xx:1056 (26 connections now open)
2014-11-02T22:53:36.153+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1137 #2016 (27 connections now open)
2014-11-02T22:53:36.154+0000 [conn2016] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:55.541+0000 [conn2014] end connection xx.xx.xx.xx:1200 (26 connections now open)
2014-11-02T22:53:55.578+0000 [initandlisten] connection accepted from xx.xx.xx.xx:1201 #2017 (27 connections now open)
2014-11-02T22:53:55.580+0000 [conn2017] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-02T22:53:56.861+0000 [conn2015] authenticate db: admin { authenticate: 1, user: "root", nonce: "xxx", key: "xxx" }
2014-11-02T22:53:59.310+0000 [rsSync] Fatal Assertion 16233
2014-11-02T22:53:59.723+0000 [rsSync] 0x11c0e91 0x1163109 0x114576d 0xe84c1f 0xea770e 0xea7800 0xea7af8 0x1205829 0x7ff728cf8e9a 0x7ff72800b3fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x114576d]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xe84c1f]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x18e) [0xea770e]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x30) [0xea7800]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0xa8) [0xea7af8]
/usr/bin/mongod() [0x1205829]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7ff728cf8e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff72800b3fd]
2014-11-02T22:53:59.723+0000 [rsSync]
***aborting after fassert() failure
2014-11-02T22:53:59.728+0000 [rsSync] SEVERE: Got signal: 6 (Aborted)
.
The worst part is that when I try to re-start the mongod
service, the replication begins afresh trying to reSync all the files which are already there. This seems bizarre and useless.
Can someone please help me understand what is going on?