While syncing a slave (SECONDARY) node, I get a DBClientCursor::init call() failed
error and the node starts its initial sync over from scratch. The full log is below:
Tue Nov 10 05:54:56.281 [rsSync] clone xxxxx.article_model 1207039
Tue Nov 10 05:54:59.228 [conn1556] end connection 94.186.12.83:41784 (0 connections now open)
Tue Nov 10 05:54:59.323 [initandlisten] connection accepted from 94.186.12.83:41803 #1557 (1 connection now open)
Tue Nov 10 05:55:07.922 [rsSync] 1207495 objects cloned so far from collection xxxxx.article_model
Tue Nov 10 05:55:30.821 [conn1557] end connection 94.186.12.83:41803 (0 connections now open)
Tue Nov 10 05:55:30.915 [initandlisten] connection accepted from 94.186.12.83:41829 #1558 (1 connection now open)
Tue Nov 10 05:56:05.888 [rsHealthPoll] DBClientCursor::init call() failed
Tue Nov 10 05:56:05.888 [rsHealthPoll] replset info xxxxxa1:27017 heartbeat failed, retrying
Tue Nov 10 05:56:06.888 [rsHealthPoll] replSet info xxxxxa1:27017 is down (or slow to respond):
Tue Nov 10 05:56:06.888 [rsHealthPoll] replSet member xxxxxa1:27017 is now in state DOWN
Tue Nov 10 05:56:09.088 [rsHealthPoll] replset info xxxxxa1:27017 thinks that we are down
Tue Nov 10 05:56:09.088 [rsHealthPoll] replSet member xxxxxa1:27017 is up
Tue Nov 10 05:56:09.088 [rsHealthPoll] replSet member xxxxxa1:27017 is now in state SECONDARY
Tue Nov 10 05:56:11.187 [rsHealthPoll] replset info xxxxxa1:27017 thinks that we are down
Tue Nov 10 05:56:13.288 [rsHealthPoll] replset info xxxxxa1:27017 thinks that we are down
Tue Nov 10 05:56:14.297 [initandlisten] connection accepted from 94.186.12.83:41910 #1559 (2 connections now open)
Tue Nov 10 05:56:14.551 [conn1558] end connection 94.186.12.83:41829 (1 connection now open)
Tue Nov 10 05:56:18.499 [conn1559] end connection 94.186.12.83:41910 (0 connections now open)
Tue Nov 10 05:56:21.600 [initandlisten] connection accepted from 94.186.12.83:41925 #1560 (1 connection now open)
Tue Nov 10 05:56:21.795 [conn1560] replSet info voting yea for xxxxxa1:27017 (0)
Tue Nov 10 05:56:23.865 [rsHealthPoll] replSet member xxxxxa1:27017 is now in state PRIMARY
Tue Nov 10 05:56:36.553 [rsSync] Socket flush send() errno:9 Bad file descriptor 94.186.12.83:27017
Tue Nov 10 05:56:36.553 [rsSync] caught exception (socket exception [SEND_ERROR] for 94.186.12.83:27017) in destructor (~PiggyBackData)
Tue Nov 10 05:56:36.553 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 9 attempts remaining
Tue Nov 10 05:56:53.103 [conn1560] end connection 94.186.12.83:41925 (0 connections now open)
Tue Nov 10 05:56:53.199 [initandlisten] connection accepted from 94.186.12.83:41991 #1561 (1 connection now open)
Tue Nov 10 05:57:06.553 [rsSync] replSet initial sync pending
Tue Nov 10 05:57:06.554 [rsSync] replSet syncing to: xxxxxa1:27017
Tue Nov 10 05:57:06.940 [rsSync] replSet initial sync drop all databases
Tue Nov 10 05:57:06.940 [rsSync] dropAllDatabasesExceptLocal 2
Tue Nov 10 05:57:06.947 [rsSync] removeJournalFiles
Tue Nov 10 05:57:07.985 [rsSync] replSet initial sync clone all databases
Tue Nov 10 05:57:08.096 [rsSync] replSet initial sync cloning db: xxxxx
Tue Nov 10 05:57:08.294 [FileAllocator] allocating new datafile /data/db/xxxxx.ns, filling with zeroes...
Tue Nov 10 05:57:08.299 [FileAllocator] done allocating datafile /data/db/xxxxx.ns, size: 16MB, took 0.004 secs
Tue Nov 10 05:57:08.307 [FileAllocator] allocating new datafile /data/db/xxxxx.0, filling with zeroes...
Tue Nov 10 05:57:08.309 [FileAllocator] done allocating datafile /data/db/xxxxx.0, size: 64MB, took 0.001 secs
Tue Nov 10 05:57:08.309 [FileAllocator] allocating new datafile /data/db/xxxxx.1, filling with zeroes...
Tue Nov 10 05:57:08.311 [FileAllocator] done allocating datafile /data/db/xxxxx.1, size: 128MB, took 0.001 secs
Tue Nov 10 05:57:08.881 [rsSync] build index xxxxx.source_model { _id: 1 }
Tue Nov 10 05:57:08.883 [rsSync] fastBuildIndex dupsToDrop:0
Tue Nov 10 05:57:08.883 [rsSync] build index done. scanned 164 total records. 0.001 secs
Tue Nov 10 05:57:08.979 [rsSync] build index xxxxx.category_model { _id: 1 }
Tue Nov 10 05:57:08.980 [rsSync] fastBuildIndex dupsToDrop:0
Tue Nov 10 05:57:08.980 [rsSync] build index done. scanned 37 total records. 0.001 secs
Tue Nov 10 05:57:24.706 [conn1561] end connection 94.186.12.83:41991 (0 connections now open)
Tue Nov 10 05:57:24.805 [initandlisten] connection accepted from 94.186.12.83:42020 #1562 (1 connection now open)
Tue Nov 10 05:57:56.314 [conn1562] end connection 94.186.12.83:42020 (0 connections now open)
Tue Nov 10 05:57:56.414 [initandlisten] connection accepted from 94.186.12.83:42041 #1563 (1 connection now open)
Tue Nov 10 05:58:27.950 [conn1563] end connection 94.186.12.83:42041 (0 connections now open)
Tue Nov 10 05:58:28.046 [initandlisten] connection accepted from 94.186.12.83:42068 #1564 (1 connection now open)
Tue Nov 10 05:58:28.990 [rsSync] 843 objects cloned so far from collection xxxxx.article_model
Tue Nov 10 05:58:53.214 [rsSync] clone xxxxx.article_model 1151
My primary node's data set is large (over 2M records), the MongoDB version is 2.4.10, and I'm running it on Debian. This error has occurred more than 3 times while syncing. How can I fix it?
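For reference, the DOWN/SECONDARY/PRIMARY flapping in the log can also be watched live from the mongo shell; a minimal sketch using the standard rs.status() helper (hostnames would appear masked the same way as above):

// Print each replica set member's state, health, and last heartbeat,
// using the rs.status() helper available in the 2.4 shell.
rs.status().members.forEach(function (m) {
    print(m.name + "  state: " + m.stateStr +
          "  health: " + m.health +
          "  lastHeartbeat: " + m.lastHeartbeat);
});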
EDIT: This is my primary server's log:
Tue Nov 10 17:24:42.320 [rsHealthPoll] DBClientCursor::init call() failed
Tue Nov 10 17:24:42.320 [rsHealthPoll] replSet info mongo1.mysite.ir:27017 is down (or slow to respond):
Tue Nov 10 17:24:42.320 [rsHealthPoll] replSet member mongo1.mysite.ir:27017 is now in state DOWN
Tue Nov 10 17:24:42.321 [rsMgr] can't see a majority of the set, relinquishing primary
Tue Nov 10 17:24:42.321 [rsMgr] replSet relinquishing primary state
Tue Nov 10 17:24:42.321 [rsMgr] replSet SECONDARY
Tue Nov 10 17:24:42.321 [rsMgr] replSet closing client sockets after relinquishing primary
Tue Nov 10 17:24:42.321 [conn2963] end connection 82.102.13.129:54929 (9 connections now open)
Tue Nov 10 17:24:42.321 [conn2957] end connection 127.0.0.1:33176 (8 connections now open)
Tue Nov 10 17:24:42.321 [conn2958] end connection 127.0.0.1:33178 (7 connections now open)
Tue Nov 10 17:24:42.321 [conn2961] end connection 127.0.0.1:33185 (6 connections now open)
Tue Nov 10 17:24:42.321 [conn2956] end connection 94.182.163.82:60321 (8 connections now open)
Tue Nov 10 17:24:42.321 [conn2960] end connection 127.0.0.1:33184 (4 connections now open)
Tue Nov 10 17:24:42.321 [conn2962] end connection 94.182.163.82:60325 (3 connections now open)
Tue Nov 10 17:24:42.321 [conn2966] end connection 94.182.163.82:60327 (4 connections now open)
Tue Nov 10 17:24:42.321 [conn2964] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [82.102.13.129:54930]
Tue Nov 10 17:24:44.356 [initandlisten] connection accepted from 127.0.0.1:33217 #2967 (2 connections now open)
Tue Nov 10 17:24:44.356 [conn2967] authenticate db: mysite { authenticate: 1, user: "mysite", nonce: "a9f682aa6ef48525", key: "c1a5dded7acbb62ebefb8b0e7fc4eed6" }
Tue Nov 10 17:24:44.357 [conn2967] assertion 13435 not master and slaveOk=false ns:mysite.user_model query:{ _id: ObjectId('55550934608395811e47a3f8') }
Tue Nov 10 17:24:44.357 [conn2967] ntoskip:0 ntoreturn:-1
Tue Nov 10 17:24:44.358 [conn2967] end connection 127.0.0.1:33217 (1 connection now open)
Tue Nov 10 17:24:44.475 [initandlisten] connection accepted from 127.0.0.1:33219 #2968 (2 connections now open)
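The assertion 13435 near the end is presumably my application's read hitting the just-demoted primary; the shell equivalent of that exact query (taken from the log line above) would be:

// Shell equivalent of the failing read from the log. Against a member
// in SECONDARY state this raises "not master and slaveOk=false"
// unless rs.slaveOk() (or a secondary-ok read preference) is set first.
db.getSiblingDB("mysite").user_model.findOne(
    { _id: ObjectId("55550934608395811e47a3f8") }
);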