I am on mongo v3.0.x and have a replication set with 3 members, but the second member (and entire server) was unexpectedly taken down and destroyed. I have stood up a new server to use as a replacement for the the second member and need help bringing it back to a working state. I am not sure if this is the proper way to bring a member back in, even though they have the same address (mongochat02).
Do I need to do a rs.reconfig() or remove(mongochat02) and add(mongochat02) again? Or am I supposed to follow a different procedure to get this member working?
When I issue rs.status():
001-rs:PRIMARY> rs.status()
{
"set" : "001-rs",
"date" : ISODate("2017-04-07T19:39:23.860Z"),
"myState" : 1,
"term" : NumberLong(-1),
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "mongochat01:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 174004,
"optime" : Timestamp(1491593961, 4),
"optimeDate" : ISODate("2017-04-07T19:39:21Z"),
"lastHeartbeat" : ISODate("2017-04-07T19:39:22.386Z"),
"lastHeartbeatRecv" : ISODate("2017-04-07T19:39:21.977Z"),
"pingMs" : NumberLong(0),
"syncingTo" : "mongochat03:27017",
"configVersion" : 3
},
{
"_id" : 1,
"name" : "mongochat02:27017",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : Timestamp(0, 0),
"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
"lastHeartbeat" : ISODate("2017-04-07T19:39:23.672Z"),
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
"pingMs" : NumberLong(0),
"lastHeartbeatMessage" : "Operation timed out",
"configVersion" : -1
},
{
"_id" : 2,
"name" : "mongochat03:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 15127641,
"optime" : Timestamp(1491593963, 2),
"optimeDate" : ISODate("2017-04-07T19:39:23Z"),
"electionTime" : Timestamp(1491419961, 1),
"electionDate" : ISODate("2017-04-05T19:19:21Z"),
"configVersion" : 3,
"self" : true
}
],
"ok" : 1
}
While on mongochat02, when I issue rs.status()
> rs.status()
{
"info" : "run rs.initiate(...) if not yet done for the set",
"ok" : 0,
"errmsg" : "no replset config has been received",
"code" : 94
}
Connections seem to work up until the heartbeat. In the logs, there is the following error while on mongochat02:
2017-04-06T10:42:11.831-0600 I REPL [ReplicationExecutor] Error in heartbeat request mongochat01:27017; ExceededTimeLimit: Operation timed out
2017-04-06T10:42:11.911-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:36358 #6671 (151 connections now open)
2017-04-06T10:42:11.947-0600 I REPL [ReplicationExecutor] Error in heartbeat request to mongochat03:27017; ExceededTimeLimit: Operation timed out
This made me question if the members are unable to communicate with one another:
- All the members are able to ping one another
-
But mongochat02 is unable to connect to mongochat03/01 through shell
[root@mongochat02]$ mongo --host mongochat03:27017 MongoDB shell version: 3.2.9 connecting to: mongochat03:27017/test
2017-04-07T14:02:03.411-0600 I NETWORK [thread1] Socket recv() errno:110 Connection timed out mongochat03:27017
2017-04-07T14:02:03.411-0600 I NETWORK [thread1] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_ERROR] server [mongochat03:27017]
Mongo log from mongochat01 (currently secondary)
2017-04-11T21:03:36.264+0000 I REPL [ReplicationExecutor] Member mongochat02:27017 is now in state STARTUP
2017-04-11T21:03:41.779+0000 I NETWORK [initandlisten] connection accepted from 10.8.129.114:33648 #43227 (104 connections now open)
2017-04-11T21:03:47.896+0000 I NETWORK [conn43147] end connection 10.8.129.114:33200 (103 connections now open)
2017-04-11T21:03:48.265+0000 I NETWORK [ReplExecNetThread-2] Socket recv() timeout 10.8.129.114:27017
2017-04-11T21:03:48.265+0000 I NETWORK [ReplExecNetThread-2] SocketException: remote: 10.8.129.114:27017 error: 9001 socket exception [RECV_TIMEOUT] server [10.8.129.114:27017]
2017-04-11T21:03:48.265+0000 I NETWORK [ReplExecNetThread-2] DBClientCursor::init call() failed
2017-04-11T21:03:48.265+0000 I REPL [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; Location10276 DBClientBase::findN: transport error: mongochat02:27017 ns: admin.$cmd query: { replSetHeartbeat: "001-rs", pv: 1, v: 3, from: "mongochat01:27017", fromId: 0, checkEmpty: false }
2017-04-11T21:03:50.380+0000 I REPL [ReplicationExecutor] Member mongochat02:27017 is now in state STARTUP
2017-04-11T21:03:53.778+0000 I NETWORK [initandlisten] connection accepted from 10.8.129.114:33654 #43228 (104 connections now open)
mongochat03 log (curent Primary)
2017-04-11T21:15:45.955+0000 I REPL [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; ExceededTimeLimit: Operation timed out
2017-04-11T21:15:46.231+0000 I NETWORK [initandlisten] connection accepted from 10.8.129.114:55282 #1148708 (108 connections now open)
2017-04-11T21:15:48.041+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat02:27017
2017-04-11T21:15:52.764+0000 I NETWORK [conn1148596] end connection 10.8.129.114:54834 (107 connections now open)
2017-04-11T21:15:57.956+0000 I REPL [ReplicationExecutor] Error in heartbeat request to mongochat02:27017; ExceededTimeLimit: Operation timed out
2017-04-11T21:15:58.231+0000 I NETWORK [initandlisten] connection accepted from 10.8.129.114:55288 #1148709 (108 connections now open)
2017-04-11T21:15:59.902+0000 I NETWORK [conn1148706] end connection 10.1.240.185:58407 (107 connections now open)
2017-04-11T21:15:59.933+0000 I NETWORK [initandlisten] connection accepted from 10.1.240.185:58560 #1148710 (108 connections now open)
2017-04-11T21:16:00.041+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat02:27017
2017-04-11T21:16:04.540+0000 I NETWORK [conn1148597] end connection 10.8.129.114:54842 (107 connections now open)
mongo.conf from monogchat01. They are all pretty much the same except the "bindIp" is the server ip.
# mongod.conf
# for documentation of all options, see:
# http://docs.mongodb.org/manual/reference/configuration-options/
# where to write logging data.
systemLog:
destination: file
logAppend: true
path: /var/log/mongodb/mongod.log
# Where and how to store data.
storage:
dbPath: /opt/mongo
journal:
enabled: true
# engine:
# mmapv1:
# wiredTiger:
# how the process runs
processManagement:
fork: true # fork and run in background
pidFilePath: /var/run/mongodb/mongod.pid # location of pidfile
# network interfaces
net:
port: 27017
bindIp: 127.0.0.1,10.1.240.185 # Listen to local interface only, comment to listen on all interfaces.
#security:
#operationProfiling:
#replication:
replication:
replSetName: "001-rs"
#sharding:
## Enterprise-Only Options
#auditLog:
#snmp:
Logs on mongochat02
2017-04-12T09:00:18.782-0600 I CONTROL [main] ***** SERVER RESTARTED *****
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] MongoDB starting : pid=27788 port=27017 dbpath=/opt/mongo 64-bit host=den02vmchat02
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] db version v3.2.9
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] git version: 22ec9e93b40c85fc7cae7d56e7d6a02fd811088c
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] allocator: tcmalloc
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] modules: none
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] build environment:
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] distmod: rhel70
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] distarch: x86_64
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] target_arch: x86_64
2017-04-12T09:00:18.789-0600 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, replication: { replSetName: "001-rs" }, storage: { dbPath: "/opt/mongo", journal: { enabled: true } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2017-04-12T09:00:18.809-0600 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=3G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten]
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten] ** WARNING: soft rlimits too low. rlimits set to 4096 processes, 64000 files. Number of processes should be at least 32000 : 0.5 times number of files.
2017-04-12T09:00:19.631-0600 I CONTROL [initandlisten]
2017-04-12T09:00:19.814-0600 I REPL [initandlisten] Did not find local voted for document at startup; NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2017-04-12T09:00:19.814-0600 I REPL [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2017-04-12T09:00:19.814-0600 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/opt/mongo/diagnostic.data'
2017-04-12T09:00:19.814-0600 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2017-04-12T09:00:20.185-0600 I NETWORK [initandlisten] waiting for connections on port 27017
2017-04-12T09:00:20.561-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.207:57560 #1 (1 connection now open)
2017-04-12T09:00:20.603-0600 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongochat03:27017
2017-04-12T09:00:21.247-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51255 #2 (2 connections now open)
2017-04-12T09:00:21.291-0600 I NETWORK [conn2] end connection 10.1.240.185:51255 (1 connection now open)
2017-04-12T09:00:21.424-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51256 #3 (2 connections now open)
2017-04-12T09:00:21.528-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51257 #4 (3 connections now open)
2017-04-12T09:00:21.632-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51258 #5 (4 connections now open)
2017-04-12T09:00:21.736-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51259 #6 (5 connections now open)
2017-04-12T09:00:21.834-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51260 #7 (6 connections now open)
2017-04-12T09:00:21.839-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51261 #8 (7 connections now open)
2017-04-12T09:00:21.877-0600 I NETWORK [conn7] end connection 10.1.240.185:51260 (6 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK [conn3] end connection 10.1.240.185:51256 (5 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK [conn4] end connection 10.1.240.185:51257 (4 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK [conn5] end connection 10.1.240.185:51258 (3 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK [conn6] end connection 10.1.240.185:51259 (2 connections now open)
2017-04-12T09:00:21.882-0600 I NETWORK [conn8] end connection 10.1.240.185:51261 (1 connection now open)
2017-04-12T09:00:21.929-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51262 #9 (2 connections now open)
2017-04-12T09:00:21.929-0600 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to chat.zayo.us:27017
2017-04-12T09:00:21.947-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51263 #10 (3 connections now open)
2017-04-12T09:00:22.051-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51264 #11 (4 connections now open)
2017-04-12T09:00:22.154-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51265 #12 (5 connections now open)
2017-04-12T09:00:22.258-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51266 #13 (6 connections now open)
2017-04-12T09:00:22.362-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51267 #14 (7 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK [conn10] end connection 10.1.240.185:51263 (6 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK [conn12] end connection 10.1.240.185:51265 (6 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK [conn13] end connection 10.1.240.185:51266 (4 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK [conn14] end connection 10.1.240.185:51267 (3 connections now open)
2017-04-12T09:00:22.406-0600 I NETWORK [conn11] end connection 10.1.240.185:51264 (6 connections now open)
2017-04-12T09:00:22.990-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51268 #15 (3 connections now open)
2017-04-12T09:00:23.022-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51270 #16 (4 connections now open)
2017-04-12T09:00:23.051-0600 I NETWORK [conn15] end connection 10.1.240.185:51268 (3 connections now open)
2017-04-12T09:00:23.092-0600 I NETWORK [conn16] end connection 10.1.240.185:51270 (2 connections now open)
2017-04-12T09:00:25.397-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51271 #17 (3 connections now open)
2017-04-12T09:00:25.441-0600 I NETWORK [conn17] end connection 10.1.240.185:51271 (2 connections now open)
2017-04-12T09:00:25.776-0600 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongochat03:27017
2017-04-12T09:00:25.861-0600 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to chat.zayo.us:27017
2017-04-12T09:00:25.994-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51272 #18 (3 connections now open)
2017-04-12T09:00:26.003-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51273 #19 (4 connections now open)
2017-04-12T09:00:26.037-0600 I NETWORK [conn18] end connection 10.1.240.185:51272 (3 connections now open)
2017-04-12T09:00:26.107-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51274 #20 (4 connections now open)
2017-04-12T09:00:26.211-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51275 #21 (5 connections now open)
2017-04-12T09:00:26.314-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51276 #22 (6 connections now open)
2017-04-12T09:00:26.418-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51277 #23 (7 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK [conn19] end connection 10.1.240.185:51273 (6 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK [conn20] end connection 10.1.240.185:51274 (5 connections now open)
2017-04-12T09:00:26.461-0600 I NETWORK [conn21] end connection 10.1.240.185:51275 (4 connections now open)
2017-04-12T09:00:26.462-0600 I NETWORK [conn22] end connection 10.1.240.185:51276 (3 connections now open)
2017-04-12T09:00:26.462-0600 I NETWORK [conn23] end connection 10.1.240.185:51277 (2 connections now open)
2017-04-12T09:00:26.514-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51278 #24 (3 connections now open)
2017-04-12T09:00:26.618-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51279 #25 (4 connections now open)
2017-04-12T09:00:26.722-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51280 #26 (5 connections now open)
2017-04-12T09:00:26.826-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51281 #27 (6 connections now open)
2017-04-12T09:00:26.931-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51282 #28 (7 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK [conn24] end connection 10.1.240.185:51278 (6 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK [conn25] end connection 10.1.240.185:51279 (5 connections now open)
2017-04-12T09:00:26.974-0600 I NETWORK [conn26] end connection 10.1.240.185:51280 (4 connections now open)
2017-04-12T09:00:26.975-0600 I NETWORK [conn27] end connection 10.1.240.185:51281 (3 connections now open)
2017-04-12T09:00:26.975-0600 I NETWORK [conn28] end connection 10.1.240.185:51282 (3 connections now open)
2017-04-12T09:00:27.161-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51283 #29 (3 connections now open)
2017-04-12T09:00:27.200-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51284 #30 (4 connections now open)
2017-04-12T09:00:27.203-0600 I NETWORK [conn29] end connection 10.1.240.185:51283 (3 connections now open)
2017-04-12T09:00:27.244-0600 I NETWORK [conn30] end connection 10.1.240.185:51284 (2 connections now open)
2017-04-12T09:00:29.553-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51285 #31 (3 connections now open)
2017-04-12T09:00:29.596-0600 I NETWORK [conn31] end connection 10.1.240.185:51285 (2 connections now open)
2017-04-12T09:00:30.145-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51286 #32 (3 connections now open)
2017-04-12T09:00:30.188-0600 I NETWORK [conn32] end connection 10.1.240.185:51286 (2 connections now open)
2017-04-12T09:00:30.569-0600 I NETWORK [initandlisten] connection accepted from 10.1.240.185:51288 #33 (3 connections now open)
2017-04-12T09:00:30.603-0600 I REPL [ReplicationExecutor] Error in heartbeat request to mongochat03:27017; ExceededTimeLimit: Operation timed out