I am running a four-node PostgreSQL 9.5 setup with replication slots: one node acts as the master and the other three as standbys. Everything ran fine for a few days, then one standby suddenly lagged behind. The logs on the master and that standby are as follows -
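For context, I created one physical replication slot per standby on the master, roughly like this (the slot names here are illustrative, not my real ones):

-- one slot per standby; each standby's recovery.conf then points at its
-- slot via primary_slot_name
SELECT pg_create_physical_replication_slot('node2_slot');
SELECT pg_create_physical_replication_slot('node3_slot');
SELECT pg_create_physical_replication_slot('node4_slot');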
Master logs -
2015-12-22 16:09:19 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:19 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:20 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:20 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:21 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:21 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:22 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:22 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:23 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:23 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:24 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:24 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:25 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:25 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:26 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:26 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
2015-12-22 16:09:27 IST LOG: could not close temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
2015-12-22 16:09:27 IST LOG: could not close temporary statistics file "pg_stat_tmp/global.tmp": No space left on device
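A query along these lines shows how much WAL each slot is holding back on the master (a sketch using the 9.5 function names pg_current_xlog_location and pg_xlog_location_diff):

-- bytes by which each slot's restart point trails the current WAL position
SELECT slot_name,
       active,
       restart_lsn,
       pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;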
Standby logs -
2015-12-22 17:07:42 IST FATAL: could not write to file "pg_xlog/xlogtemp.2684": No space left on device
2015-12-22 17:07:46 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:08:03 IST FATAL: could not write to file "pg_xlog/xlogtemp.3624": No space left on device
2015-12-22 17:08:18 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:08:26 IST FATAL: could not write to file "pg_xlog/xlogtemp.3080": No space left on device
2015-12-22 17:08:41 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:08:54 IST FATAL: could not write to file "pg_xlog/xlogtemp.2768": No space left on device
2015-12-22 17:09:10 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:09:18 IST FATAL: could not write to file "pg_xlog/xlogtemp.2940": No space left on device
2015-12-22 17:09:33 IST ERROR: could not write to file "pg_xlog/xlogtemp.1964": No space left on device
2015-12-22 17:09:34 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:09:54 IST FATAL: could not write to file "pg_xlog/xlogtemp.3528": No space left on device
2015-12-22 17:10:09 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:10:11 IST FATAL: could not write to file "pg_xlog/xlogtemp.3036": No space left on device
2015-12-22 17:10:14 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:10:15 IST FATAL: could not write to file "pg_xlog/xlogtemp.1708": No space left on device
2015-12-22 17:10:19 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:10:25 IST FATAL: could not write to file "pg_xlog/xlogtemp.1856": No space left on device
2015-12-22 17:10:25 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:10:28 IST FATAL: could not write to file "pg_xlog/xlogtemp.3772": No space left on device
2015-12-22 17:10:30 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
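While this was going on, receive versus replay progress on the standby can be checked with something like this (again a sketch; these are the 9.5 function names):

-- on the standby: last WAL position received vs. last position replayed
SELECT pg_last_xlog_receive_location() AS received,
       pg_last_xlog_replay_location()  AS replayed,
       pg_last_xact_replay_timestamp() AS last_replay;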
However, both servers kept running and did not shut down or crash. The standby then fell behind, and I started getting the following logs on it -
2015-12-22 17:13:24 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:24 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
2015-12-22 17:13:29 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:29 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
2015-12-22 17:13:34 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:34 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
2015-12-22 17:13:39 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:39 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
2015-12-22 17:13:44 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:44 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
2015-12-22 17:13:49 IST LOG: started streaming WAL from primary at 2/58000000 on timeline 1
2015-12-22 17:13:49 IST FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000200000058 has already been removed
When I run the query -

SELECT * FROM pg_replication_slots;

it shows the slot for this standby as inactive. I am not using WAL archiving. My questions are -

1. Why did this happen, given that slot replication is supposed to retain WAL on the master until all replicas have replayed it?
2. Why did this happen to only one replica and not the others?
3. Is there any way to recover this replica now?
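The only recovery path I can think of for question 3 is to drop the stale slot so the master stops tracking it, recreate it, and re-seed the standby from a fresh pg_basebackup; the slot side would be roughly this (slot name illustrative):

-- on the master: drop the slot that now points at already-removed WAL
SELECT pg_drop_replication_slot('node2_slot');
-- recreate it before re-seeding the standby with a new base backup
SELECT pg_create_physical_replication_slot('node2_slot');

Is that the right approach, or can the replica resume without a full re-seed?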