Overview
We have been trying to restore replication between a Master and a Replication server for some time. The individual who originally set it up has left, so we are working on it without that background. In brief, the issue seems to be that the Replication server starts up and within a few minutes reports postgres: startup process recovering [FILENAME]
and never advances beyond that same FILENAME.
Configuration
Master
pg_hba.conf
local replication replicator trust
host replication replicator 10.1.20.181/32 trust
postgresql.conf
(only active settings related to replication)
wal_level = hot_standby # minimal, archive, hot_standby, or logical
fsync = on # turns forced synchronization on or off
wal_sync_method = fdatasync # the default is the first option
archive_mode = on # allows archiving to be done
archive_command = '/bin/true' # Temporary until replication is working
archive_timeout = 300 # force a logfile segment switch after this
max_wal_senders = 5 # max number of walsender processes
wal_keep_segments = 600 # in logfile segments, 16MB each; 0 disables
max_replication_slots = 3 # max number of replication slots
Replication Server
postgresql.conf
hot_standby = on
recovery.conf
standby_mode = 'on' # enables stand-by (readonly) mode
# Connect to the master postgres server using the replicator user we created.
primary_conninfo = 'host=10.1.20.180 port=5432 user=replicator'
primary_slot_name = 'oh_slot'
# Specifies a trigger file whose presence should cause streaming replication to
# end (i.e., failover).
trigger_file = '/tmp/pg_failover_trigger'
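Given the settings above, one way to sanity-check that the standby is actually attached to the master is to query both sides. The following is only a sketch: it prints the SQL rather than running it, it assumes the PostgreSQL 9.6 function and column names (these were renamed in version 10), and the helper names are made up for illustration.

```shell
# Hypothetical helpers; each one prints a query to be run through psql.

# On the master: does the slot named in primary_slot_name exist, and is it in use?
slot_query() {
    printf "SELECT slot_name, active, restart_lsn FROM pg_replication_slots WHERE slot_name = '%s';\n" "$1"
}

# On the master: is a walsender connected for the standby's address?
sender_query() {
    printf "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication WHERE client_addr = '%s';\n" "$1"
}

# On the standby: last WAL position received by streaming vs. last position replayed.
standby_query() {
    printf "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();\n"
}

slot_query oh_slot
sender_query 10.1.20.181
standby_query
```

Each query could then be passed to psql, for example psql -h 10.1.20.180 -U postgres -c "$(slot_query oh_slot)", where the postgres superuser is an assumption about the local setup.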
Replication Set up Process
On the Master, we clear out all old WAL files by using pg_controldata to find the current WAL file and delete anything older.
On the Replication Server:
- Stop the database
- Unmount the pg_xlog directory
- Delete the contents of the data directory
- Run pg_basebackup -Xs -U replicator -h #{SOURCE_DB} -D /var/lib/postgresql/9.6/main -v
- Copy recovery.conf into the new data directory
- Move the pg_xlog directory, create a new one and mount it, then copy the archive files over to the new directory and fix permissions
- Start up the database
At this point, I've seen the startup process work through a number of WAL files, but it always stops quickly (within a few minutes) and from then on reports that it is recovering the same file indefinitely. Comparing tables between the two databases confirms that the replication database is not receiving new records that exist in the master database.
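To judge how far behind the stuck file is, the WAL file names on the two servers can be compared. The following is a rough sketch, assuming the default 16MB segment size and the same timeline on both servers; the function names and the sample file names are hypothetical. On 9.6 the master's current file name can be read with SELECT pg_xlogfile_name(pg_current_xlog_location()); and the standby's from the startup process title.

```shell
# Convert a 24-hex-digit WAL file name into a running segment number.
# Layout: 8 digits timeline, 8 digits "log" number, 8 digits segment.
# Assumes 16MB segments, i.e. 256 segments per log number.
wal_segno() {
    log=$(printf '%s' "$1" | cut -c9-16)
    seg=$(printf '%s' "$1" | cut -c17-24)
    echo $(( 0x$log * 256 + 0x$seg ))
}

# Number of segments between the master's file ($1) and the standby's file ($2).
wal_gap() {
    echo $(( $(wal_segno "$1") - $(wal_segno "$2") ))
}

# Hypothetical file names, for illustration only.
master_file=000000010000000A000000C4
standby_file=000000010000000A00000010
echo "standby is $(wal_gap "$master_file" "$standby_file") segments behind"
```

A gap larger than wal_keep_segments (600 here) would suggest checking whether the master still has the segment the standby is waiting for, given that older WAL files are deleted by hand and archive_command is currently /bin/true.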
I get the feeling that I am missing something fundamental and would greatly appreciate questions and guidance.