I am monitoring our PostgreSQL slave server's replication lag with the Python script below, which queries pg_current_xlog_location() on the master and compares it to pg_last_xlog_replay_location() on the slave. Lately, I have been receiving warning emails indicating that the replication lag is between 2k and 70k bytes.
What would be a reasonable expectation here? I assume it depends on the WAL buffer size and checkpoint interval, but I am not sure exactly how to calculate it. Also, would I be better off comparing to pg_last_xlog_receive_location() on the slave?
P.S. I am also monitoring replication on the master server by comparing sent_location to replay_location in the pg_stat_replication view. Additionally, I check that the master server is in streaming mode. That monitor has never fired an alert...
#!/usr/bin/python
import subprocess
import sys

# Alert threshold, in bytes of WAL
slaveXlogDiffLimitBytes = 128

try:
    # Confirm that this server is running as a standby (in recovery)
    repModeRes = subprocess.check_output('psql -t -p {{postgresql_port}} -c "SELECT pg_is_in_recovery()"', shell=True, universal_newlines=True)
    isInRepMode = repModeRes.strip() == 't'

    # Current WAL write location on the master
    masterXlogLocationRes = subprocess.check_output('psql -t -p {{postgresql_port}} -h {{postgres_basebackup_host}} -U {{postgres_basebackup_user}} {{postgres_db_name}} -c "select pg_current_xlog_location();"', shell=True, universal_newlines=True)
    masterXlogLocationStr = masterXlogLocationRes.strip()

    # Byte difference: slave replay location minus the master location sampled above
    slaveXlogDiffRes = subprocess.check_output('psql -t -p {{postgresql_port}} {{postgres_db_name}} -c "select pg_xlog_location_diff(pg_last_xlog_replay_location(), \'' + masterXlogLocationStr + '\'::pg_lsn);"', shell=True, universal_newlines=True)
    slaveXlogDiffBytes = float(slaveXlogDiffRes.strip())
except subprocess.CalledProcessError as e:
    print("Error retrieving stats: {0}".format(e))
    sys.exit(1)

if not isInRepMode:
    print('Slave server is not in recovery mode')
    sys.exit(1)

if slaveXlogDiffBytes > slaveXlogDiffLimitBytes:
    print("Slave server replication is behind master by %f bytes" % slaveXlogDiffBytes)
    sys.exit(1)

print('All clear!')
sys.exit(0)
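
For reference, here is how I understand the arithmetic behind these comparisons. This is only a minimal sketch: the helper names lsn_to_bytes and lag_breakdown are hypothetical, the sample locations are made up, and it assumes the usual textual form of a WAL location (two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit byte position). It splits the total lag into the part not yet received by the slave and the part received but not yet replayed, which is the distinction behind my question about the receive location.

```python
def lsn_to_bytes(lsn):
    """Convert a textual WAL location such as '16/B374D848' to a byte position."""
    high, low = lsn.strip().split('/')
    return (int(high, 16) << 32) + int(low, 16)


def lag_breakdown(master_lsn, receive_lsn, replay_lsn):
    """Return (transit_bytes, apply_bytes, total_bytes) for one standby.

    transit_bytes: written on the master but not yet received by the slave
    apply_bytes:   received by the slave but not yet replayed
    total_bytes:   master location minus slave replay location
    """
    master = lsn_to_bytes(master_lsn)
    received = lsn_to_bytes(receive_lsn)
    replayed = lsn_to_bytes(replay_lsn)
    transit_bytes = master - received
    apply_bytes = received - replayed
    return transit_bytes, apply_bytes, master - replayed


if __name__ == '__main__':
    # Made-up sample locations, for illustration only
    print(lag_breakdown('0/3000148', '0/3000100', '0/3000060'))
```

Here the lag is taken as master minus replay, so a positive number means the slave is behind. Note that pg_xlog_location_diff(a, b) returns a minus b, so the query in my script, which passes the replay location first, yields replay minus master; because the master location is sampled before the slave is queried, that value can come out positive whenever the slave has already replayed past the sampled point.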