Channel: StackExchange Replication Questions

incorrect resource manager data checksum in record at 2/XYZ + terminating walreceiver process due to administrator command


I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approximately 2 months. Yesterday, replication to one of the slaves failed, and the log on that slave showed:

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating walreceiver process due to administrator command
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
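When the same message repeats many times, it can help to count how often it occurs and at which WAL location. The following is only a minimal sketch: the function name `count_checksum_errors` and the sample lines are illustrative, and it assumes the log lines look like the excerpt above (any `log_line_prefix` in front of `LOG:` is tolerated because the pattern is searched, not anchored).

```python
import re
from collections import Counter

# Matches lines such as:
#   LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
# and captures the WAL location after "at".
CHECKSUM_RE = re.compile(
    r"LOG:\s+incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def count_checksum_errors(lines):
    """Return a Counter mapping WAL location -> number of checksum messages."""
    counts = Counter()
    for line in lines:
        match = CHECKSUM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample, shortened from the excerpt above.
    sample = [
        "LOG:  incorrect resource manager data checksum in record at 61/DA2710A7",
        "FATAL:  terminating walreceiver process due to administrator command",
        "LOG:  incorrect resource manager data checksum in record at 61/DA2710A7",
    ]
    for location, n in count_checksum_errors(sample).items():
        print(location, n)
```

If every message reports the same location, as in the excerpt, the standby is repeatedly failing on one and the same record rather than hitting new bad records.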

The slave was no longer in sync with the master. For the next two hours, the log gained a new line like the ones above every 5 seconds. I then restarted the slave database server:

LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
FATAL:  terminating connection due to administrator command
FATAL:  terminating connection due to administrator command
LOG:  shutting down
LOG:  database system is shut down

The new log file on the slave contains:

LOG:  database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG:  entering standby mode
LOG:  redo starts at 61/D92C10C9
LOG:  consistent recovery state reached at  61/DA2710A7
LOG:  database system is ready to accept read only connections
LOG:  incorrect resource manager data checksum in record at 61/DA2710A7
LOG:  streaming replication successfully connected to primary

Now the slave is in sync with the master, but the checksum message still appears in the log. I also checked the network logs: the network was available.
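One way to put a number on "in sync" is to compare WAL locations, for example the value of `pg_current_xlog_location()` on the master with `pg_last_xlog_replay_location()` on the slave (both functions exist in 9.1). The sketch below only does the arithmetic on two such strings. It assumes the pre-9.3 layout, in which each logical WAL file (the part before the slash) covers 0xFF000000 bytes. The helper names are illustrative, and the constant should be checked against the server version in use.

```python
# Assumption: PostgreSQL releases before 9.3 use 255 segments of 16 MB per
# logical WAL file, i.e. 0xFF000000 bytes. Adjust if your version differs.
BYTES_PER_LOGICAL_FILE = 0xFF000000


def lsn_to_bytes(lsn, bytes_per_file=BYTES_PER_LOGICAL_FILE):
    """Convert a WAL location such as '61/DA2710A7' to a linear byte position."""
    log_id, offset = lsn.split("/")
    return int(log_id, 16) * bytes_per_file + int(offset, 16)


def lag_bytes(master_lsn, standby_lsn, bytes_per_file=BYTES_PER_LOGICAL_FILE):
    """Return how many bytes of WAL the standby is behind the master."""
    return (lsn_to_bytes(master_lsn, bytes_per_file)
            - lsn_to_bytes(standby_lsn, bytes_per_file))


if __name__ == "__main__":
    # The two locations from the restart log above, used purely as sample input:
    # the redo start point and the point where consistency was reached.
    print(lag_bytes("61/DA2710A7", "61/D92C10C9"))
```

A result of 0 means both servers report the same location; a value that keeps growing means the slave has stopped applying WAL.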

My questions are:

  1. Does anyone know why the walreceiver was terminated?
  2. Why didn't PostgreSQL retry the replication?
  3. What can I do to prevent this in the future?

Thank you.

EDIT:

The database servers are running on SLES 11 with ext3. I found an article about low performance of SLES 11 with large RAM, but I am not sure whether it applies, since my machine has only 8 GB of RAM (https://www.novell.com/support/kb/doc.php?id=7010287).

Any help would be appreciated.

EDIT (2):

The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar issue:

Fix persistence marking of shared buffers during WAL replay (Jeff Davis)

This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.

Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html

Might this be the fix? If I upgrade to PostgreSQL 9.1.6, will everything run smoothly?

