Re: bad wal on replica / incorrect resource manager data checksum in record / zfs - Mailing list pgsql-hackers

From Alex Malek
Subject Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date
Msg-id CAGH8ccfa3fPoT0TizkrQ3Z4gz5XJi+pSBqN8CHUAHmqWEcf0zA@mail.gmail.com
In response to Re: bad wal on replica / incorrect resource manager data checksum in record / zfs  (Alex Malek <magicagent@gmail.com>)
List pgsql-hackers
On Wed, Feb 19, 2020 at 4:35 PM Alex Malek <magicagent@gmail.com> wrote:

Hello Postgres Hackers -

We are having a recurring issue on 2 of our replicas where replication stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks, during large data imports (100s of GBs being written),
on one of the two replicas.
Fixing the issue has been relatively straightforward: shut down the replica, remove the bad WAL file, restart the replica, and
the good WAL file is retrieved from the master.
We are doing streaming replication using replication slots.
However, twice now the master had already removed the WAL file, so the file had to be retrieved from the WAL archive.
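
For anyone repeating the fix above, the first step is working out which WAL file is the bad one. A minimal sketch in Python, assuming the location printed after "in record at" is an LSN of the form X/Y, that the server uses the default 16 MB WAL segment size, and that the timeline ID is already known (1 is used here purely as an illustration). The function name is hypothetical. It reports the segment in which the record starts; a record that crosses a segment boundary may have its damaged part in the following file.

```python
# Assumption: default build-time WAL segment size of 16 MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file in which the given LSN falls.

    lsn is the 'X/Y' pair of hex numbers from the error message.
    timeline is an assumption here; take the real value from the
    segment file names already present in the WAL directory.
    """
    high_hex, low_hex = lsn.strip().split("/")
    high = int(high_hex, 16)          # upper 32 bits of the LSN
    low = int(low_hex, 16)            # lower 32 bits of the LSN
    segment = low // segment_size     # segment number within this 4 GB range
    # File name layout: timeline, upper LSN half, segment number (8 hex digits each).
    return f"{timeline:08X}{high:08X}{segment:08X}"


if __name__ == "__main__":
    # Illustrative LSN only, not taken from the logs in this thread.
    print(wal_segment_name("16/B374D848"))
```

On a running server, the built-in SQL function pg_walfile_name() gives the same mapping without any manual arithmetic.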

The WAL directories on the master and the replicas are on ZFS file systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1


One quirk in our ZFS setup is that ZFS is not handling our RAID array, so ZFS sees our array as a single device.
....
<snip>


An update in case someone else encounters the same issue.

About 5 weeks ago, on the master database server, we turned off ZFS compression for the volume where the WAL resides.
The error has not occurred on any replica since.
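
For reference, a rough sketch of how that change could be scripted in Python. It is not the exact procedure used here: the dataset name tank/pg_wal is a made-up placeholder and the helper names are hypothetical. It relies only on the standard `zfs get` and `zfs set` subcommands and needs sufficient privileges to run them. Setting the property affects newly written blocks only; data already on disk stays compressed until it is rewritten.

```python
import subprocess


def compression_get_cmd(dataset: str) -> list:
    # -H: no header line, -o value: print only the property value.
    return ["zfs", "get", "-H", "-o", "value", "compression", dataset]


def compression_off_cmd(dataset: str) -> list:
    return ["zfs", "set", "compression=off", dataset]


def disable_compression(dataset: str, run=subprocess.run) -> str:
    """Turn compression off for a dataset and return the previous setting.

    `run` is injectable so the logic can be exercised without ZFS installed.
    """
    result = run(compression_get_cmd(dataset), check=True,
                 capture_output=True, text=True)
    previous = result.stdout.strip()
    if previous != "off":
        run(compression_off_cmd(dataset), check=True)
    return previous


if __name__ == "__main__":
    # "tank/pg_wal" is a placeholder; substitute the dataset holding the WAL.
    print(" ".join(compression_off_cmd("tank/pg_wal")))
```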

Best,
Alex
