Re: bad wal on replica / incorrect resource manager data checksum in record / zfs - Mailing list pgsql-hackers

From	Alex Malek
Subject	Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date	April 2, 2020 20:44:57
Msg-id	CAGH8ccfa3fPoT0TizkrQ3Z4gz5XJi+pSBqN8CHUAHmqWEcf0zA@mail.gmail.com
In response to	Re: bad wal on replica / incorrect resource manager data checksum in record / zfs (Alex Malek <magicagent@gmail.com>)
List	pgsql-hackers

On Wed, Feb 19, 2020 at 4:35 PM Alex Malek <magicagent@gmail.com> wrote:

Hello Postgres Hackers -

We are having a recurring issue on 2 of our replicas where replication stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks during large data imports (100s of GBs being written)
on one of two replicas.
Fixing the issue has been relatively straightforward: shut down the replica, remove the bad WAL file, restart the replica, and
the good WAL file is retrieved from the master.
We are doing streaming replication using replication slots.
However, twice now the master had already removed the WAL file, so the file had to be retrieved from the WAL archive.
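
The error message names a record location (an LSN such as `2A/5F000028`), while the fix above operates on a WAL file. A minimal sketch of mapping one to the other, and of the stop / move aside / start sequence, could look like the following. It assumes the default 16 MB segment size of PostgreSQL 10, timeline 1, `pg_ctl` on the PATH, and the WAL living under `pg_wal` in the data directory. The paths, the example LSN, and the function names are hypothetical and not part of PostgreSQL. It moves the file aside instead of deleting it, so the bad copy stays available for inspection.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 for 16 MB


def wal_segment_name(lsn: str, timeline: int = 1) -> str:
    """Return the 24-hex-digit WAL file name containing the given 'X/Y' LSN."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)
    segno = position // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def recovery_steps(data_dir: str, lsn: str, quarantine_dir: str, timeline: int = 1):
    """Build (but do not run) the commands for the fix described above."""
    segment = wal_segment_name(lsn, timeline)
    # Assumption: WAL files are reachable at <data_dir>/pg_wal (possibly a symlink).
    bad_file = f"{data_dir}/pg_wal/{segment}"
    return [
        ["pg_ctl", "stop", "-D", data_dir, "-m", "fast"],
        ["mv", bad_file, f"{quarantine_dir}/{segment}"],
        ["pg_ctl", "start", "-D", data_dir],
    ]


if __name__ == "__main__":
    # Hypothetical data directory, LSN, and quarantine location.
    for step in recovery_steps("/var/lib/pgsql/10/data", "2A/5F000028", "/tmp/bad_wal"):
        print(" ".join(step))
```

Once the replica restarts without the bad segment, it requests that segment again from the master, or falls back to the WAL archive if the master has already recycled it.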

The WAL directories on the master and the replicas are on ZFS file systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1

The issue seems similar to https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com and to https://github.com/timescale/timescaledb/issues/1443

One quirk in our ZFS setup is that ZFS is not managing our RAID array, so ZFS sees the array as a single device.
....
<snip>

An update in case someone else encounters the same issue.

About 5 weeks ago, on the master database server, we turned off ZFS compression for the volume where the WAL resides.

The error has not occurred on any replica since.
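
For anyone wanting to check or apply the same change, the setting involved is the ZFS `compression` property. Below is a minimal sketch, assuming a hypothetical dataset name `tank/pg_wal` (the actual dataset is not given in this thread). It only builds the `zfs` command lines and interprets the reported value; it does not run anything. Changing the property affects only newly written blocks, so existing data stays compressed until it is rewritten.

```python
def zfs_compression_commands(dataset: str):
    """Return the zfs command lines to inspect and disable compression."""
    return {
        # -H: scripted output without headers, -o value: print only the value
        "check": ["zfs", "get", "-H", "-o", "value", "compression", dataset],
        "disable": ["zfs", "set", "compression=off", dataset],
    }


def compression_is_off(zfs_get_output: str) -> bool:
    """Interpret the output of the 'check' command above."""
    return zfs_get_output.strip() == "off"


if __name__ == "__main__":
    # Hypothetical dataset name; substitute the dataset that holds the WAL.
    for name, argv in zfs_compression_commands("tank/pg_wal").items():
        print(name + ":", " ".join(argv))
```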

Best,

Alex
