Re: bad wal on replica / incorrect resource manager data checksum inrecord / zfs - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: bad wal on replica / incorrect resource manager data checksum inrecord / zfs
Date
Msg-id CAA4eK1+5qSt+dghvVWSzm06nDRoYCUtcfs-hAVStO=j_--W6hw@mail.gmail.com
Whole thread Raw
In response to bad wal on replica / incorrect resource manager data checksum inrecord / zfs  (Alex Malek <magicagent@gmail.com>)
List pgsql-hackers
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent@gmail.com> wrote:
>
>
> Hello Postgres Hackers -
>
> We are having a reoccurring issue on 2 of our replicas where replication stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large data imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straight forward: shutdown replica, remove the bad wal file, restart replica and
> the good wal file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However twice now, the master had already removed the WAL file so the file had to retrieved from the wal archive.
>
> The WAL log directories on the master and the replicas are on ZFS file systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to
https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.comand to
https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS sees our array as a single device.
>
> Right before the issue started we did some upgrades and altered some postgres configs and ZFS settings.
> We have been slowly rolling back changes but so far the the issue continues.
>
> Some interesting data points while debugging:
> We had lowered the ZFS recordsize from 128K to 32K and for that week the issue started happening every other day.
> Using xxd and diff we compared "good" and "bad" wal files and the differences were not random bad bytes.
>
> The bad file either had a block of zeros that were not in the good file at that position or other data.  Occasionally
thebad data has contained legible strings not in the good file at that position. At least one of those exact strings
hasexisted elsewhere in the files. 
> However I am not sure if that is the case for all of them.
>
> This made me think that maybe there was an issue w/ wal file recycling and ZFS under heavy load, so we tried lowering
> min_wal_size in order to "discourage" wal file recycling but my understanding is a low value discourages recycling
butit will still 
> happen (unless setting wal_recycle in psql 12).
>

We do print a message "recycled write-ahead log file .." in DEBUG2
mode.  You either want to run the server with DEBUG2 or maybe change
the code to make it LOG and see if that is printed.  If you do that,
you can verify if the corrupted WAL is the same as a recycled one.

> There is a third replica where this bug has not (yet?) surfaced.
> This leads me to guess the bad data does not originate on the master.
> This replica is older than the other replicas, slower CPUs, less RAM, and the WAL disk array is spinning disks.
> The OS, version of Postgres, and version of ZFS are the same as the other replicas.
> This replica is not using a replication slot.
> This replica does not serve users so load/contention is much lower than the others.
> The other replicas often have 100% utilization of the disk array that houses the (non-wal) data.
>
> Any insight into the source of this bug or how to address it?
>
> Since the master has a good copy of the WAL file, can the replica re-request  the file from the master? Or from the
archive?
>

I think we do check in the archive if we get the error during
streaming, but archive might also have the same data due to which this
problem happens.  Have you checked that the archive WAL file, is it
different from the bad WAL?  See the relevant bits of code in
WaitForWALToBecomeAvailable especially the code near below comment:

"Failure while streaming. Most likely, we got here because streaming
replication was terminated, or promotion was triggered. But we also
get here if we find an invalid record in the WAL streamed from master,
in which case something is seriously wrong. There's little chance that
the problem will just go away, but PANIC is not good for availability
either, especially in hot standby mode. So, we treat that the same as
disconnection, and retry from archive/pg_wal again. The WAL in the
archive should be identical to what was streamed, so it's unlikely
that it helps, but one can hope..."


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Add PostgreSQL home page to --help output
Next
From: Etsuro Fujita
Date:
Subject: Minor improvement to partition_bounds_copy()