Re: incorrect resource manager data checksum in record - Mailing list pgsql-general

From Thomas Munro
Subject Re: incorrect resource manager data checksum in record
Msg-id CAEepm=0wBTZeagm6xKaJknJ2_qPHAqvLmS8mkuMx6ixT=MTgbg@mail.gmail.com
In response to Re: incorrect resource manager data checksum in record  (Devin Christensen <quixoten@gmail.com>)
List pgsql-general
On Fri, Jun 29, 2018 at 1:14 PM, Devin Christensen <quixoten@gmail.com> wrote:
>> From your description it sounds like it's happening in the middle of
>> streaming, right?
>
> Correct. None of the instances in the chain experience a crash. Most of the
> time I see the "incorrect resource manager data checksum in record" error,
> but I've also seen it manifested as:
>
> invalid magic number 8813 in log segment 000000030000AEC20000009C, offset
> 15335424

I note that that isn't at a segment boundary (with the default 16MB
segments, a boundary would show up as offset 0; 15335424 is about
1.4MB short of the end of the segment).  Is that also the case for
the other error?

One theory would be that there is a subtle FS cache coherency problem
between writes and reads of a file from different processes
(causality) on that particular stack.  Maybe not too many programs
pass data through files while using IPC to signal progress in this
kinda funky way, but it would certainly be a violation of POSIX if
that didn't work correctly, and I think people would know about it,
so I feel a bit silly suggesting it.  To follow that hypothesis to
the next step: I suppose it succeeds after you restart because the
replica requests the whole segment again and gets a coherent copy all
the way down the chain.  Another idea would be that our flush pointer
tracking and IPC are somehow subtly wrong and that's exposed by
different timing leading to incoherent reads, but I feel like we
would know about that by now too.
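
To make that concrete, here is roughly the pattern I have in mind as
a stand-alone sketch (not PostgreSQL code; the file name and sizes
are invented and most error checking is omitted): a writer appends to
a file, flushes it, and only then advertises its progress through
shared memory, while a reader never reads past the advertised point.
If the reader ever sees stale bytes below that point, that's the kind
of coherency violation I'm hand-waving about.

/*
 * Stand-alone sketch, not PostgreSQL code: writer flushes a file and
 * then advertises progress via shared memory; reader only reads up
 * to the advertised point.  POSIX says the reader must see the data.
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK 8192      /* pretend this is one WAL block */
#define NCHUNKS 128

int
main(void)
{
    /* shared "flush pointer", advertised by the writer */
    _Atomic long *flushed = mmap(NULL, sizeof(*flushed),
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int fd = open("fakewal.bin", O_CREAT | O_TRUNC | O_RDWR, 0600);
    char buf[CHUNK];

    atomic_store(flushed, 0);

    if (fork() == 0)
    {
        /* "walreceiver": write a chunk, flush, then advertise progress */
        for (int i = 0; i < NCHUNKS; i++)
        {
            memset(buf, 'A' + i % 26, CHUNK);
            if (write(fd, buf, CHUNK) != CHUNK)
                exit(1);
            fdatasync(fd);
            atomic_store(flushed, (long) (i + 1) * CHUNK);
        }
        exit(0);
    }

    /* "startup process": never read past the advertised flush pointer */
    for (long done = 0; done < (long) NCHUNKS * CHUNK;)
    {
        long limit = atomic_load(flushed);  /* busy-wait keeps it short */

        for (; done < limit; done += CHUNK)
        {
            if (pread(fd, buf, CHUNK, done) != CHUNK ||
                buf[0] != 'A' + (done / CHUNK) % 26)
            {
                fprintf(stderr, "incoherent read at offset %ld\n", done);
                return 1;
            }
        }
    }
    wait(NULL);
    printf("all reads coherent\n");
    return 0;
}

I'd expect this to print "all reads coherent" everywhere; the
question is whether the equivalent dance could ever misbehave on that
particular ZFS/ext4 stack.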

I'm not really a replication expert, so I could be missing something
simple here.  Anyone?

>> I did find this similar complaint that involves an ext4 primary and a
>> btrfs replica:
>
> It is interesting that my issue occurs on the first hop from ZFS to ext4. I
> have not seen any instances of this happening going from the ext4 primary to
> the first ZFS replica.

I happen to have a little office server that uses ZFS, so I left it
chugging through a massive pgbench session with a chain of 3 replicas
while I worked on other stuff, and didn't see any problems (no ext4
involved though; this is a FreeBSD box).  I also tried
--wal-segsize=1 (1MB segments, a new initdb option in 11) to get some
more frequent recycling happening, just in case that was relevant.

>> We did have a report recently of ZFS recycling WAL files very slowly
>
> Do you know what version of ZFS that affected? We're currently on 0.6.5.6,
> but could upgrade to 0.7.5 on Ubuntu 18.04

I think that issue is fundamental and affects all versions; it has
something to do with the record size (if you have 128KB ZFS records
and someone writes 8KB, ZFS probably needs to read the whole 128KB
record in first, a 16:1 read amplification, whereas with ext4 et al
you have 4KB blocks and the OS can very often skip reading them in
because it can see you're entirely overwriting whole blocks), and
possibly with the COW design too (I dunno).  Here's the recent
thread, which points back to an older one, from some Joyent guys who
I gather are heavy ZFS users:

https://www.postgresql.org/message-id/flat/CACPQ5FpEY9CfUF6XKs5sBBuaOoGEiO8KD4SuX06wa4ATsesaqg%40mail.gmail.com

There was a ZoL bug that made headlines recently, but that was in
0.7.7, so it isn't relevant to your case.
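
If anyone wants to see the recycling cost directly, here's a rough
stand-alone micro-test (invented for this email, not something from
that thread): it replays the WAL overwrite pattern -- rewriting an
existing 16MB file 8KB at a time with a flush after each write -- so
you can compare the read traffic it causes on a 128KB-record ZFS
dataset against ext4 with iostat / zpool iostat.

/*
 * Rough micro-test of the WAL recycling write pattern.  Point it at
 * an existing 16MB file -- like a recycled segment -- and watch the
 * read traffic while it overwrites the file 8KB at a time.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192
#define SEGSZ (16 * 1024 * 1024)

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "recycled_segment";
    char page[BLCKSZ];
    int fd = open(path, O_WRONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(page, 0x7f, sizeof(page));
    for (off_t off = 0; off < SEGSZ; off += BLCKSZ)
    {
        if (pwrite(fd, page, BLCKSZ, off) != BLCKSZ)
        {
            perror("pwrite");
            return 1;
        }
        fdatasync(fd);          /* WAL is flushed as it goes */
    }
    close(fd);
    return 0;
}

Pre-create the 16MB target file (copying an old segment would do) and
make sure it isn't already cached, or the read side of the
read-modify-write won't show up.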

-- 
Thomas Munro
http://www.enterprisedb.com

