From: Tom Lane
Subject: Re: WIP: WAL prefetch (another approach)
Date:
Msg-id: 892344.1619054465@sss.pgh.pa.us
In response to: Re: WIP: WAL prefetch (another approach)  (Stephen Frost <sfrost@snowman.net>)
Responses: Re: WIP: WAL prefetch (another approach)  (Andres Freund <andres@anarazel.de>)
           Re: WIP: WAL prefetch (another approach)  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

Stephen Frost <sfrost@snowman.net> writes:
> On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote:
>> ... Personally I think the right thing to do now is to revert it
>> and re-propose for 15 early in the cycle, supported with some better
>> testing infrastructure.

> I tend to agree with the idea to revert it, perhaps a +0 on that, but if
> others argue it should be fixed in-place, I wouldn’t complain about it.

FWIW, I've so far only been able to see problems on two old PPC Macs,
one of which has been known to be a bit flaky in the past.  So it's
possible that what I'm looking at is a hardware glitch.  But it's
consistent enough that I rather doubt that.

What I'm doing is running the core regression tests with a single
standby (on the same machine) and wal_consistency_checking = all.
Fairly reproducibly (more than one run in ten), what I get on the
slightly-flaky machine is consistency check failures like

2021-04-21 17:42:56.324 EDT [42286] PANIC:  inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset 2069: replay 0x00 primary 0x03
2021-04-21 17:42:56.324 EDT [42286] CONTEXT:  WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel 1663/354383/357033, blk 9 FPW
2021-04-21 17:45:11.662 EDT [42284] LOG:  startup process (PID 42286) was terminated by signal 6: Abort trap

2021-04-21 11:25:30.091 EDT [38891] PANIC:  inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset 3845: replay 0x00 primary 0x99
2021-04-21 11:25:30.091 EDT [38891] CONTEXT:  WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW
2021-04-21 11:26:59.371 EDT [38889] LOG:  startup process (PID 38891) was terminated by signal 6: Abort trap

2021-04-20 19:20:16.114 EDT [34405] PANIC:  inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset 6149: replay 0x37 primary 0x03
2021-04-20 19:20:16.114 EDT [34405] CONTEXT:  WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff 171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW
2021-04-20 19:21:54.421 EDT [34403] LOG:  startup process (PID 34405) was terminated by signal 6: Abort trap

2021-04-20 17:44:09.356 EDT [24106] FATAL:  inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset 6152: replay 0x40 primary 0x00
2021-04-20 17:44:09.356 EDT [24106] CONTEXT:  WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843, blk 101 FPW

(Note that I modified checkXLogConsistency to PANIC on failure, so I could
get a core dump to analyze, and to print the location of the first
mismatching byte.)
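
For reference, that hack amounts to replacing checkXLogConsistency's
memcmp-and-FATAL with something along these lines (variable names follow the
existing function; the exact code here is approximate, but the message format
matches the log excerpts above):

    /* approximate reconstruction of the diagnostic hack described above */
    if (memcmp(replay_image_masked, primary_image_masked, BLCKSZ) != 0)
    {
        int         i;

        /* locate the first byte that differs between the two masked images */
        for (i = 0; i < BLCKSZ; i++)
        {
            if (replay_image_masked[i] != primary_image_masked[i])
                break;
        }

        /* PANIC instead of FATAL, to get a core dump of the startup process */
        elog(PANIC,
             "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, byte offset %d: replay 0x%02x primary 0x%02x",
             rnode.spcNode, rnode.dbNode, rnode.relNode,
             forknum, blkno, i,
             (unsigned char) replay_image_masked[i],
             (unsigned char) primary_image_masked[i]);
    }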

I have not analyzed each one of these failures exhaustively, but on the
ones I have looked at closely, the replay_image_masked version of the page
appears correct while the primary_image_masked version is *not*.
Moreover, the primary_image_masked version does not match the full-page
image that I see in the on-disk WAL file.  It did however seem to match
the in-memory WAL record contents that the decoder is operating on.
So unless you want to believe the buggy-hardware theory, something's
occasionally messing up while loading WAL records from disk.  All of the
trouble cases involve records that span across WAL pages (unsurprising
since they contain FPIs), so maybe there's something not quite right
in there.
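
(To be explicit about what "span across WAL pages" means here: a record has
to be reassembled by the reader whenever it doesn't fit in the remainder of
the XLOG_BLCKSZ page its header starts on.  Roughly, and purely for
illustration --- this helper isn't anything that exists in the tree:

    /*
     * Illustrative only: a record starting at "lsn" with total length
     * "tot_len" (xl_tot_len) must be reassembled across a WAL page boundary
     * whenever it doesn't fit in the rest of its first page.
     */
    static bool
    record_crosses_wal_page(XLogRecPtr lsn, uint32 tot_len)
    {
        return (lsn % XLOG_BLCKSZ) + tot_len > XLOG_BLCKSZ;
    }

A record carrying a full-page image is nearly guaranteed to satisfy that
condition, which is why it's unsurprising that all the failing records are
cross-page ones.)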

In the cases that I looked at closely, it appeared that there was a
block of 32 wrong bytes somewhere within the page image, with the data
before and after that being correct.  I'm not sure if that pattern
holds in all cases though.

BTW, if I restart the failed standby, it plows through the same data
just fine, confirming that the on-disk WAL is not corrupt.

The other PPC machine (with no known history of trouble) is the one
that had the CRC failure I showed earlier.  That one does seem to be
actual bad data in the stored WAL, because the problem was also seen
by pg_waldump, and trying to restart the standby got the same failure
again.  I've not been able to duplicate the consistency-check failures
there.  But because that machine is a laptop with a much inferior disk
drive, the speeds are different enough that it's not really surprising
if it doesn't hit the same problem.

I've also tried to reproduce on 32-bit and 64-bit Intel, without
success.  So if this is real, maybe it's related to being big-endian
hardware?  But it's also quite sensitive to $dunno-what, maybe the
history of WAL records that have already been replayed.

            regards, tom lane


