Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WIP: WAL prefetch (another approach)
Msg-id f2be6caa-5a7a-990b-c56e-a29454ae1cee@enterprisedb.com
In response to Re: WIP: WAL prefetch (another approach)  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: WIP: WAL prefetch (another approach)
List pgsql-hackers

On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes.  However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
> 
> Oh, wow.  There are several surprising results there.  Thanks for
> running those tests for so long so that we could see the rarest
> failures.
> 
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
> 
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
> 
> +1
> 
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware.  I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory.  But it doesn't
>> smell like that kind of problem to me.  I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware.  This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
> 
> Hmm, yeah that does seem plausible.  It would be nice to see a report
> from any other system though.  I'm still trying, and reviewing...
> 

FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines: two x86_64 boxes and two rpi4s. The x86_64 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.
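
For anyone who wants to repeat this on other hardware, here is a minimal
sketch of such a loop. It is not the exact script used for the runs above.
The helper name run_rounds and its parameters are made up for illustration,
and it assumes it is started from the top of a built source tree with a
server already running for installcheck to connect to. It stops at the
first failing round so that whatever the failed run left behind is not
overwritten by the next one.

```python
import subprocess


def run_rounds(cmd, rounds, cwd=None):
    """Run `cmd` up to `rounds` times.

    Returns the number of rounds that passed before the first failure
    (equal to `rounds` if every round passed). `run_rounds` is a
    hypothetical helper, not part of PostgreSQL.
    """
    for completed in range(rounds):
        # A non-zero exit status from the command counts as a failed round.
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            return completed
    return rounds
```

Something like run_rounds(["make", "installcheck-parallel"], 1000) would
then correspond to one of the ~1000-round runs; the round count is just a
parameter.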

Obviously, it's possible there's something that neither of those (very
different) systems triggers, but I'd say it might also be a hint that
this really is a hw issue on the old ppc macs. Or maybe something very
specific to that arch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


