On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes. However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
>
> Oh, wow. There are several surprising results there. Thanks for
> running those tests for so long so that we could see the rarest
> failures.
>
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
>
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
>
> +1
>
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware. I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory. But it doesn't
>> smell like that kind of problem to me. I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware. This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
>
> Hmm, yeah that does seem plausible. It would be nice to see a report
> from any other system though. I'm still trying, and reviewing...
>
FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.
Obviously, it's possible there's something that none of those (very
different) systems triggers, but I'd say it might also be a hint that
this really is a hw issue on the old ppc macs. Or maybe something very
specific to that arch.
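
In case anyone wants to repeat this on other hardware, here is a rough
sketch of that kind of loop. It is not the exact script used for the
numbers above; the run_rounds helper and its parameters are made up for
illustration. It simply reruns a command and stops at the first non-zero
exit status, so the output of the failing round is not overwritten by
the next one.

```python
import subprocess


def run_rounds(cmd, rounds, cwd=None):
    """Run cmd up to `rounds` times.

    Returns the 1-based number of the first round that exits with a
    non-zero status, or None if every round passed.
    """
    for i in range(1, rounds + 1):
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,  # discard output; only the exit status matters here
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            return i  # stop immediately so the failing state can be inspected
    return None


# Hypothetical invocation (assumes a built source tree at the given path
# and a server already running for installcheck):
#   run_rounds(["make", "installcheck-parallel"], rounds=1000,
#              cwd="/path/to/postgresql")
```

The round count and the path are placeholders; a rare failure may need
far more rounds than a quick run will give you.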

regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company