Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: WIP: WAL prefetch (another approach) |
Date | |
Msg-id | f2be6caa-5a7a-990b-c56e-a29454ae1cee@enterprisedb.com |
In response to | Re: WIP: WAL prefetch (another approach) (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: WIP: WAL prefetch (another approach) (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes. However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
>
> Oh, wow. There are several surprising results there. Thanks for
> running those tests for so long so that we could see the rarest
> failures.
>
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
>
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
>
> +1
>
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware. I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory. But it doesn't
>> smell like that kind of problem to me. I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware. This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
>
> Hmm, yeah that does seem plausible. It would be nice to see a report
> from any other system though. I'm still trying, and reviewing...
>

FWIW I've run the test (make installcheck-parallel in a loop) on four
different machines - two x86_64 ones, and two rpi4. The x86 boxes did
~1000 rounds each (and one of them had 5 local replicas) without any
issue. The rpi4 machines did ~50 rounds each, also without failures.

Obviously, it's possible there's something that neither of those (very
different) systems triggers, but I'd say it might also be a hint that
this really is a hw issue on the old ppc macs. Or maybe something very
specific to that arch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
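The message does not show the loop itself. As a minimal sketch of how "make installcheck-parallel in a loop" might be scripted, the helper below repeats a command and stops at the first failing round. The function name `run_rounds` and the `LOG` variable are hypothetical and not taken from the thread; the sketch also assumes it is run from a built PostgreSQL source tree against an already running server, which is what `installcheck` targets require.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a command repeatedly and
# stop at the first failing round.
# Usage: run_rounds ROUNDS COMMAND [ARGS...]
# Command output is appended to the file named by $LOG (default: /dev/null).
run_rounds() {
    rounds=$1
    shift
    i=1
    while [ "$i" -le "$rounds" ]; do
        if ! "$@" >>"${LOG:-/dev/null}" 2>&1; then
            echo "round $i: FAILED"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $rounds rounds without failure"
}

# Example invocation (assumption: current directory is a built PostgreSQL
# source tree and a server is already running):
#   LOG=installcheck.log run_rounds 1000 make installcheck-parallel
```

Stopping at the first failure leaves the state of that round (for example any `regression.diffs` file written by the regression driver) in place for inspection instead of letting a later round overwrite it.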