Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: WIP: WAL prefetch (another approach)
Date
Msg-id CA+hUKGLGKnzTDhW9rnzyYb7yvUiCoGK5pB74qQf3YYLZYSX4OA@mail.gmail.com
In response to Re: WIP: WAL prefetch (another approach)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP: WAL prefetch (another approach)  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That last point means that there was some hard-to-hit problem even
> before any of the recent WAL-related changes.  However, 323cbe7c7
> (Remove read_page callback from XLogReader) increased the failure
> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
> referenced data) seems to have increased it by another factor of 4.
> But it looks like f003d9f87 (Add circular WAL decoding buffer)
> didn't materially change the failure rate.

Oh, wow.  There are several surprising results there.  Thanks for
running those tests for so long so that we could see the rarest
failures.

Even if there are somehow *two* causes of corruption, one preexisting
and one added by the refactoring or decoding patches, I'm struggling
to understand how the failure rate increases with 1d257577e, since that
only adds code that isn't reached when prefetching is not enabled
(though I'm going to re-review that).

> Considering that 323cbe7c7 was supposed to be just refactoring,
> and 1d257577e is allegedly disabled-by-default, these are surely
> not the results I was expecting to get.

+1

> It seems like it's still an open question whether all this is
> a real bug, or flaky hardware.  I have seen occasional kernel
> freezeups (or so I think -- machine stops responding to keyboard
> or network input) over the past year or two, so I cannot in good
> conscience rule out the flaky-hardware theory.  But it doesn't
> smell like that kind of problem to me.  I think what we're looking
> at is a timing-sensitive bug that was there before (maybe long
> before?) and these commits happened to make it occur more often
> on this particular hardware.  This hardware is enough unlike
> anything made in the past decade that it's not hard to credit
> that it'd show a timing problem that nobody else can reproduce.

Hmm, yeah, that does seem plausible.  It would be nice to see a report
from any other system, though.  I'm still trying, and reviewing...


