On Fri, Sep 2, 2022 at 6:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... The active ingredient here is a setting of
> maintenance_io_concurrency=0, which runs into a dumb accounting problem
> of the fencepost variety and incorrectly concludes it's reached the
> end early. Setting it to 3 or higher allows his system to complete
> recovery. I'm working on a fix ASAP.
The short version is that when tracking the number of IOs in progress,
I had two steps in the wrong order in the algorithm for figuring out
whether IO is saturated. Internally, the effect of
maintenance_io_concurrency is clamped to 2 or more, and that mostly
hides the bug until you try to replay a particular sequence like
Justin's with such a low setting.  Without that clamp, setting it to 1
makes several of our recovery tests fail.
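To show the general shape of the mistake, here's a made-up toy, not the
actual xlogprefetcher code (all names are invented): the saturation check
runs before completed IOs are retired, so with a limit of 1 the in-flight
counter never appears to drop and the loop gives up early.

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: "prefetch" nblocks blocks under a concurrency limit,
 * where each started IO completes by the next loop iteration.
 */
static int
blocks_prefetched(int nblocks, int io_limit, bool check_before_retire)
{
    int inflight = 0;
    int started = 0;
    int completed = 0;      /* IOs finished but not yet retired */

    while (started < nblocks)
    {
        /* buggy order: test saturation before retiring completed IOs */
        if (check_before_retire && inflight >= io_limit)
            break;

        inflight -= completed;  /* retire completed IOs */
        completed = 0;

        /* fixed order: test saturation only after retiring */
        if (inflight >= io_limit)
            break;

        started++;              /* start one more IO ... */
        inflight++;
        completed++;            /* ... and it completes "soon" */
    }
    return started;
}

int
main(void)
{
    printf("buggy order, limit 1: %d of 10\n", blocks_prefetched(10, 1, true));
    printf("fixed order, limit 1: %d of 10\n", blocks_prefetched(10, 1, false));
    return 0;
}

With the limit forced up to 2, the early check never trips in this toy,
which is roughly why the clamp was hiding the problem.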
That clamp was a bad idea. What I think we really want is for
maintenance_io_concurrency=0 to disable recovery prefetching exactly
as if you'd set recovery_prefetch=off, and any other setting including
1 to work without clamping.
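In code terms, the behaviour I'm after looks something like this
(a simplified sketch, not the patch itself; the real recovery_prefetch
GUC is an enum, reduced to a bool here):

#include <stdbool.h>
#include <stdio.h>

/* stand-ins for the two GUCs */
static bool recovery_prefetch = true;
static int maintenance_io_concurrency = 0;

/*
 * 0 disables recovery prefetching outright, instead of being clamped
 * up to 2; any nonzero value, including 1, is used as the IO limit
 * without clamping.
 */
static bool
recovery_prefetching_enabled(void)
{
    return recovery_prefetch && maintenance_io_concurrency > 0;
}

int
main(void)
{
    printf("prefetching enabled: %d, io limit: %d\n",
           recovery_prefetching_enabled(), maintenance_io_concurrency);
    return 0;
}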
Here's the patch I'm currently testing. It also fixes a related
dangling reference problem with very small maintenance_io_concurrency.
I had this more or less figured out on Friday when I wrote last, but I
got stuck on a weird problem with 026_overwrite_contrecord.pl. I
think that failure case should report an error, no? I find it strange
that we end recovery in silence.  That was a problem for the new
coding in this patch: it is confused by an XLREAD_FAIL that arrives
without a queued error, so it retries, and the retry clobbers the
aborted recptr state.  I'm still looking into that.
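To make the clobbering concrete, here is a made-up toy with invented
names (the real interaction is between the xlogreader retry path and the
aborted-contrecord bookkeeping, and differs in detail): a failure that
doesn't queue an error message sends us back around the loop, and
re-initialising the reader state on the retry wipes the record pointer
we'd saved.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ToyReader
{
    uint64_t aborted_recptr;    /* remembered for end-of-recovery */
    const char *errmsg;         /* queued error message, if any */
} ToyReader;

/* First attempt fails without queuing an error; second succeeds. */
static bool
read_one_record(ToyReader *reader, int attempt)
{
    if (attempt == 0)
    {
        reader->aborted_recptr = 0x123456;  /* noticed mid-read */
        reader->errmsg = NULL;              /* ... but no error queued */
        return false;
    }
    return true;
}

int
main(void)
{
    ToyReader reader = {0, NULL};

    for (int attempt = 0;; attempt++)
    {
        /* retry path re-initialises state, wiping aborted_recptr */
        reader.aborted_recptr = 0;
        reader.errmsg = NULL;

        if (read_one_record(&reader, attempt))
            break;
        if (reader.errmsg != NULL)
            break;              /* would report the error and stop */
        /* failure with no queued error: silently go around again */
    }

    printf("aborted recptr after loop: 0x%llx\n",
           (unsigned long long) reader.aborted_recptr);
    return 0;
}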