Re: pgsql: Improve runtime and output of tests for replication slots checkp - Mailing list pgsql-committers

From Tom Lane
Subject Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date
Msg-id 2852529.1750553810@sss.pgh.pa.us
Whole thread Raw
In response to Re: pgsql: Improve runtime and output of tests for replication slots checkp  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: pgsql: Improve runtime and output of tests for replication slots checkp
List pgsql-committers
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Sat, Jun 21, 2025 at 2:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> But "Wait for the next page to become available" seems awfully
>> trusting that there will be another page.  Should this be
>> using the no-wait code path?

> Thank you for the help.  It seems to me that problem is deeper.  The
> code seems to only trying to read till the end of given WAL record,
> but can't reach it.  According to the values I've seen in XLogCtl, it
> seems that RedoRecPtr points somewhere inside of that record's body.
> I don't feel confident about to understand what's going on and how to
> fix it.

Hmm.  My theory about what's happening is that we are writing a WAL
record that spans across a page boundary, and the asynchronous
immediate-stop request comes in and kills that operation, so that
the first half of the record is on disk but the second is not.
(This outcome would obviously be very timing-dependent.)  Then
we restart the server, and crash recovery copes with that situation
successfully.  But pg_logical_slot_get_changes() fails to, because
it's coded in such a way that it will wait forever for the second
half of the WAL record to appear.

The main hole in this theory is: if crash recovery thought that the
partial record was invalid, wouldn't it set the end-of-WAL position
to the start of that record?  And shouldn't that stop
pg_logical_slot_get_changes() from trying to process the partial
record?  So I'm sure there's some details not quite right about
this picture.  But maybe it's within hailing distance of the truth.

> The new tests just spotted the existing bug.

Yeah, that's what I think too.  The unintentional omission of a
pre-shutdown delay in the 046 test has exposed some pre-existing
fragility in pg_logical_slot_get_changes().  So I'm not in favor
of changing 046 till we understand this better.

            regards, tom lane



pgsql-committers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: pgsql: Improve runtime and output of tests for replication slots checkp
Next
From: Peter Eisentraut
Date:
Subject: pgsql: meson: Fix meson warning