Re: BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #17928: Standby fails to decode WAL on termination of primary
Msg-id ZNXQWvnmt1Fpt6vu@paquier.xyz
In response to Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Noah Misch <noah@leadboat.com>)
List pgsql-bugs
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
> On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
>> Good idea to pollute the data with recycled segments.  Using a minimal
>> WAL segment size would help here as well in keeping a test cheap, and
>> two segments should be enough.  The alignment calculations and the
>> header size can be known, but the standby records are an issue for the
>> predictability of the test when it comes to adjusting the length of the
>> logical message depending on the 8k WAL page, no?
>
> Could be.  I expect there would be challenges translating that outline into a
> real test, but I don't know if that will be a major one.  The test doesn't
> need to be 100% deterministic.  If it fails 25% of the time and is not the
> slowest test in the recovery suite, I'd find that good enough.

FWIW, I'm having a pretty hard time getting something close enough to a
page border in a reliable way.  Perhaps using a larger series of
records and selecting only one would be more reliable.  Need to test
that a bit more.
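
As a rough illustration of the arithmetic involved, here is a minimal
sketch of how the logical message length could be chosen so that a
record ends a given number of bytes before the page border.  The 8192
page size and 8-byte alignment are assumptions matching a default
build, and `overhead` is a hypothetical parameter standing in for the
record header plus the fixed part of the logical message, which the
test would have to supply.  The hard part described above is
`start_offset`: any standby record inserted in between moves it.

```python
XLOG_BLCKSZ = 8192  # assumed WAL page size (default build)
MAXALIGN = 8        # assumed maximum alignment (typical 64-bit platform)


def maxalign(n):
    """Round n up to the next multiple of MAXALIGN."""
    return (n + MAXALIGN - 1) & ~(MAXALIGN - 1)


def payload_len_for_gap(start_offset, overhead, gap):
    """Return the payload length that makes a record end `gap` bytes
    before the end of the current page.

    start_offset: offset within the page where the record would start
    overhead:     hypothetical fixed bytes preceding the payload
    gap:          bytes to leave between the record end and the page end
    """
    if not 0 <= start_offset < XLOG_BLCKSZ:
        raise ValueError("start_offset must be within one page")
    start = maxalign(start_offset)  # records start on aligned offsets
    payload = XLOG_BLCKSZ - gap - start - overhead
    if payload < 0:
        raise ValueError("record cannot end before the page border")
    return payload
```

With a record starting at offset 4000, an overhead of 60 bytes and a
gap of 16 bytes, the payload would need to be 4116 bytes long.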

>> FWIW, I came back to this thread while tweaking the error reporting of
>> xlogreader.c for the sake of this thread and this proposal is an
>> improvement to be able to make a distinction between an OOM and an
>> incorrect record:
>> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz
>>
>> Anyway, agreed that it's an improvement to remove this check out of
>> allocate_recordbuf().  Noah, are you planning to work more on that?
>
> I can push xl_tot_len-validate-v1.patch, particularly given the testing result
> you reported today.  I'm content for my part to stop there.

Okay, fine by me.  That's going to help with what I am doing in the
other thread, as I need to better distinguish between the OOM and the
invalid cases in the allocation path.
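
To illustrate the distinction, here is a minimal sketch of the
general idea, not the xlogreader.c code: the length is checked before
any allocation, so a bogus length is reported as an invalid record
while a failed allocation is reported separately as an OOM.  The
names `ReadResult`, `MIN_RECORD_LEN` and `read_record` are
hypothetical, and a real reader would validate more header fields
than the length alone.

```python
from enum import Enum


class ReadResult(Enum):
    OK = "ok"
    INVALID = "invalid record"
    OOM = "out of memory"


MIN_RECORD_LEN = 24  # hypothetical minimum record length for this sketch


def read_record(total_len, alloc=bytearray):
    """Validate total_len first, then allocate a buffer for the record.

    alloc is injectable so that an allocation failure can be simulated.
    """
    # A length too small to hold a header means the data is invalid,
    # whatever the state of the memory.
    if total_len < MIN_RECORD_LEN:
        return ReadResult.INVALID, None
    try:
        buf = alloc(total_len)
    except MemoryError:
        # The length looked sane, so this is a resource problem and not
        # a sign of corrupted or recycled WAL data.
        return ReadResult.OOM, None
    return ReadResult.OK, buf
```

A caller can then treat INVALID as the end of valid WAL while handling
OOM as a separate, possibly retryable, failure.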

You are planning a backpatch to take care of the inconsistency,
right?  The report was against 15~, where the prefetching was
introduced.  I'd be OK to just do that and not mess with the older
stable branches (aka ~14) more than necessary if nobody complains,
especially with REL_11_STABLE planned to be EOL'd in the next minor
cycle.
--
Michael