Re: BUG #17928: Standby fails to decode WAL on termination of primary - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date
Msg-id CA+hUKGLhFdd-G1DCk9Ze3KnQ_2jUkxyXvGwV_Ha=DWMrDnGHng@mail.gmail.com
In response to Re: BUG #17928: Standby fails to decode WAL on termination of primary  (Michael Paquier <michael@paquier.xyz>)
Responses Re: BUG #17928: Standby fails to decode WAL on termination of primary
List pgsql-bugs
On Tue, Aug 15, 2023 at 6:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> I've been spending some extra time on this one and hacked a TAP test
> that reliably reproduces the original issue, using a message similar
> to what I mentioned in my previous messages.

Nice.

I hacked on this idea for quite a long time yesterday and today and
came up with a set of tests for the main end-of-WAL conditions:

▶ 1/1 - xl_tot_len zero                                  OK
▶ 1/1 - xl_tot_len short                                 OK
▶ 1/1 - xl_prev bad                                      OK
▶ 1/1 - xl_crc bad                                       OK
▶ 1/1 - xlp_magic zero                                   OK
▶ 1/1 - xlp_magic bad                                    OK
▶ 1/1 - xlp_pageaddr bad                                 OK
▶ 1/1 - xlp_info bad                                     OK
▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD           OK
▶ 1/1 - xlp_rem_len bad                                  OK
▶ 1/1 - xlp_magic zero (split record header)             OK
▶ 1/1 - xlp_pageaddr bad (split record header)           OK
▶ 1/1 - xlp_rem_len bad (split record header)            OK
1/1 postgresql:recovery / recovery/038_end_of_wal        OK
    5.79s   13 subtests passed
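
For readers following along, here is a minimal Python sketch of the header fields those subtests poke at.  It is not the attached Perl test and not the xlogreader logic.  The struct layouts are an assumption based on XLogRecord (xlogrecord.h) and XLogPageHeaderData (xlog_internal.h) on a typical little-endian 64-bit build, and the function names and returned labels are hypothetical, chosen only to mirror the subtest names above.  The CRC check is left out because it needs the whole record.

```python
import struct

# Assumed fixed record header: xl_tot_len u32, xl_xid u32, xl_prev u64,
# xl_info u8, xl_rmid u8, 2 padding bytes, xl_crc u32  (24 bytes).
XLOG_RECORD = struct.Struct("<IIQBB2xI")

# Assumed short page header: xlp_magic u16, xlp_info u16, xlp_tli u32,
# xlp_pageaddr u64, xlp_rem_len u32, 4 bytes alignment padding  (24 bytes).
XLOG_PAGE_HEADER = struct.Struct("<HHIQI4x")

XLP_FIRST_IS_CONTRECORD = 0x0001


def classify_record_header(raw, expected_prev):
    """Return a label for the first problem found, or None if none is found."""
    tot_len, _xid, prev, _info, _rmid, _crc = XLOG_RECORD.unpack(
        raw[:XLOG_RECORD.size])
    if tot_len == 0:
        return "xl_tot_len zero"
    if tot_len < XLOG_RECORD.size:
        return "xl_tot_len short"
    if prev != expected_prev:
        return "xl_prev bad"
    return None


def classify_page_header(raw, expected_magic, expected_pageaddr,
                         expect_contrecord):
    """Same idea for a page header; expected values come from the caller."""
    magic, info, _tli, pageaddr, _rem_len = XLOG_PAGE_HEADER.unpack(
        raw[:XLOG_PAGE_HEADER.size])
    if magic == 0:
        return "xlp_magic zero"
    if magic != expected_magic:
        return "xlp_magic bad"
    if pageaddr != expected_pageaddr:
        return "xlp_pageaddr bad"
    if expect_contrecord and not (info & XLP_FIRST_IS_CONTRECORD):
        return "xlp_info lacks XLP_FIRST_IS_CONTRECORD"
    return None
```

A test in this style would write deliberately broken bytes at the end of WAL and then check that recovery stops there with the matching complaint.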

It took me a while to come up with a workable way to get into the
record-header-splitting zone.  Based on some of your clues about
flushing, I eventually realised I needed transactional messages, and I
built a kind of self-calibrating Rube Goldberg function around that.
It's terrible, and I'm sure we can do better.
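
The "record-header-splitting zone" comes down to simple arithmetic, sketched below in Python.  This is only an illustration, not the calibration function from the patch.  It assumes the default XLOG_BLCKSZ of 8192 and a 24-byte fixed record header; both constants and both function names are assumptions to be checked against the build under test.

```python
XLOG_BLCKSZ = 8192          # assumed default WAL page size
SIZE_OF_XLOG_RECORD = 24    # assumed size of the fixed record header


def bytes_left_on_page(lsn):
    """Bytes from this WAL position to the end of its page."""
    return XLOG_BLCKSZ - (lsn % XLOG_BLCKSZ)


def header_would_split(lsn):
    """True if a record header starting at lsn cannot fit on its page."""
    return bytes_left_on_page(lsn) < SIZE_OF_XLOG_RECORD
```

Since records start on 8-byte boundaries, only start positions 8 or 16 bytes before a page boundary land in that zone under these assumptions, which is why it takes some effort to steer the insert position there.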

I wonder what people think about putting internal details of the WAL
format into a Perl test like this.  Obviously it requires maintenance,
since it knows the size and layout of a few things.  I guess it'd be
allowed to fish a couple of those numbers out of the source.
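
As a rough illustration of what fishing numbers out of the source could look like, here is a Python sketch that pulls integer #define values out of header text.  The sample text is a made-up stand-in, not a copy of any real header, and the helper name is hypothetical.  A real test would read the actual files and would need to cope with macros that are not plain literals.

```python
import re

# Matches lines such as "#define NAME 8192" or "#define NAME 0x0001".
DEFINE_RE = re.compile(
    r"^\s*#\s*define\s+(\w+)\s+(0[xX][0-9a-fA-F]+|\d+)\b", re.MULTILINE)


def extract_int_defines(header_text):
    """Return {macro name: integer value} for simple literal #defines."""
    return {name: int(value, 0)
            for name, value in DEFINE_RE.findall(header_text)}


# Illustrative stand-in for header contents.
SAMPLE = """
#define XLOG_BLCKSZ 8192
#define XLP_FIRST_IS_CONTRECORD		0x0001
#define SOMETHING_ELSE (1 << 3)
"""

defines = extract_int_defines(SAMPLE)
```

Macros defined by expressions, like the last line of the sample, are skipped on purpose rather than guessed at.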

Work in progress...  I'm sure more useful checks could be added.  One
thing that occurred to me while thinking about all this is that the
'treat malloc failure as end of WAL' thing you highlighted in another
thread is indeed completely bananas -- I didn't go digging, but
perhaps it was an earlier solution to the very same garbage xl_tot_len
problem, before 70b4f82a4b5 and now this xlp_rem_len-based solution?

Attachment
