On Tue, Aug 15, 2023 at 6:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> I've been spending some extra time on this one and hacked a TAP test
> that reliably reproduces the original issue, using a message similar
> to what I mentioned in my previous messages.
Nice.
I hacked on this idea for quite a long time yesterday and today and
came up with a set of tests for the main end-of-WAL conditions:
▶ 1/1 - xl_tot_len zero OK
▶ 1/1 - xl_tot_len short OK
▶ 1/1 - xl_prev bad OK
▶ 1/1 - xl_crc bad OK
▶ 1/1 - xlp_magic zero OK
▶ 1/1 - xlp_magic bad OK
▶ 1/1 - xlp_pageaddr bad OK
▶ 1/1 - xlp_info bad OK
▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD OK
▶ 1/1 - xlp_rem_len bad OK
▶ 1/1 - xlp_magic zero (split record header) OK
▶ 1/1 - xlp_pageaddr bad (split record header) OK
▶ 1/1 - xlp_rem_len bad (split record header) OK
1/1 postgresql:recovery / recovery/038_end_of_wal OK 5.79s 13 subtests passed
It took me a while to come up with a workable way to get into the
record-header-splitting zone. Based on some of your clues about
flushing, I eventually realised I needed transactional messages, and I
built a kind of self-calibrating Rube Goldberg function around that.
It's terrible, and I'm sure we can do better.
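
For anyone following along, the arithmetic behind "the record-header-splitting zone" can be sketched roughly like this. It is only an illustration in Python, not the Perl from the test, and it rests on assumptions: the default 8kB WAL block size, a 24-byte record header, and 8-byte record alignment. The function names are made up for the example.

```python
# Sketch only: assumes the default XLOG_BLCKSZ (8192), a 24-byte XLogRecord
# header and 8-byte (MAXALIGN) record alignment. All names are hypothetical.
XLOG_BLCKSZ = 8192
RECORD_HEADER_SIZE = 24


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'X/Y' hex form to a plain integer byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def bytes_left_in_page(lsn: int) -> int:
    """Bytes between lsn and the end of the WAL page that contains it."""
    return XLOG_BLCKSZ - (lsn % XLOG_BLCKSZ)


def header_would_split(lsn: int) -> bool:
    """True if a record header starting at lsn would cross a page boundary."""
    return bytes_left_in_page(lsn) < RECORD_HEADER_SIZE
```

Under those assumptions, the only start positions where the header gets split are the ones with 8 or 16 bytes left in the page, which is why some kind of calibration loop is needed to land on them.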
I wonder what people think about putting internal details of the WAL
format into a Perl test like this. Obviously it requires maintenance,
since it knows the size and layout of a few things. I guess it'd be
allowed to fish a couple of those numbers out of the source.
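
To make the maintenance concern concrete, here is a rough sketch of the kind of layout knowledge such a test ends up hard-coding. It is Python rather than Perl, and the field order, sizes, and padding are my understanding of XLogRecord and XLogPageHeaderData on a little-endian 64-bit build; treat them as assumptions to check against xlogrecord.h and xlog_internal.h. The helper names are invented.

```python
import struct

# Assumed layouts (little-endian, 64-bit build); verify against the source.
# XLogRecord: xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, 2 pad bytes, xl_crc
RECORD_HEADER = struct.Struct("<IIQBBxxI")
# XLogPageHeaderData: xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len,
# followed by 4 bytes of alignment padding
PAGE_HEADER = struct.Struct("<HHIQI4x")

XLP_FIRST_IS_CONTRECORD = 0x0001


def build_record_header(xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, xl_crc):
    """Pack a (possibly deliberately bogus) record header."""
    return RECORD_HEADER.pack(xl_tot_len, xl_xid, xl_prev, xl_info, xl_rmid, xl_crc)


def build_page_header(xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len):
    """Pack a (possibly deliberately bogus) short page header."""
    return PAGE_HEADER.pack(xlp_magic, xlp_info, xlp_tli, xlp_pageaddr, xlp_rem_len)


def overwrite(buf: bytearray, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes of buf in place, starting at offset."""
    buf[offset:offset + len(data)] = data
```

A case like "xl_tot_len zero" would then amount to building a header with xl_tot_len set to 0 and writing it at the chosen position in the segment. If any of these structs change shape, the format strings have to change with them, which is the maintenance cost in question.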
Work in progress... I'm sure more useful checks could be added. One
thing that occurred to me while thinking about all this is that the
'treat malloc failure as end of WAL' thing you highlighted in another
thread is indeed completely bananas -- I didn't go digging, but
perhaps it was an earlier solution to the very same garbage xl_tot_len
problem, before 70b4f82a4b5 and now this xlp_rem_len-based solution?