Re: 039_end_of_wal: error in "xl_tot_len zero" test - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: 039_end_of_wal: error in "xl_tot_len zero" test |
Date | |
Msg-id | CA+hUKGL47fv64v6kXyz5t69yQdhp2KKFe4eR63Sk7Aitzxv=_A@mail.gmail.com Whole thread Raw |
In response to | 039_end_of_wal: error in "xl_tot_len zero" test (Anton Voloshin <a.voloshin@postgrespro.ru>) |
Responses |
Re: 039_end_of_wal: error in "xl_tot_len zero" test
Re: 039_end_of_wal: error in "xl_tot_len zero" test |
List | pgsql-hackers |
On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin <a.voloshin@postgrespro.ru> wrote: > I believe there is a small problem in the 039_end_of_wal.pl's > "xl_tot_len zero" test. It supposes that after immediate shutdown the > server, upon startup recovery, should always produce a message matching > "invalid record length at .*: wanted 24, got 0". However, if the > circumstances are just right and we happened to hit exactly on the edge > of WAL page, then the message on startup recovery would be "invalid > magic number 0000 in log segment .*, offset .*". The test does not take > that into account. Hi Anton, Thanks for figuring this out! Right, I see. I will look more closely when I'm back from summer vacation in a few days, but first reaction: > Now, reproducing this is somewhat tricky, because exact position in WAL > at the test time depends on what exactly initdb did, and that not only > differs in different major verisons, but also depends on e.g. username > length, locales available, and, perhaps, more. Even though originally > this problem was found "in the wild" on one particular system on one > particular code branch, I've written small helper patch to make > reproduction on master easier, see > 0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch. > > This patch adds single emit_message of (hopefully) the right size to > make sure we hit end of WAL block right by the time we call > $node->stop('immediate') in "xl_tot_len zero" test. With this patch, > "xl_tot_len zero" test fails every time because the server writes > "invalid magic number 0000 in log segment" while the test still only > expects "invalid record length at .*: wanted 24, got 0". If course, this > 0001 patch is *not* meant to be committed, but only as an issue > reproduction helper. > > I can think of two possible fixes: > > 1. Update advance_out_of_record_splitting_zone to also avoid stopping at > exactly the block end: > > my $page_offset = $end_lsn % $WAL_BLOCK_SIZE; > - while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold) > + while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || > $page_offset <= $SizeOfXLogShortPHD) > { > see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch > > We need to compare with $SizeOfXLogShortPHD (and not with zero) because > at that point, even though we didn't actually write out new WAL page > yet, it's header is already in place in memory and taken in account > for LSN reporting. I like the fact that this preserves the same end-of-WAL case that we're trying to test. I don't yet have an opinion on the best way to do it though. Would it be enough to add emit_message($node, 0) after advance_out_of_record_splitting_zone()? The thing about this one specific test that is different from the later ones is that it doesn't actually write a record header at all, it was relying purely on pre-existing trailing zeroes, but it assumed the page header would be valid. As you figured out, that isn't true if we were right on the page boundary. Perhaps advance_out_of_record_splitting_zone() followed by emit_message(0) would make that always true, even then? > 2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic > number 0000 in WAL segment" message as well: > > $node->start; > ok( $node->log_contains( > + "invalid magic number 0000 in WAL segment|" . > "invalid record length at .*: expected at least 24, got 0", > $log_size > ), > see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch Tolerating the two different messages would weaken the test. > I think it makes sense to backport whatever the final change would be to > all branches with 039_end_of_wal (REL_12+). +1
pgsql-hackers by date: