Re: 039_end_of_wal: error in "xl_tot_len zero" test - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: 039_end_of_wal: error in "xl_tot_len zero" test
Date
Msg-id CA+hUKGL47fv64v6kXyz5t69yQdhp2KKFe4eR63Sk7Aitzxv=_A@mail.gmail.com
Whole thread Raw
In response to 039_end_of_wal: error in "xl_tot_len zero" test  (Anton Voloshin <a.voloshin@postgrespro.ru>)
Responses Re: 039_end_of_wal: error in "xl_tot_len zero" test
Re: 039_end_of_wal: error in "xl_tot_len zero" test
List pgsql-hackers
On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin
<a.voloshin@postgrespro.ru> wrote:
> I believe there is a small problem in the 039_end_of_wal.pl's
> "xl_tot_len zero" test. It supposes that after immediate shutdown the
> server, upon startup recovery, should always produce a message matching
> "invalid record length at .*: wanted 24, got 0". However, if the
> circumstances are just right and we happened to hit exactly on the edge
> of WAL page, then the message on startup recovery would be "invalid
> magic number 0000 in log segment .*, offset .*". The test does not take
> that into account.

Hi Anton,

Thanks for figuring this out!  Right, I see.  I will look more closely
when I'm back from summer vacation in a few days, but first reaction:

> Now, reproducing this is somewhat tricky, because exact position in WAL
> at the test time depends on what exactly initdb did, and that not only
> differs in different major verisons, but also depends on e.g. username
> length, locales available, and, perhaps, more. Even though originally
> this problem was found "in the wild" on one particular system on one
> particular code branch, I've written small helper patch to make
> reproduction on master easier, see
> 0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.
>
> This patch adds single emit_message of (hopefully) the right size to
> make sure we hit end of WAL block right by the time we call
> $node->stop('immediate') in "xl_tot_len zero" test. With this patch,
> "xl_tot_len zero" test fails every time because the server writes
> "invalid magic number 0000 in log segment" while the test still only
> expects "invalid record length at .*: wanted 24, got 0". If course, this
> 0001 patch is *not* meant to be committed, but only as an issue
> reproduction helper.
>
> I can think of two possible fixes:
>
> 1. Update advance_out_of_record_splitting_zone to also avoid stopping at
>     exactly the block end:
>
>       my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
> -    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
> +    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold ||
> $page_offset <= $SizeOfXLogShortPHD)
>       {
> see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch
>
> We need to compare with $SizeOfXLogShortPHD (and not with zero) because
> at that point, even though we didn't actually write out new WAL page
> yet, it's header is already in place in memory and taken in account
> for LSN reporting.

I like the fact that this preserves the same end-of-WAL case that
we're trying to test.  I don't yet have an opinion on the best way to
do it though.  Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?  The thing about this one
specific test that is different from the later ones is that it doesn't
actually write a record header at all, it was relying purely on
pre-existing trailing zeroes, but it assumed the page header would be
valid.  As you figured out, that isn't true if we were right on the
page boundary.  Perhaps advance_out_of_record_splitting_zone()
followed by emit_message(0) would make that always true, even then?

> 2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
>     number 0000 in WAL segment" message as well:
>
>   $node->start;
>   ok( $node->log_contains(
> +        "invalid magic number 0000 in WAL segment|" .
>           "invalid record length at .*: expected at least 24, got 0",
> $log_size
>       ),
> see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

Tolerating the two different messages would weaken the test.

> I think it makes sense to backport whatever the final change would be to
> all branches with 039_end_of_wal (REL_12+).

+1



pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: Built-in CTYPE provider
Next
From: Peter Smith
Date:
Subject: Re: minor replication slot docs edits