039_end_of_wal: error in "xl_tot_len zero" test - Mailing list pgsql-hackers

From Anton Voloshin
Subject 039_end_of_wal: error in "xl_tot_len zero" test
Date
Msg-id b26aeac2-cb6d-4633-a7ea-945baae83dcf@postgrespro.ru
Whole thread Raw
Responses Re: 039_end_of_wal: error in "xl_tot_len zero" test
List pgsql-hackers
Hello, hackers,

I believe there is a small problem in the 039_end_of_wal.pl's 
"xl_tot_len zero" test. It supposes that after immediate shutdown the 
server, upon startup recovery, should always produce a message matching 
"invalid record length at .*: wanted 24, got 0". However, if the 
circumstances are just right and we happened to hit exactly on the edge 
of WAL page, then the message on startup recovery would be "invalid 
magic number 0000 in log segment .*, offset .*". The test does not take 
that into account.

Now, reproducing this is somewhat tricky, because exact position in WAL 
at the test time depends on what exactly initdb did, and that not only 
differs in different major verisons, but also depends on e.g. username 
length, locales available, and, perhaps, more. Even though originally 
this problem was found "in the wild" on one particular system on one 
particular code branch, I've written small helper patch to make 
reproduction on master easier, see 
0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.

This patch adds single emit_message of (hopefully) the right size to 
make sure we hit end of WAL block right by the time we call 
$node->stop('immediate') in "xl_tot_len zero" test. With this patch, 
"xl_tot_len zero" test fails every time because the server writes 
"invalid magic number 0000 in log segment" while the test still only 
expects "invalid record length at .*: wanted 24, got 0". If course, this 
0001 patch is *not* meant to be committed, but only as an issue 
reproduction helper.

I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
    exactly the block end:

      my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+    while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || 
$page_offset <= $SizeOfXLogShortPHD)
      {
see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch

We need to compare with $SizeOfXLogShortPHD (and not with zero) because 
at that point, even though we didn't actually write out new WAL page
yet, it's header is already in place in memory and taken in account
for LSN reporting.

2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
    number 0000 in WAL segment" message as well:

  $node->start;
  ok( $node->log_contains(
+        "invalid magic number 0000 in WAL segment|" .
          "invalid record length at .*: expected at least 24, got 0", 
$log_size
      ),
see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

I think it makes sense to backport whatever the final change would be to 
all branches with 039_end_of_wal (REL_12+).

Any thoughts?

Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru
Attachment

pgsql-hackers by date:

Previous
From: torikoshia
Date:
Subject: Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Next
From: Peter Eisentraut
Date:
Subject: Re: Built-in CTYPE provider