Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers

From Tom Lane
Subject Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Msg-id 4015413.1649454951@sss.pgh.pa.us
In response to Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Andres Freund <andres@anarazel.de>)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.
>> However, assuming it is real, I guess I wonder why we don't say
>> CHECKPOINT_FORCE in standby mode too.

> I guess it might partially be that restartpoints require a checkpoint to have
> happened on the primary. If we used FORCE, we'd have to wait till the next
> checkpoint on the primary, which'd be a problem if it's e.g. a manually issued
> CHECKPOINT; before shutting the standby down.

After seeing skink's results, I tried running that test under valgrind
here, and it fails just like that every time.  skink's history allows
us to bound the failure introduction between 79b716cfb7 and
d7ab2a9a3c, which I think makes it just about certain that it was
5dc0418fab (Prefetch data referenced by the WAL, take II), though I've
not bisected to be 100% sure.

Adding some debug printouts to ExecuteRecoveryCommand convinces me
that indeed the archive_cleanup_command is NOT getting called by the
problematic CHECKPOINT command.  I surmise based on Andres' comment
above that the standby isn't making a restartpoint for lack of
an available primary checkpoint, which looks to me like it could be
a pre-existing bug in the test case: it's sure not doing anything to
guarantee that the primary's checkpoint record has reached the standby.
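One way to add such a guarantee (a sketch only, using the standard wait_for_catchup helper from PostgreSQL::Test::Cluster; the attached patch instead just moves the pg_current_wal_lsn() capture after the CHECKPOINT) would be to wait until the standby has replayed past the checkpoint record:

```perl
# Hypothetical variant for 002_archiving.pl: capture the LSN right after
# the CHECKPOINT, then block until the standby has replayed up to it,
# so the primary's checkpoint record has definitely reached the standby
# before the standby-side CHECKPOINT is issued.
$node_primary->safe_psql('postgres', "CHECKPOINT");
my $checkpoint_lsn =
  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
$node_primary->wait_for_catchup($node_standby, 'replay', $checkpoint_lsn);
```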

I tried adjusting the patch so it does guarantee that (as attached),
and in two out of two tries it got past the archive_cleanup_command
failure but then hung up waiting for standby2 to promote.

On the whole, I'm not sure that the WAL prefetch logic is noticeably
more stable than when we booted it out last year :-(.  However, I also
wonder why it is that this test case wasn't occasionally failing already.

            regards, tom lane

diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index c8f5ffbaf0..1032d8a388 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -44,13 +44,14 @@ $node_standby->start;
 # Create some content on primary
 $node_primary->safe_psql('postgres',
     "CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-my $current_lsn =
-  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");

 # Note the presence of this checkpoint for the archive_cleanup_command
 # check done below, before switching to a new segment.
 $node_primary->safe_psql('postgres', "CHECKPOINT");

+my $current_lsn =
+  $node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+
 # Force archiving of WAL file to make it present on primary
 $node_primary->safe_psql('postgres', "SELECT pg_switch_wal()");

