Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
Date
Msg-id YlT23IvsXkGuLzFi@paquier.xyz
Whole thread Raw
In response to Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Thomas Munro <thomas.munro@gmail.com>)
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Mon, Apr 11, 2022 at 06:48:58PM +1200, Thomas Munro wrote:
> 1.  This test had some pre-existing bugs/races, which hadn't failed
> before due to scheduling, even under Valgrind.  The above changes
> appear to fix those problems.  To Michael for comment.

Yeah, there are two problems here.  From what I can see, ensuring the
execution of archive_cleanup_command on the standby needs the
checkpoint on the primary and the restart point on the standby.  So
pg_current_wal_lsn() should be located after the primary's checkpoint
and not before it so as we are sure that the checkpoint records finds
its way to the standby.  That's what Tom mentioned upthread.

The second problem is to make sure that $standby2 sees the promotion
of $standby and its history file, but we also want to recover
00000002.history from some archives to create a RECOVERYHISTORY at
recovery for the purpose of the test.  Switching to a new segment as
proposed by Andres does not seem completely right to me because we are
not 100% sure of the ordering an archive is going to happen, no?  I
think that the logic to create $standby2 from the initial backup of
the primary is right, because there is no 00000002.history in it, but
we also need to be sure that 00000002.history has been archived once
the promotion of $standby is done.  This can be validated thanks to
the logs, actually.

>> What is that second test really testing?
>>
>> # Check the presence of temporary files specifically generated during
>> # archive recovery.  To ensure the presence of the temporary history
>> # file, switch to a timeline large enough to allow a standby to recover
>> # a history file from an archive.  As this requires at least two timeline
>> # switches, promote the existing standby first.  Then create a second
>> # standby based on the promoted one.  Finally, the second standby is
>> # promoted.
>>
>> Note "Then create a second standby based on the promoted one." - but that's
>> not actually what's happening:
>
> 2.  There may also be other problems with the test but those aren't
> relevant to skink's failure, which starts on the 5th test.  To Michael
> for comment.

This comes from df86e52, where we want to recovery a history file that
would be created as RECOVERYHISTORY and make sure that the file gets
removed at the end of recovery.  So $standby2 should choose a new
timeline different from the one of chosen by $standby.  Looking back
at what has been done, it seems to me that the comment is the
incorrect part:
https://www.postgresql.org/message-id/20190930080340.GO2888@paquier.xyz

All that stuff leads me to the attached.  Thoughts?
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: API stability
Next
From: Amit Kapila
Date:
Subject: Re: PG DOCS - logical replication filtering