Re: Should the archiver process always make sure that the timeline history files exist in the archive? - Mailing list pgsql-hackers

From Jimmy Yih
Subject Re: Should the archiver process always make sure that the timeline history files exist in the archive?
Msg-id BYAPR05MB645461214A671F584EEA734ABD15A@BYAPR05MB6454.namprd05.prod.outlook.com
In response to Should the archiver process always make sure that the timeline history files exist in the archive?  (Jimmy Yih <jyih@vmware.com>)
Responses Re: Should the archiver process always make sure that the timeline history files exist in the archive?
List pgsql-hackers

Hello pgsql-hackers,

After some more debugging, I believe this issue might be a minor regression
from commit 5332b8cec541. Prior to that commit, the archiver process, when
first started on a previously promoted primary, would have had all the timeline
history files marked as ready for immediate archiving. Had that happened, none
of the failure scenarios I described would be possible (barring someone
manually deleting the timeline history files). With that in mind, I looked
further into my Question 1 and created a patch proposal. The attached patch
tries to archive the current timeline history file when the archiver process
starts up, if it has not been archived yet.
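
As a rough illustration of the idea only (a Python model, not the attached
patch and not PostgreSQL source), the startup check could look like the sketch
below. It assumes the usual on-disk conventions: timeline history files are
named with the timeline id as eight hex digits plus ".history", and archive
status markers live in pg_wal/archive_status/ with ".ready" and ".done"
suffixes. The function names are hypothetical.

```python
from pathlib import Path


def tl_history_file_name(tli: int) -> str:
    """Return the history file name for a timeline id, e.g. 2 -> 00000002.history."""
    return f"{tli:08X}.history"


def mark_history_ready_if_unarchived(pg_wal: Path, tli: int) -> bool:
    """Hypothetical model of the proposed archiver startup check.

    Creates a .ready marker for the current timeline's history file when no
    .ready or .done marker exists yet. Returns True if a marker was created.
    """
    if tli <= 1:
        # Timeline 1 has no history file, so there is nothing to archive.
        return False

    name = tl_history_file_name(tli)
    if not (pg_wal / name).exists():
        # Nothing to archive if the history file itself is missing.
        return False

    status_dir = pg_wal / "archive_status"
    ready = status_dir / (name + ".ready")
    done = status_dir / (name + ".done")
    if ready.exists() or done.exists():
        # Already queued for archiving, or already archived.
        return False

    status_dir.mkdir(exist_ok=True)
    ready.touch()
    return True
```

Calling this model twice for the same timeline creates the marker only once,
which mirrors the "if it has not been archived yet" condition above.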

Regards,
Jimmy Yih

________________________________________
From: Jimmy Yih <jyih@vmware.com>
Sent: Wednesday, August 9, 2023 5:00 PM
To: pgsql-hackers@postgresql.org
Subject: Should the archiver process always make sure that the timeline history files exist in the archive?

Hello pgsql-hackers,

While testing some WAL archiving and PITR scenarios, I observed that enabling
WAL archiving for the first time on a primary that is on a timeline higher than
1 does not initially archive the timeline history file for its current
timeline. While this might be okay for most use cases, there are scenarios
where it leads to unexpected failures that seem to expose some flaws in the
logic.

Scenario 1:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Create a
standby from that backup that continuously restores from the WAL archives; the
standby will not contain the timeline 2 history file. The standby will operate
normally, but if you try to create a cascading standby off it using streaming
replication, the cascading standby's WAL receiver will repeatedly fail with
FATAL while requesting the timeline 2 history file, which the main standby does
not have.

Scenario 2:
Take a backup of a primary on timeline 2 with `pg_basebackup -Xnone`. Then try
to create a new node by doing PITR with recovery_target_timeline set to
'current' or 'latest'; this will succeed. However, doing PITR with
recovery_target_timeline = '2' will fail because the timeline 2 history file
cannot be found in the WAL archives. This seems somewhat contradictory: we
allow 'current' and 'latest' to recover, but explicitly setting
recovery_target_timeline to the control file's timeline id ends in failure.

Attached is a patch containing two TAP tests that demonstrate the scenarios.

My questions are:
1. Why doesn't the archiver process try to archive timeline history files when
   WAL archiving is first configured and/or continually check (maybe when the
   archiver process gets started before the main loop)?
2. Why does explicitly setting the recovery_target_timeline to the control
   file's timeline id not follow the same logic as recovery_target_timeline set
   to 'current'?
3. Why does a cascaded standby require the timeline history file of its control
   file's timeline id (startTLI) when the main replica is able to operate fine
   without the timeline history file?

Note that my initial observations came from testing with pgBackRest (copying
pg_wal/ during backup is disabled by default) but using `pg_basebackup -Xnone`
reproduced the issues similarly and is what I present in the TAP tests. At the
moment, the only workaround I can think of is to manually run the
archive_command on the missing timeline history file(s).
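
To make the manual workaround concrete, here is a minimal Python sketch that
expands an archive_command template for one history file and runs it. It
relies only on the documented archive_command placeholders (%p for the path of
the file to archive, %f for the file name, %% for a literal percent sign). The
helper names and the example paths are hypothetical, and the sketch does not
touch archive_status.

```python
import subprocess
from pathlib import Path


def expand_archive_command(template: str, path: str, fname: str) -> str:
    """Substitute %p, %f and %% in an archive_command template."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":
                out.append(path)
                i += 2
                continue
            if nxt == "f":
                out.append(fname)
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        # Any other character (or unknown escape) is copied through unchanged.
        out.append(ch)
        i += 1
    return "".join(out)


def run_archive_command(template: str, data_dir: str, rel_path: str) -> int:
    """Run the expanded command from the data directory; return its exit code."""
    cmd = expand_archive_command(template, rel_path, Path(rel_path).name)
    return subprocess.run(cmd, shell=True, cwd=data_dir).returncode
```

For example, with a hypothetical template of "cp %p /mnt/archive/%f", calling
run_archive_command(template, data_dir, "pg_wal/00000002.history") would copy
the timeline 2 history file into the archive directory.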

Are these valid issues that should be looked into, or are they expected?
Scenario 2 seems like it could be easily fixed by detecting that the numeric
recovery_target_timeline value equals the control file's timeline id (compare
rtli and recoveryTargetTLI in validateRecoveryParameters()?). However, I wasn't
sure whether the opposite should be true instead: that 'current' and 'latest'
should also require retrieving the timeline history files, which would help
prevent Scenario 1.
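
The asymmetry in Scenario 2 and the first possible fix can be modelled in a few
lines of Python. This is a simplified sketch of the behaviour described above,
not the PostgreSQL source: the function name, the "archived_history" set
(timeline ids whose history files can be found), and the
"skip_check_for_control_tli" flag are all assumptions made for illustration.

```python
def resolve_target_timeline(goal, control_tli, archived_history,
                            requested=None, skip_check_for_control_tli=False):
    """Model how recovery_target_timeline is resolved.

    goal: "current", "latest" or "numeric".
    control_tli: timeline id taken from the control file.
    archived_history: set of timeline ids whose history file is available.
    skip_check_for_control_tli: models the possible fix, where a numeric
        target equal to the control file's timeline needs no history file.
    """
    if goal == "numeric":
        # Timeline 1 never has a history file; any other numeric target
        # normally requires one.
        needs_history = requested != 1
        if skip_check_for_control_tli and requested == control_tli:
            needs_history = False
        if needs_history and requested not in archived_history:
            raise RuntimeError(
                f"recovery target timeline {requested} does not exist")
        return requested

    if goal == "latest":
        # Probe upward from the control file's timeline until a history
        # file is missing; no history file is needed for control_tli itself.
        newest = control_tli
        probe = control_tli + 1
        while probe in archived_history:
            newest = probe
            probe += 1
        return newest

    # "current": use the control file's timeline as-is.
    return control_tli
```

With an empty archive and a control file on timeline 2, the model returns 2 for
"current" and "latest" but raises for a numeric target of 2, unless the
hypothetical flag is set.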

Regards,
Jimmy Yih

Attachment
