Hello,
A colleague of mine reported an unexpected behavior.
When a production cluster is in crash recovery, e.g. after a backend has been
killed, the WAL files that are ready to be archived are removed before being
archived.
See the attached reproduction script "non-arch-wal-on-recovery.bash".
This behavior was introduced in commit 78ea8b5daab9237fd42d7a8a836c1c451765499f.
XLogArchiveCheckDone() wrongly treats a production cluster in crash recovery
as a standby without archive_mode=always, so the check concludes that the WAL
segment can safely be removed:
bool        inRecovery = RecoveryInProgress();

/*
 * The file is always deletable if archive_mode is "off".  On standbys
 * archiving is disabled if archive_mode is "on", and enabled with
 * "always".  On a primary, archiving is enabled if archive_mode is "on"
 * or "always".
 */
if (!((XLogArchivingActive() && !inRecovery) ||
      (XLogArchivingAlways() && inRecovery)))
    return true;
Please find attached a patch that fixes this issue by using the following test
instead:
if (!((XLogArchivingActive() && !StandbyModeRequested) ||
      (XLogArchivingAlways() && inRecovery)))
    return true;
I'm not sure whether we should rely on StandbyModeRequested for the second part
of the test as well, though. What was the point of relying on
RecoveryInProgress() to get the recovery status from shared memory?
Regards,