[BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs

From Jehan-Guillaume de Rorthais
Subject [BUG] non archived WAL removed during production crash recovery
Date
Msg-id 20200331172229.40ee00dc@firost
Responses Re: [BUG] non archived WAL removed during production crash recovery  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-bugs
Hello,

A colleague of mine reported an unexpected behavior.

When a production cluster is in crash recovery, e.g. after killing a backend,
the WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior was introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
The function XLogArchiveCheckDone() wrongly considers a production cluster in
crash recovery to be a standby without archive_mode=always, so the check
concludes that the WAL can be removed safely.

  bool inRecovery = RecoveryInProgress();
  
  /*
   * The file is always deletable if archive_mode is "off".  On standbys
   * archiving is disabled if archive_mode is "on", and enabled with
   * "always".  On a primary, archiving is enabled if archive_mode is "on"
   * or "always".
   */
  if (!((XLogArchivingActive() && !inRecovery) ||
        (XLogArchivingAlways() && inRecovery)))
      return true;

Please find attached a patch that fixes this issue by using the following test
instead:

  if (!((XLogArchivingActive() && !StandbyModeRequested) ||
        (XLogArchivingAlways() && inRecovery)))
      return true;

I'm not sure whether we should rely on StandbyModeRequested for the second part
of the test as well, though. What was the point of relying on
RecoveryInProgress() to get the recovery status from shared memory?

Regards,

Attachment
