Re: [BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs

From Jehan-Guillaume de Rorthais
Subject Re: [BUG] non archived WAL removed during production crash recovery
Date
Msg-id 20200403182625.0fccc6fd@firost
Whole thread Raw
In response to Re: [BUG] non archived WAL removed during production crash recovery  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: [BUG] non archived WAL removed during production crash recovery
List pgsql-bugs
On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
>> On Thu, 2 Apr 2020 19:38:59 +0900
>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
[...]
>>> That is, WAL files with .ready files are removed when either
>>> archive_mode!=always in standby mode or archive_mode=off.  
>> 
>> sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
    [...]
    if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
          (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
          (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
        return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

> BTW, now I'm thinking that the flag in shmem should be updated when
> the startup process sets StandbyModeRequested to true at the beginning
> of the recovery. That is,
> 
> - Add something like SharedStandbyModeRequested into XLogCtl. This field
>     should be initialized with false;
> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
>     process detects the standby.signal file and sets the local variable
>     StandbyModeRequested to true.
> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
>     to know whether the server is in standby mode or not.
> 
> Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle differences
around the same subject makes it a bit trappy and complicated to understand.

If my proposal is rejected, I'll be happy to volunteer to add
XLogCtl->SharedStandbyModeRequested though. It seems like a simple enough fix
as well.

Regards,

Attachment

pgsql-bugs by date:

Previous
From: Dmitry Dolgov
Date:
Subject: Re: BUG #16325: Assert failure on partitioning by int for a textvalue with a collation
Next
From: Fujii Masao
Date:
Subject: Re: [BUG] non archived WAL removed during production crash recovery