Thread: [BUG] non archived WAL removed during production crash recovery

[BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

31 March 2020, 15:22:29

Hello,

A colleague of mine reported an expected behavior.

On production cluster is in crash recovery, eg. after killing a backend, the
WALs ready to be archived are removed before being archived.

See in attachment the reproduction script "non-arch-wal-on-recovery.bash".

This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
Function XLogArchiveCheckDone() badly consider the in crashed recovery
production cluster as a standby without archive_mode=always. So the check
conclude the WAL can be removed safely.

  bool inRecovery = RecoveryInProgress();
  
  /*
   * The file is always deletable if archive_mode is "off".  On standbys
   * archiving is disabled if archive_mode is "on", and enabled with
   * "always".  On a primary, archiving is enabled if archive_mode is "on"
   * or "always".
   */
  if (!((XLogArchivingActive() && !inRecovery) ||
        (XLogArchivingAlways() && inRecovery)))
      return true;

Please find in attachment a patch that fix this issue using the following test
instead:

  if (!((XLogArchivingActive() && !StandbyModeRequested) ||
        (XLogArchivingAlways() && inRecovery)))
      return true;

I'm not sure if we should rely on StandbyModeRequested for the second part of
the test as well thought. What was the point to rely on RecoveryInProgress() to
get the recovery status from shared mem?

Regards,

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

01 April 2020, 08:27:22


On 2020/04/01 0:22, Jehan-Guillaume de Rorthais wrote:
> Hello,
> 
> A colleague of mine reported an expected behavior.
> 
> On production cluster is in crash recovery, eg. after killing a backend, the
> WALs ready to be archived are removed before being archived.
> 
> See in attachment the reproduction script "non-arch-wal-on-recovery.bash".
> 
> This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
> Function XLogArchiveCheckDone() badly consider the in crashed recovery
> production cluster as a standby without archive_mode=always. So the check
> conclude the WAL can be removed safely.

Thanks for the report! Yeah, this seems a bug.

>    bool inRecovery = RecoveryInProgress();
>    
>    /*
>     * The file is always deletable if archive_mode is "off".  On standbys
>     * archiving is disabled if archive_mode is "on", and enabled with
>     * "always".  On a primary, archiving is enabled if archive_mode is "on"
>     * or "always".
>     */
>    if (!((XLogArchivingActive() && !inRecovery) ||
>          (XLogArchivingAlways() && inRecovery)))
>        return true;
> 
> Please find in attachment a patch that fix this issue using the following test
> instead:
> 
>    if (!((XLogArchivingActive() && !StandbyModeRequested) ||
>          (XLogArchivingAlways() && inRecovery)))
>        return true;
> 
> I'm not sure if we should rely on StandbyModeRequested for the second part of
> the test as well thought. What was the point to rely on RecoveryInProgress() to
> get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call. So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.
Another is to make the startup process remove .ready file if necessary.

If it's not easy to fix the issue, we might need to just revert the commit
that introduced the issue at first.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

01 April 2020, 16:17:35

On Wed, 1 Apr 2020 17:27:22 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
[...]
> >    bool inRecovery = RecoveryInProgress();
> >    
> >    /*
> >     * The file is always deletable if archive_mode is "off".  On standbys
> >     * archiving is disabled if archive_mode is "on", and enabled with
> >     * "always".  On a primary, archiving is enabled if archive_mode is "on"
> >     * or "always".
> >     */
> >    if (!((XLogArchivingActive() && !inRecovery) ||
> >          (XLogArchivingAlways() && inRecovery)))
> >        return true;
> > 
> > Please find in attachment a patch that fix this issue using the following
> > test instead:
> > 
> >    if (!((XLogArchivingActive() && !StandbyModeRequested) ||
> >          (XLogArchivingAlways() && inRecovery)))
> >        return true;
> > 
> > I'm not sure if we should rely on StandbyModeRequested for the second part
> > of the test as well thought. What was the point to rely on
> > RecoveryInProgress() to get the recovery status from shared mem?  
> 
> Since StandbyModeRequested is the startup process-local variable,
> it basically cannot be used in XLogArchiveCheckDone() that other process
> may call.

Ok, you answered my wondering about using recovery status from shared mem. This
was obvious. Thanks for your help!

I was wondering if we could use "ControlFile->state != DB_IN_CRASH_RECOVERY".
It seems fine during crash recovery as the control file is updated before the
checkpoint, but it doesn't feel right for other code paths where the control
file might not be up-to-date on filesystem, right ?

> So another approach would be necessary... One straight idea is to
> add new shmem flag indicating whether we are in standby mode or not.

I was thinking about setting XLogCtlData->SharedRecoveryInProgress as an enum
using:

  enum RecoveryState
  {
    NOT_IN_RECOVERY = 0
    IN_CRASH_RECOVERY,
    IN_ARCHIVE_RECOVERY
  }

Please, find in attachment a patch implementing this.

Plus, I added a second commit to add one test in regard with this bug.

> Another is to make the startup process remove .ready file if necessary.

I'm not sure to understand this one.

Regards,

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

02 April 2020, 04:04:43

Hello.

At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> Please, find in attachment a patch implementing this.

The patch partially reintroduces the issue the patch have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

Perhaps checking '.ready' before the checking for archive-mode would
be sufficient.

> Plus, I added a second commit to add one test in regard with this bug.
> 
> > Another is to make the startup process remove .ready file if necessary.
> 
> I'm not sure to understand this one.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

02 April 2020, 04:07:34

Sorry, it was quite ambiguous.

At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> > Please, find in attachment a patch implementing this.
> 
> The patch partially reintroduces the issue the patch have
> fixed. Specifically a standby running a crash recovery wrongly marks a
> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> .ready file.

The patch partially reintroduces the issue the commit 78ea8b5daa have
fixed. Specifically a standby running a crash recovery wrongly marks a
WAL file as ".ready" if it is extant in pg_wal without accompanied by
.ready file.

> Perhaps checking '.ready' before the checking for archive-mode would
> be sufficient.
> 
> > Plus, I added a second commit to add one test in regard with this bug.
> > 
> > > Another is to make the startup process remove .ready file if necessary.
> > 
> > I'm not sure to understand this one.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

02 April 2020, 05:19:15

On 2020/04/02 13:07, Kyotaro Horiguchi wrote:
> Sorry, it was quite ambiguous.
> 
> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in
>>> Please, find in attachment a patch implementing this.
>>
>> The patch partially reintroduces the issue the patch have
>> fixed. Specifically a standby running a crash recovery wrongly marks a
>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>> .ready file.
> 
> The patch partially reintroduces the issue the commit 78ea8b5daa have
> fixed. Specifically a standby running a crash recovery wrongly marks a
> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> .ready file.

On second thought, I think that we should discuss what the desirable
behavior is before the implentation. Especially what's unclear to me
is whether to remove such WAL files in archive recovery case with
archive_mode=on. Those WAL files would be required when recovering
from the backup taken before that archive recovery happens.
So it seems unsafe to remove them in that case.

Therefore, IMO that the patch should change the code so that
no unarchived WAL files are removed not only in crash recovery
but also archive recovery. Thought?

Of course, this change would lead to the issue that the past unarchived
WAL files keep remaining in the case of warm-standby using archive
recovery. But this issue looks unavoidable. If users want to avoid that,
archive_mode should be set to always.

Also I'm a bit wondering if it's really safe to remove such unarchived
WAL files even in the standby case with archive_mode=on. I would need
more time to think that.

>> Perhaps checking '.ready' before the checking for archive-mode would
>> be sufficient.
>>
>>> Plus, I added a second commit to add one test in regard with this bug.
>>>
>>>> Another is to make the startup process remove .ready file if necessary.
>>>
>>> I'm not sure to understand this one.

I was thinking to make the startup process remove such unarchived WAL files
if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
is true.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

02 April 2020, 07:23:23

At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> On 2020/04/02 13:07, Kyotaro Horiguchi wrote:
> > Sorry, it was quite ambiguous.
> > At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote in
> >> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
> >> <jgdr@dalibo.com> wrote in
> >>> Please, find in attachment a patch implementing this.
> >>
> >> The patch partially reintroduces the issue the patch have
> >> fixed. Specifically a standby running a crash recovery wrongly marks a
> >> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> >> .ready file.
> > The patch partially reintroduces the issue the commit 78ea8b5daa have
> > fixed. Specifically a standby running a crash recovery wrongly marks a
> > WAL file as ".ready" if it is extant in pg_wal without accompanied by
> > .ready file.
> 
> On second thought, I think that we should discuss what the desirable
> behavior is before the implentation. Especially what's unclear to me

Agreed.

> is whether to remove such WAL files in archive recovery case with
> archive_mode=on. Those WAL files would be required when recovering
> from the backup taken before that archive recovery happens.
> So it seems unsafe to remove them in that case.

I'm not sure I'm getting the intention correctly, but I think it
responsibility of the operator to provide a complete set of archived
WAL files for a backup.  Could you elaborate what operation steps are
you assuming of?

> Therefore, IMO that the patch should change the code so that
> no unarchived WAL files are removed not only in crash recovery
> but also archive recovery. Thought?

Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
and it should be archived immediately.  My previous mail is written
based on the same thought.

In a very narrow window, if server crashed or killed after a segment
is finished but before marking the file as .ready, the file doesn't
have .ready but should be archived. If we need to get rid of such a
window, it would help to mark a WAL file as ".busy" at creation time.

> Of course, this change would lead to the issue that the past
> unarchived
> WAL files keep remaining in the case of warm-standby using archive
> recovery. But this issue looks unavoidable. If users want to avoid
> that,
> archive_mode should be set to always.
> 
> Also I'm a bit wondering if it's really safe to remove such unarchived
> WAL files even in the standby case with archive_mode=on. I would need
> more time to think that.
> 
> >> Perhaps checking '.ready' before the checking for archive-mode would
> >> be sufficient.
> >>
> >>> Plus, I added a second commit to add one test in regard with this bug.
> >>>
> >>>> Another is to make the startup process remove .ready file if
> >>>> necessary.
> >>>
> >>> I'm not sure to understand this one.
> 
> I was thinking to make the startup process remove such unarchived WAL
> files
> if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
> is true.

As mentioned above, I don't understand the point of preserving WAL
files that are either marked as .ready or not marked at all on a
standby with archive_mode=on.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

02 April 2020, 10:38:59


On 2020/04/02 16:23, Kyotaro Horiguchi wrote:
> At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>> On 2020/04/02 13:07, Kyotaro Horiguchi wrote:
>>> Sorry, it was quite ambiguous.
>>> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
>>> <horikyota.ntt@gmail.com> wrote in
>>>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
>>>> <jgdr@dalibo.com> wrote in
>>>>> Please, find in attachment a patch implementing this.
>>>>
>>>> The patch partially reintroduces the issue the patch have
>>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>>> .ready file.
>>> The patch partially reintroduces the issue the commit 78ea8b5daa have
>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>> .ready file.
>>
>> On second thought, I think that we should discuss what the desirable
>> behavior is before the implentation. Especially what's unclear to me
> 
> Agreed.
> 
>> is whether to remove such WAL files in archive recovery case with
>> archive_mode=on. Those WAL files would be required when recovering
>> from the backup taken before that archive recovery happens.
>> So it seems unsafe to remove them in that case.
> 
> I'm not sure I'm getting the intention correctly, but I think it
> responsibility of the operator to provide a complete set of archived
> WAL files for a backup.  Could you elaborate what operation steps are
> you assuming of?

Please imagine the case where you need to do archive recovery
from the database snapshot taken while there are many WAL files
with .ready files. Those WAL files have not been archived yet.
In this case, ISTM those WAL files should not be removed until
they are archived, when archive_mode = on.

>> Therefore, IMO that the patch should change the code so that
>> no unarchived WAL files are removed not only in crash recovery
>> but also archive recovery. Thought?
> 
> Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
> and it should be archived immediately.  My previous mail is written
> based on the same thought.

Ok, so our *current* consensus seems the followings. Right?

- If archive_mode=off, any WAL files with .ready files are removed in
    crash recovery, archive recoery and standby mode.

- If archive_mode=on, WAL files with .ready files are removed only in
    standby mode. In crash recovery and archive recovery cases, they keep
    remaining and would be archived after recovery finishes (i.e., during
    normal processing).

- If archive_mode=always, in crash recovery, archive recovery and
    standby mode, WAL files with .ready files are archived if WAL archiver
    is running.

That is, WAL files with .ready files are removed when either
archive_mode!=always in standby mode or archive_mode=off.

> In a very narrow window, if server crashed or killed after a segment
> is finished but before marking the file as .ready, the file doesn't
> have .ready but should be archived. If we need to get rid of such a
> window, it would help to mark a WAL file as ".busy" at creation time.
> 
>> Of course, this change would lead to the issue that the past
>> unarchived
>> WAL files keep remaining in the case of warm-standby using archive
>> recovery. But this issue looks unavoidable. If users want to avoid
>> that,
>> archive_mode should be set to always.
>>
>> Also I'm a bit wondering if it's really safe to remove such unarchived
>> WAL files even in the standby case with archive_mode=on. I would need
>> more time to think that.
>>
>>>> Perhaps checking '.ready' before the checking for archive-mode would
>>>> be sufficient.
>>>>
>>>>> Plus, I added a second commit to add one test in regard with this bug.
>>>>>
>>>>>> Another is to make the startup process remove .ready file if
>>>>>> necessary.
>>>>>
>>>>> I'm not sure to understand this one.
>>
>> I was thinking to make the startup process remove such unarchived WAL
>> files
>> if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
>> is true.
> 
> As mentioned above, I don't understand the point of preserving WAL
> files that are either marked as .ready or not marked at all on a
> standby with archive_mode=on.

Maybe yes. But I'm not confident about that there is no such case.
Anyway I'm fine to fix the bug based on the above consensus at first.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

02 April 2020, 13:02:34

On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Sorry, it was quite ambiguous.
> 
> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote in 
> > At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote in   
> > > Please, find in attachment a patch implementing this.  
> > 
> > The patch partially reintroduces the issue the patch have
> > fixed. Specifically a standby running a crash recovery wrongly marks a
> > WAL file as ".ready" if it is extant in pg_wal without accompanied by
> > .ready file.  
> 
> The patch partially reintroduces the issue the commit 78ea8b5daa have
> fixed. Specifically a standby running a crash recovery wrongly marks a
> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> .ready file.

As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
only set for production clusters, not standby ones. So the following test
should never catch standby cluster as :

  (XLogArchivingActive() && inRecoveryState != IN_ARCHIVE_RECOVERY)

Forgive me if I'm wrong, but do I miss something?

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

02 April 2020, 13:49:15

On Thu, 2 Apr 2020 19:38:59 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/04/02 16:23, Kyotaro Horiguchi wrote:
> > At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
> > <masao.fujii@oss.nttdata.com> wrote in
[...]
> >> is whether to remove such WAL files in archive recovery case with
> >> archive_mode=on. Those WAL files would be required when recovering
> >> from the backup taken before that archive recovery happens.
> >> So it seems unsafe to remove them in that case.
> >
> > I'm not sure I'm getting the intention correctly, but I think it
> > responsibility of the operator to provide a complete set of archived
> > WAL files for a backup.  Could you elaborate what operation steps are
> > you assuming of?
>
> Please imagine the case where you need to do archive recovery
> from the database snapshot taken while there are many WAL files
> with .ready files. Those WAL files have not been archived yet.
> In this case, ISTM those WAL files should not be removed until
> they are archived, when archive_mode = on.

If you rely on snapshot without pg_start/stop_backup, I agree. Theses WAL
should be archived if:

* archive_mode >= on for primary
* archive_mode = always for standby

> >> Therefore, IMO that the patch should change the code so that
> >> no unarchived WAL files are removed not only in crash recovery
> >> but also archive recovery. Thought?
> >
> > Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
> > and it should be archived immediately.  My previous mail is written
> > based on the same thought.
>
> Ok, so our *current* consensus seems the followings. Right?
>
> - If archive_mode=off, any WAL files with .ready files are removed in
>     crash recovery, archive recoery and standby mode.

yes

> - If archive_mode=on, WAL files with .ready files are removed only in
>     standby mode. In crash recovery and archive recovery cases, they keep
>     remaining and would be archived after recovery finishes (i.e., during
>     normal processing).

yes

> - If archive_mode=always, in crash recovery, archive recovery and
>     standby mode, WAL files with .ready files are archived if WAL archiver
>     is running.

yes

> That is, WAL files with .ready files are removed when either
> archive_mode!=always in standby mode or archive_mode=off.

sounds fine to me.

[...]
> >>>>>> Another is to make the startup process remove .ready file if
> >>>>>> necessary.
> >>>>>
> >>>>> I'm not sure to understand this one.
> >>
> >> I was thinking to make the startup process remove such unarchived WAL
> >> files
> >> if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
> >> is true.

Ok, understood.

> > As mentioned above, I don't understand the point of preserving WAL
> > files that are either marked as .ready or not marked at all on a
> > standby with archive_mode=on.
>
> Maybe yes. But I'm not confident about that there is no such case.

Well, it seems to me that this is what you suggested few paragraph away:

  «.ready files are removed when either archive_mode!=always in standby mode»

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

02 April 2020, 14:55:46


On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Apr 2020 19:38:59 +0900
> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 
>> On 2020/04/02 16:23, Kyotaro Horiguchi wrote:
>>> At Thu, 2 Apr 2020 14:19:15 +0900, Fujii Masao
>>> <masao.fujii@oss.nttdata.com> wrote in
> [...]
>>>> is whether to remove such WAL files in archive recovery case with
>>>> archive_mode=on. Those WAL files would be required when recovering
>>>> from the backup taken before that archive recovery happens.
>>>> So it seems unsafe to remove them in that case.
>>>
>>> I'm not sure I'm getting the intention correctly, but I think it
>>> responsibility of the operator to provide a complete set of archived
>>> WAL files for a backup.  Could you elaborate what operation steps are
>>> you assuming of?
>>
>> Please imagine the case where you need to do archive recovery
>> from the database snapshot taken while there are many WAL files
>> with .ready files. Those WAL files have not been archived yet.
>> In this case, ISTM those WAL files should not be removed until
>> they are archived, when archive_mode = on.
> 
> If you rely on snapshot without pg_start/stop_backup, I agree. Theses WAL
> should be archived if:
> 
> * archive_mode >= on for primary
> * archive_mode = always for standby
> 
>>>> Therefore, IMO that the patch should change the code so that
>>>> no unarchived WAL files are removed not only in crash recovery
>>>> but also archive recovery. Thought?
>>>
>>> Agreed if "an unarchived WAL" means "a WAL file that is marked .ready"
>>> and it should be archived immediately.  My previous mail is written
>>> based on the same thought.
>>
>> Ok, so our *current* consensus seems the followings. Right?
>>
>> - If archive_mode=off, any WAL files with .ready files are removed in
>>      crash recovery, archive recoery and standby mode.
> 
> yes
> 
>> - If archive_mode=on, WAL files with .ready files are removed only in
>>      standby mode. In crash recovery and archive recovery cases, they keep
>>      remaining and would be archived after recovery finishes (i.e., during
>>      normal processing).
> 
> yes
> 
>> - If archive_mode=always, in crash recovery, archive recovery and
>>      standby mode, WAL files with .ready files are archived if WAL archiver
>>      is running.
> 
> yes
> 
>> That is, WAL files with .ready files are removed when either
>> archive_mode!=always in standby mode or archive_mode=off.
> 
> sounds fine to me.
> 
> [...]
>>>>>>>> Another is to make the startup process remove .ready file if
>>>>>>>> necessary.
>>>>>>>
>>>>>>> I'm not sure to understand this one.
>>>>
>>>> I was thinking to make the startup process remove such unarchived WAL
>>>> files
>>>> if archive_mode=on and StandbyModeRequested/ArchiveRecoveryRequested
>>>> is true.
> 
> Ok, understood.
> 
>>> As mentioned above, I don't understand the point of preserving WAL
>>> files that are either marked as .ready or not marked at all on a
>>> standby with archive_mode=on.
>>
>> Maybe yes. But I'm not confident about that there is no such case.
> 
> Well, it seems to me that this is what you suggested few paragraph away:
> 
>    «.ready files are removed when either archive_mode!=always in standby mode»

Yes, so I'm fine with that as the first consensus because the behavior
is obviously better than the current one. *If* the case where no WAL files
should be removed is found, I'd just like to propose the additional patch.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

02 April 2020, 14:58:00


On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:
> On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> 
>> Sorry, it was quite ambiguous.
>>
>> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
>> <horikyota.ntt@gmail.com> wrote in
>>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
>>> <jgdr@dalibo.com> wrote in
>>>> Please, find in attachment a patch implementing this.
>>>
>>> The patch partially reintroduces the issue the patch have
>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>> .ready file.
>>
>> The patch partially reintroduces the issue the commit 78ea8b5daa have
>> fixed. Specifically a standby running a crash recovery wrongly marks a
>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>> .ready file.
> 
> As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY are
> only set for production clusters, not standby ones.

DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
if you start the standby from the cold backup of the primary,
since InArchiveRecovery is false at the beginning of the recovery,
DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
DB_IN_ARCHIVE_RECOVERY is set.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

02 April 2020, 15:37:44

On Thu, 2 Apr 2020 23:58:00 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >   
> >> Sorry, it was quite ambiguous.
> >>
> >> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
> >> <horikyota.ntt@gmail.com> wrote in  
> >>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
> >>> <jgdr@dalibo.com> wrote in  
> >>>> Please, find in attachment a patch implementing this.  
> >>>
> >>> The patch partially reintroduces the issue the patch have
> >>> fixed. Specifically a standby running a crash recovery wrongly marks a
> >>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> >>> .ready file.  
> >>
> >> The patch partially reintroduces the issue the commit 78ea8b5daa have
> >> fixed. Specifically a standby running a crash recovery wrongly marks a
> >> WAL file as ".ready" if it is extant in pg_wal without accompanied by
> >> .ready file.  
> > 
> > As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
> > are only set for production clusters, not standby ones.  
> 
> DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
> if you start the standby from the cold backup of the primary,

In cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?

Unless I'm wrong, this should be catched by:

  if (ArchiveRecoveryRequested && ( [...] ||
     ControlFile->state == DB_SHUTDOWNED))
  {
    InArchiveRecovery = true;
    if (StandbyModeRequested)
        StandbyMode = true;
  }

With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
DB_IN_CRASH_RECOVERY.


> since InArchiveRecovery is false at the beginning of the recovery,
> DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
> WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
> DB_IN_ARCHIVE_RECOVERY is set.

However, I suppose this is true if you restore a backup from a snapshot
without backup_label, right?

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

02 April 2020, 15:44:50

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

[...]
> > Well, it seems to me that this is what you suggested few paragraph away:
> >
> >    «.ready files are removed when either archive_mode!=always in standby
> > mode»
>
> Yes, so I'm fine with that as the first consensus because the behavior
> is obviously better than the current one. *If* the case where no WAL files
> should be removed is found, I'd just like to propose the additional patch.

Do you mean to want to produce the next patch yourself?

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

03 April 2020, 01:14:17

At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in
> On Thu, 2 Apr 2020 23:55:46 +0900
> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> [...]
> > > Well, it seems to me that this is what you suggested few paragraph away:
> > >
> > >    «.ready files are removed when either archive_mode!=always in standby
> > > mode»
> >
> > Yes, so I'm fine with that as the first consensus because the behavior
> > is obviously better than the current one. *If* the case where no WAL files
> > should be removed is found, I'd just like to propose the additional patch.
>
> Do you mean to want to produce the next patch yourself?

No. Fujii-san is saying that he will address it, if the fix made in
this thread is found to be imperfect later.

He suspects that WAL files should be preserved at least in certain
cases even if the file persists forever, but the consensus here is to
remove files under certain conditions so as not to no WAL file
persists in pg_wal directory.

Feel free to propose the next version!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

03 April 2020, 06:44:40


On 2020/04/03 0:37, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Apr 2020 23:58:00 +0900
> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 
>> On 2020/04/02 22:02, Jehan-Guillaume de Rorthais wrote:
>>> On Thu, 02 Apr 2020 13:07:34 +0900 (JST)
>>> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>>>    
>>>> Sorry, it was quite ambiguous.
>>>>
>>>> At Thu, 02 Apr 2020 13:04:43 +0900 (JST), Kyotaro Horiguchi
>>>> <horikyota.ntt@gmail.com> wrote in
>>>>> At Wed, 1 Apr 2020 18:17:35 +0200, Jehan-Guillaume de Rorthais
>>>>> <jgdr@dalibo.com> wrote in
>>>>>> Please, find in attachment a patch implementing this.
>>>>>
>>>>> The patch partially reintroduces the issue the patch have
>>>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>>>> .ready file.
>>>>
>>>> The patch partially reintroduces the issue the commit 78ea8b5daa have
>>>> fixed. Specifically a standby running a crash recovery wrongly marks a
>>>> WAL file as ".ready" if it is extant in pg_wal without accompanied by
>>>> .ready file.
>>>
>>> As far as I understand StartupXLOG(), NOT_IN_RECOVERY and IN_CRASH_RECOVERY
>>> are only set for production clusters, not standby ones.
>>
>> DB_IN_CRASH_RECOVERY can be set even in standby mode. For example,
>> if you start the standby from the cold backup of the primary,
> 
> In cold backup? Then ControlFile->state == DB_SHUTDOWNED, right?
> 
> Unless I'm wrong, this should be catched by:
> 
>    if (ArchiveRecoveryRequested && ( [...] ||
>      ControlFile->state == DB_SHUTDOWNED))
>    {
>     InArchiveRecovery = true;
>     if (StandbyModeRequested)
>         StandbyMode = true;
>    }
> 
> With InArchiveRecovery=true, we later set DB_IN_ARCHIVE_RECOVERY instead of
> DB_IN_CRASH_RECOVERY.

Yes, you're right. So I had to mention one more condition in my
previous email. The condition is that the cold backup was taken from
the server that was shutdowned with immdiate mode. In this case,
the code block that you pointed is skipped and InArchiveRecovery is
not set to true there.

>> since InArchiveRecovery is false at the beginning of the recovery,
>> DB_IN_CRASH_RECOVERY is set in that moment. But then after all the valid
>> WAL in pg_wal have been replayed, InArchiveRecovery is set to true and
>> DB_IN_ARCHIVE_RECOVERY is set.
> 
> However, I suppose this is true if you restore a backup from a snapshot
> without backup_label, right?

Maybe yes.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

03 April 2020, 06:45:31


On 2020/04/03 10:14, Kyotaro Horiguchi wrote:
> At Thu, 2 Apr 2020 17:44:50 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in
>> On Thu, 2 Apr 2020 23:55:46 +0900
>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> [...]
>>>> Well, it seems to me that this is what you suggested few paragraph away:
>>>>
>>>>     «.ready files are removed when either archive_mode!=always in standby
>>>> mode»
>>>
>>> Yes, so I'm fine with that as the first consensus because the behavior
>>> is obviously better than the current one. *If* the case where no WAL files
>>> should be removed is found, I'd just like to propose the additional patch.
>>
>> Do you mean to want to produce the next patch yourself?
> 
> No. Fujii-san is saying that he will address it, if the fix made in
> this thread is found to be imperfect later.
> 
> He suspects that WAL files should be preserved at least in certain
> cases even if the file persists forever, but the consensus here is to
> remove files under certain conditions so as not to no WAL file
> persists in pg_wal directory.

Yes.

> Feel free to propose the next version!

Yes!

BTW, now I'm thinking that the flag in shmem should be updated when
the startup process sets StandbyModeRequested to true at the beginning
of the recovery. That is,

- Add something like SharedStandbyModeRequested into XLogCtl. This field
    should be initialized with false;
- Set XLogCtl->SharedStandbyModeRequested to true when the startup
    process detects the standby.signal file and sets the local variable
    StandbyModeRequested to true.
- Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
    to know whether the server is in standby mode or not.

Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

03 April 2020, 16:26:25

On Thu, 2 Apr 2020 23:55:46 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
>> On Thu, 2 Apr 2020 19:38:59 +0900
>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
[...]
>>> That is, WAL files with .ready files are removed when either
>>> archive_mode!=always in standby mode or archive_mode=off.  
>> 
>> sounds fine to me.

To some extends, it appears to me this sentence was relatively close to my
previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:

XLogArchiveCheckDone(const char *xlog)
{
    [...]
    if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
          (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
          (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
        return true;

Which means that only .done cleanup would occurs during CRASH_RECOVERY
and .ready files might be created if no .done exists. No matter the futur status
of the cluster: primary or standby. Normal shortcut will apply during first
checkpoint after the crash recovery step.

This should handle the case where a backup without backup_label (taken from a
snapshot or after a shutdown with immediate) is restored to build a standby.

Please, find in attachment a third version of my patch
"0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

"0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
untouched. But I'm considering adding some more tests relative to this
discussion.

> BTW, now I'm thinking that the flag in shmem should be updated when
> the startup process sets StandbyModeRequested to true at the beginning
> of the recovery. That is,
> 
> - Add something like SharedStandbyModeRequested into XLogCtl. This field
>     should be initialized with false;
> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
>     process detects the standby.signal file and sets the local variable
>     StandbyModeRequested to true.
> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
>     to know whether the server is in standby mode or not.
> 
> Thought?

I try to avoid a new flag in memory with the proposal in attachment of this
email. It seems to me various combinaisons of booleans with subtle differences
around the same subject makes it a bit trappy and complicated to understand.

If my proposal is rejected, I'll be happy to volunteer to add
XLogCtl->SharedStandbyModeRequested though. It seems like a simple enough fix
as well.

Regards,

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Fujii Masao

Date:

03 April 2020, 17:49:50

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Apr 2020 23:55:46 +0900
> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 
>> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
>>> On Thu, 2 Apr 2020 19:38:59 +0900
>>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> [...]
>>>> That is, WAL files with .ready files are removed when either
>>>> archive_mode!=always in standby mode or archive_mode=off.
>>>
>>> sounds fine to me.
> 
> To some extends, it appears to me this sentence was relatively close to my
> previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:
> 
> XLogArchiveCheckDone(const char *xlog)
> {
>      [...]
>      if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
>            (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
>            (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
>          return true;
> 
> Which means that only .done cleanup would occurs during CRASH_RECOVERY
> and .ready files might be created if no .done exists. No matter the futur status
> of the cluster: primary or standby. Normal shortcut will apply during first
> checkpoint after the crash recovery step.
> 
> This should handle the case where a backup without backup_label (taken from a
> snapshot or after a shutdown with immediate) is restored to build a standby.
> 
> Please, find in attachment a third version of my patch
> "0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-    bool        SharedRecoveryInProgress;
+    RecoveryState    SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, the name of
the variable should be changed? Otherwise, the current name seems
confusing.

+            SpinLockAcquire(&XLogCtl->info_lck);
+            XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+            SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this right behavior that
you're expecting?

Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached. Also when WAL replay is not necessary
(e.g., restart of the server shutdowed cleanly before), GetRecoveryState()
returns IN_ARCHIVE_RECOVERY because this code is not reached. Aren't
these fragile? If XLogArchiveCheckDone() is only user of GetRecoveryState(),
they would be ok. But if another user will appear in the future, it seems
very easy to mistake. At least those behaviors should be commented in
GetRecoveryState().

-    if (!((XLogArchivingActive() && !inRecovery) ||
-          (XLogArchivingAlways() && inRecovery)))
+    if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+          (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) &&
+          (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
        return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

+/* Recovery state */
+typedef enum RecoveryState
+{
+    NOT_IN_RECOVERY = 0,
+    IN_CRASH_RECOVERY,
+    IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Is it really ok not to have the value indicating standby mode?

These enum values names are confusing because the variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
.e.g., by adding the prefix.

> 
> "0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
> untouched. But I'm considering adding some more tests relative to this
> discussion.
> 
>> BTW, now I'm thinking that the flag in shmem should be updated when
>> the startup process sets StandbyModeRequested to true at the beginning
>> of the recovery. That is,
>>
>> - Add something like SharedStandbyModeRequested into XLogCtl. This field
>>      should be initialized with false;
>> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
>>      process detects the standby.signal file and sets the local variable
>>      StandbyModeRequested to true.
>> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
>>      to know whether the server is in standby mode or not.
>>
>> Thought?
> 
> I try to avoid a new flag in memory with the proposal in attachment of this
> email. It seems to me various combinaisons of booleans with subtle differences
> around the same subject makes it a bit trappy and complicated to understand.

Ok, so firstly I try to review your patch!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

07 April 2020, 15:17:36

On Sat, 4 Apr 2020 02:49:50 +0900
Fujii Masao <masao.fujii@oss.nttdata.com> wrote:

> On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 2 Apr 2020 23:55:46 +0900
> > Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >   
> >> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:  
> >>> On Thu, 2 Apr 2020 19:38:59 +0900
> >>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:  
> > [...]  
> >>>> That is, WAL files with .ready files are removed when either
> >>>> archive_mode!=always in standby mode or archive_mode=off.  
> >>>
> >>> sounds fine to me.  
> > 
> > To some extends, it appears to me this sentence was relatively close to my
> > previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:
> > 
> > XLogArchiveCheckDone(const char *xlog)
> > {
> >      [...]
> >      if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
> >            (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
> >            (inRecoveryState == IN_ARCHIVE_RECOVERY
> > && !XLogArchivingAlways()))) return true;
> > 
> > Which means that only .done cleanup would occurs during CRASH_RECOVERY
> > and .ready files might be created if no .done exists. No matter the futur
> > status of the cluster: primary or standby. Normal shortcut will apply
> > during first checkpoint after the crash recovery step.
> > 
> > This should handle the case where a backup without backup_label (taken from
> > a snapshot or after a shutdown with immediate) is restored to build a
> > standby.
> > 
> > Please, find in attachment a third version of my patch
> > "0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".  
> 
> Thanks for updating the patch! Here are my review comments:
> 
> -    bool        SharedRecoveryInProgress;
> +    RecoveryState    SharedRecoveryInProgress;
> 
> Since the patch changes the meaning of this variable, the name of
> the variable should be changed? Otherwise, the current name seems
> confusing.

Indeed, fixed using SharedRecoveryState

> +            SpinLockAcquire(&XLogCtl->info_lck);
> +            XLogCtl->SharedRecoveryInProgress =
> IN_CRASH_RECOVERY;
> +            SpinLockRelease(&XLogCtl->info_lck);
> 
> As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
> can be set even in standby or archive recovery. Is this right behavior that
> you're expecting?

Yes. This patch avoids archive cleanup during crash recovery altogether,
whatever the requested status for the cluster.

> Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
> until this code is reached.

I tried to stick as close as possible with "ControlFile->state" and old
XLogCtl->SharedRecoveryInProgress variables. That's why it's initialized as
IN_ARCHIVE_RECOVERY as XLogCtl->SharedRecoveryInProgress was init as true as
well.

The status itself is set during StartupXLOG when the historical code actually
tries to define and record the real state between DB_IN_ARCHIVE_RECOVERY and
DB_IN_CRASH_RECOVERY.

> Also when WAL replay is not necessary (e.g., restart of the server shutdowed
> cleanly before), GetRecoveryState() returns IN_ARCHIVE_RECOVERY because this
> code is not reached.

It is set to NOT_IN_RECOVERY at the end of StartupXLOG, in the same place we
set ControlFile->state = DB_IN_PRODUCTION. So GetRecoveryState() returns
NOT_IN_RECOVERY as soon as StartupXLOG is done when no WAL replay is necessary.

> Aren't these fragile? If XLogArchiveCheckDone() is only user of
> GetRecoveryState(), they would be ok. But if another user will appear in the
> future, it seems very easy to mistake. At least those behaviors should be
> commented in GetRecoveryState().

We certainly can set SharedRecoveryState earlier, in XLOGShmemInit, based on
the ControlFile->state value. In my understanding, anything different than
DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY can be considered as a crash
recovery. Based on this XLOGShmemInit can init SharedRecoveryState to
IN_ARCHIVE_RECOVERY or IN_CRASH_RECOVERY.

With either values, RecoveryInProgress() keep returning the same result so any
current code relying on RecoveryInProgress() is compatible.

I'm not sure who would need this information before the WAL machinery is up,
but is it safe enough in your opinion for futur usage of GetRecoveryState()? Do
you think this information might be useful before the WAL machinery is set?
Current "user" (eg. restoreTwoPhaseData()) only need to know if we are in
recovery, whatever the reason.

The patch in attachment set SharedRecoveryState to either IN_ARCHIVE_RECOVERY
or IN_CRASH_RECOVERY from XLOGShmemInit based on the ControlFile state. It
feels strange though to set this so far away from ControlFile->state
= DB_IN_CRASH_RECOVERY...

> -    if (!((XLogArchivingActive() && !inRecovery) ||
> -          (XLogArchivingAlways() && inRecovery)))
> +    if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
> +          (inRecoveryState == NOT_IN_RECOVERY
> && !XLogArchivingActive()) &&
> +          (inRecoveryState == IN_ARCHIVE_RECOVERY
> && !XLogArchivingAlways()))) return true;
> 
> The last condition seems to cause XLogArchiveCheckDone() to return
> true in archive recovery mode with archive_mode=on, then cause
> unarchived WAL files with .ready to be removed. Is my understanding right?
> If yes, that behavior doesn't seem to match with our consensus, i.e.,
> WAL files with .ready should not be removed in that case.

We wrote:

  >> That is, WAL files with .ready files are removed when either
  >> archive_mode!=always in standby mode or archive_mode=off.  
  > 
  > sounds fine to me.

So if in standby mode and archive_mode is not "always", the .ready files
should be removed.

In current patch, I split the conditions in the sake of clarity.

> +/* Recovery state */
> +typedef enum RecoveryState
> +{
> +    NOT_IN_RECOVERY = 0,
> +    IN_CRASH_RECOVERY,
> +    IN_ARCHIVE_RECOVERY
> +} RecoveryState;
> 
> Isn't it better to add more comments here? For example, what does
> "Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Explanations added.

> Is it really ok not to have the value indicating standby mode?

Unless I'm wrong, this shared state only indicates why we are recovering WAL,
it does not need reflect the status of the instance in shared memory.
StandbyMode is already available for the startup process. Would it be useful
to share this mode outside of the startup process?

> These enum values names are confusing because the variables with
> similar names already exist. For example, IN_CRASH_RECOVERY vs.
> DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
> .e.g., by adding the prefix.

Well, I lack of imagination. So I picked CRASH_RECOVERING and
ARCHIVE_RECOVERING.

> > "0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
> > untouched. But I'm considering adding some more tests relative to this
> > discussion.
> >   
> >> BTW, now I'm thinking that the flag in shmem should be updated when
> >> the startup process sets StandbyModeRequested to true at the beginning
> >> of the recovery. That is,
> >>
> >> - Add something like SharedStandbyModeRequested into XLogCtl. This field
> >>      should be initialized with false;
> >> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
> >>      process detects the standby.signal file and sets the local variable
> >>      StandbyModeRequested to true.
> >> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
> >>      to know whether the server is in standby mode or not.
> >>
> >> Thought?  
> > 
> > I try to avoid a new flag in memory with the proposal in attachment of this
> > email. It seems to me various combinaisons of booleans with subtle
> > differences around the same subject makes it a bit trappy and complicated
> > to understand.  
> 
> Ok, so firstly I try to review your patch!

Thank you for your review and help!

In attachment:
* fix proposal: 0001-v4-Fix-WAL-retention-during-production-crash-recovery.patch
* add test: 0002-v2-Add-test-on-non-archived-WAL-during-crash-recovery.patch
  0002-v2 fix conflict with current master.

Regards,

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

08 April 2020, 02:23:45

On Tue, Apr 07, 2020 at 05:17:36PM +0200, Jehan-Guillaume de Rorthais wrote:
> I'm not sure who would need this information before the WAL machinery is up,
> but is it safe enough in your opinion for futur usage of GetRecoveryState()? Do
> you think this information might be useful before the WAL machinery is set?
> Current "user" (eg. restoreTwoPhaseData()) only need to know if we are in
> recovery, whatever the reason.

(I had this thread marked as something to look at, but could not.
Sorry for the delay).

>  src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
> index ca6e92b50d..ce2e899891 100644
> --- a/src/test/recovery/t/011_crash_recovery.pl
> +++ b/src/test/recovery/t/011_crash_recovery.pl
> @@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')

May I ask why this new test is added to 011_crash_recovery.pl which is
aimed at testing crash and redo, while we have 002_archiving.pl that
is dedicated to archiving in a more general manner?
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

08 April 2020, 08:39:09

At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> > +/* Recovery state */
> > +typedef enum RecoveryState
> > +{
> > +    NOT_IN_RECOVERY = 0,
> > +    IN_CRASH_RECOVERY,
> > +    IN_ARCHIVE_RECOVERY
> > +} RecoveryState;

I'm not sure the complexity is required here. Are we asuume that
archive_mode can be changed before restarting?

At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> > Ok, so our *current* consensus seems the followings. Right?
> > 
> > - If archive_mode=off, any WAL files with .ready files are removed in
> >     crash recovery, archive recoery and standby mode.
> 
> yes

If archive_mode = off no WAL files are marked as ".ready".

> > - If archive_mode=on, WAL files with .ready files are removed only in
> >     standby mode. In crash recovery and archive recovery cases, they keep
> >     remaining and would be archived after recovery finishes (i.e., during
> >     normal processing).
> 
> yes
>
> > - If archive_mode=always, in crash recovery, archive recovery and
> >     standby mode, WAL files with .ready files are archived if WAL archiver
> >     is running.
> 
> yes

So if we assume archive_mode won't be changed after a crash before
restarting, if archive_mode = on on a standy, WAL files are not marked
as ".ready". If it is "always", WAL files that are to be archived are
marked as ".ready".  Finally, the condition reduces to:

If archiver is running, archive ".ready" files. Otherwise ignore
".ready" and just remove WAL files after use.
> 
> > That is, WAL files with .ready files are removed when either
> > archive_mode!=always in standby mode or archive_mode=off.
> 
> sounds fine to me.

That situation implies that archive_mode has been changed.  Can we
guarantee the completeness of the archive in the case?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

08 April 2020, 11:58:30

On Wed, 8 Apr 2020 11:23:45 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Apr 07, 2020 at 05:17:36PM +0200, Jehan-Guillaume de Rorthais wrote:
> > I'm not sure who would need this information before the WAL machinery is up,
> > but is it safe enough in your opinion for futur usage of
> > GetRecoveryState()? Do you think this information might be useful before
> > the WAL machinery is set? Current "user" (eg. restoreTwoPhaseData()) only
> > need to know if we are in recovery, whatever the reason.  
> 
> (I had this thread marked as something to look at, but could not.
> Sorry for the delay).

(no worries :))

> >  src/test/recovery/t/011_crash_recovery.pl | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/src/test/recovery/t/011_crash_recovery.pl
> > b/src/test/recovery/t/011_crash_recovery.pl index ca6e92b50d..ce2e899891
> > 100644 --- a/src/test/recovery/t/011_crash_recovery.pl
> > +++ b/src/test/recovery/t/011_crash_recovery.pl
> > @@ -15,11 +15,17 @@ if ($Config{osname} eq 'MSWin32')  
> 
> May I ask why this new test is added to 011_crash_recovery.pl which is
> aimed at testing crash and redo, while we have 002_archiving.pl that
> is dedicated to archiving in a more general manner?

I thought it was a better place because the test happen during crash recovery.

In the meantime, while working on other tests related to $SUBJECT and the
current consensus, I was wondering if a new file would be a better place anyway.

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

08 April 2020, 13:26:03

On Wed, 08 Apr 2020 17:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in 
> > > +/* Recovery state */
> > > +typedef enum RecoveryState
> > > +{
> > > +    NOT_IN_RECOVERY = 0,
> > > +    IN_CRASH_RECOVERY,
> > > +    IN_ARCHIVE_RECOVERY
> > > +} RecoveryState;  
> 
> I'm not sure the complexity is required here. Are we asuume that
> archive_mode can be changed before restarting?

I assume it can yes. Eg., one can restore a PITR backup as a standby and change
the value of archive_mode to either off, on or always.

> At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
> <jgdr@dalibo.com> wrote in 
> > > Ok, so our *current* consensus seems the followings. Right?
> > > 
> > > - If archive_mode=off, any WAL files with .ready files are removed in
> > >     crash recovery, archive recoery and standby mode.  
> > 
> > yes  
> 
> If archive_mode = off no WAL files are marked as ".ready".

Sure, on the primary side.

What if you build a standby from a backup with archive_mode=on with
some .ready files in there? 

> > > - If archive_mode=on, WAL files with .ready files are removed only in
> > >     standby mode. In crash recovery and archive recovery cases, they keep
> > >     remaining and would be archived after recovery finishes (i.e., during
> > >     normal processing).  
> > 
> > yes
> >  
> > > - If archive_mode=always, in crash recovery, archive recovery and
> > >     standby mode, WAL files with .ready files are archived if WAL archiver
> > >     is running.  
> > 
> > yes  
> 
> So if we assume archive_mode won't be changed after a crash before
> restarting, if archive_mode = on on a standy, WAL files are not marked
> as ".ready".

.ready files can be inherited from the old primary when building the standby,
depending on the method. See previous explanations from Fujii-san:

https://www.postgresql.org/message-id/flat/ca964b3a-61a0-902e-c7b3-3abbc01a921f%40oss.nttdata.com#ddd6cbad6c5e576e2e1ae53868ca3eea

> If it is "always", WAL files that are to be archived are
> marked as ".ready".  Finally, the condition reduces to:
> 
> If archiver is running, archive ".ready" files. Otherwise ignore
> ".ready" and just remove WAL files after use.
> >   
> > > That is, WAL files with .ready files are removed when either
> > > archive_mode!=always in standby mode or archive_mode=off.  
> > 
> > sounds fine to me.  
> 
> That situation implies that archive_mode has been changed.

Why? archive_mode may have been "always" on the primary when eg. a snapshot has
been created.

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

09 April 2020, 02:26:57

Hello, Jehan.

At Wed, 8 Apr 2020 15:26:03 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> On Wed, 08 Apr 2020 17:39:09 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> 
> > At Tue, 7 Apr 2020 17:17:36 +0200, Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote in 
> > > > +/* Recovery state */
> > > > +typedef enum RecoveryState
> > > > +{
> > > > +    NOT_IN_RECOVERY = 0,
> > > > +    IN_CRASH_RECOVERY,
> > > > +    IN_ARCHIVE_RECOVERY
> > > > +} RecoveryState;  
> > 
> > I'm not sure the complexity is required here. Are we asuume that
> > archive_mode can be changed before restarting?
> 
> I assume it can yes. Eg., one can restore a PITR backup as a standby and change
> the value of archive_mode to either off, on or always.

Thanks. I was confused. The original issue was restarted master can
miss files in archive.  To fix that, it's sufficient not ignoring
.ready. It is more than that.

> > At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
> > <jgdr@dalibo.com> wrote in 
> > > > Ok, so our *current* consensus seems the followings. Right?
> > > > 
> > > > - If archive_mode=off, any WAL files with .ready files are removed in
> > > >     crash recovery, archive recoery and standby mode.  
> > > 
> > > yes  
> > 
> > If archive_mode = off no WAL files are marked as ".ready".
> 
> Sure, on the primary side.
> 
> What if you build a standby from a backup with archive_mode=on with
> some .ready files in there? 

Well. Backup doesn't have nothing in archive_status directory if it is
taken by pg_basebackup. If the backup is created other way, it can
have some (as Fujii-san mentioned).  Master with archive_mode != off
and standby with archive_mode=always should archive WAL files that are
not marked .done, but standby with archive_mode == on should not. The
commit intended that but the mistake here is it thinks that inRecovery
represents whether it is running as a standby or not, but actually it
is true on primary during crash recovery.

On the other hand, with the patch, standby with archive_mode=on
wrongly archives WAL files during crash recovery.

What we should check there is, as the commit was intended, not whether
it is under crash or archive recovery, but whether it is running as
primary or standby.

> > If it is "always", WAL files that are to be archived are
> > marked as ".ready".  Finally, the condition reduces to:
> > 
> > If archiver is running, archive ".ready" files. Otherwise ignore
> > ".ready" and just remove WAL files after use.
> > >   
> > > > That is, WAL files with .ready files are removed when either
> > > > archive_mode!=always in standby mode or archive_mode=off.  
> > > 
> > > sounds fine to me.  
> > 
> > That situation implies that archive_mode has been changed.
> 
> Why? archive_mode may have been "always" on the primary when eg. a snapshot has
> been created.

.ready files are created only when archive_mode != off.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

09 April 2020, 09:35:12

On Thu, 09 Apr 2020 11:26:57 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]
> > > At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
> > > <jgdr@dalibo.com> wrote in   
> > > > > Ok, so our *current* consensus seems the followings. Right?
> > > > > 
> > > > > - If archive_mode=off, any WAL files with .ready files are removed in
> > > > >     crash recovery, archive recoery and standby mode.    
> > > > 
> > > > yes    
> > > 
> > > If archive_mode = off no WAL files are marked as ".ready".  
> > 
> > Sure, on the primary side.
> > 
> > What if you build a standby from a backup with archive_mode=on with
> > some .ready files in there?   
> 
> Well. Backup doesn't have nothing in archive_status directory if it is
> taken by pg_basebackup. If the backup is created other way, it can
> have some (as Fujii-san mentioned).  Master with archive_mode != off
> and standby with archive_mode=always should archive WAL files that are
> not marked .done, but standby with archive_mode == on should not. The
> commit intended that

Unless I'm wrong, the commit avoids creating .ready files on standby when a WAL
has neither .done or .ready status file.

> but the mistake here is it thinks that inRecovery represents whether it is
> running as a standby or not, but actually it is true on primary during crash
> recovery.

Indeed.

> On the other hand, with the patch, standby with archive_mode=on
> wrongly archives WAL files during crash recovery.

"without the patch" you mean? You are talking about 78ea8b5daab, right?

> What we should check there is, as the commit was intended, not whether
> it is under crash or archive recovery, but whether it is running as
> primary or standby.

Yes.

> > > If it is "always", WAL files that are to be archived are
> > > marked as ".ready".  Finally, the condition reduces to:
> > > 
> > > If archiver is running, archive ".ready" files. Otherwise ignore
> > > ".ready" and just remove WAL files after use.  
> > > >     
> > > > > That is, WAL files with .ready files are removed when either
> > > > > archive_mode!=always in standby mode or archive_mode=off.    
> > > > 
> > > > sounds fine to me.    
> > > 
> > > That situation implies that archive_mode has been changed.  
> > 
> > Why? archive_mode may have been "always" on the primary when eg. a snapshot
> > has been created.  
> 
> .ready files are created only when archive_mode != off.

Yes, on a primary, they are created when archive_mode > off. On standby, when
archive_mode=always. If a primary had archive_mode=always, a standby created
from its backup will still have archive_mode=always, with no changes.
Maybe I miss your point, sorry.

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

09 April 2020, 16:46:22

On Wed, 8 Apr 2020 13:58:30 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
[...]
> > May I ask why this new test is added to 011_crash_recovery.pl which is
> > aimed at testing crash and redo, while we have 002_archiving.pl that
> > is dedicated to archiving in a more general manner?  
> 
> I thought it was a better place because the test happen during crash recovery.
> 
> In the meantime, while working on other tests related to $SUBJECT and the
> current consensus, I was wondering if a new file would be a better place
> anyway.

So, 002_archiving.pl deals more with testing recovering on hot standby side
than archiving. Maybe it could be renamed?

While discussing this, I created a new file to add some more tests about WAL
archiving and how recovery deal with them. Please, find the patch in
attachment. I'll be able to move them elsewhere later, depending on the
conclusions of this discussion.

Regards,

Attachment

0002-v3-Add-tests-relative-to-WAL-archiving.patch

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

10 April 2020, 02:00:31

By the way, I haven't noticed that Cc: didn't contain -hackers.  Added
it.


At Thu, 9 Apr 2020 11:35:12 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> On Thu, 09 Apr 2020 11:26:57 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> [...]
> > > > At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
> > > > <jgdr@dalibo.com> wrote in   
> > > > > > Ok, so our *current* consensus seems the followings. Right?
> > > > > > 
> > > > > > - If archive_mode=off, any WAL files with .ready files are removed in
> > > > > >     crash recovery, archive recoery and standby mode.    
> > > > > 
> > > > > yes    
> > > > 
> > > > If archive_mode = off no WAL files are marked as ".ready".  
> > > 
> > > Sure, on the primary side.
> > > 
> > > What if you build a standby from a backup with archive_mode=on with
> > > some .ready files in there?   
> > 
> > Well. Backup doesn't have nothing in archive_status directory if it is
> > taken by pg_basebackup. If the backup is created other way, it can
> > have some (as Fujii-san mentioned).  Master with archive_mode != off
> > and standby with archive_mode=always should archive WAL files that are
> > not marked .done, but standby with archive_mode == on should not. The
> > commit intended that
> 
> Unless I'm wrong, the commit avoids creating .ready files on standby when a WAL
> has neither .done or .ready status file.

Right.

> > but the mistake here is it thinks that inRecovery represents whether it is
> > running as a standby or not, but actually it is true on primary during crash
> > recovery.
> 
> Indeed.
> 
> > On the other hand, with the patch, standby with archive_mode=on
> > wrongly archives WAL files during crash recovery.
> 
> "without the patch" you mean? You are talking about 78ea8b5daab, right?

No. I menat the v4 patch in [1].

[1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost

Prior to appllying the patch (that is the commit 78ea..),
XLogArchiveCheckDone() correctly returns true (= allow to remove it)
for the same condition.

The proposed patch does the folloing thing.

if (!XLogArchivingActive() ||
    recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
    return true;

It doesn't return for the condition "recoveryState=CRASH_RECOVERING
and archive_mode = on". Then the WAL files is mitakenly marked as
".ready" if not marked yet.

By the way, the code seems not following the convention a bit
here. Let the inserting code be in the same style to the existing code
around.

+    if ( ! XLogArchivingActive() )

I think we don't put the  spaces within the parentheses above.

| ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY

The first two and the last one are in different style. *I* prefer them
(if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

> > What we should check there is, as the commit was intended, not whether
> > it is under crash or archive recovery, but whether it is running as
> > primary or standby.
> 
> Yes.
> 
> > > > If it is "always", WAL files that are to be archived are
> > > > marked as ".ready".  Finally, the condition reduces to:
> > > > 
> > > > If archiver is running, archive ".ready" files. Otherwise ignore
> > > > ".ready" and just remove WAL files after use.  
> > > > >     
> > > > > > That is, WAL files with .ready files are removed when either
> > > > > > archive_mode!=always in standby mode or archive_mode=off.    
> > > > > 
> > > > > sounds fine to me.    
> > > > 
> > > > That situation implies that archive_mode has been changed.  
> > > 
> > > Why? archive_mode may have been "always" on the primary when eg. a snapshot
> > > has been created.  
> > 
> > .ready files are created only when archive_mode != off.
> 
> Yes, on a primary, they are created when archive_mode > off. On standby, when
> archive_mode=always. If a primary had archive_mode=always, a standby created
> from its backup will still have archive_mode=always, with no changes.
> Maybe I miss your point, sorry.

Sorry, it was ambiguous.

> That is, WAL files with .ready files are removed when either
> archive_mode!=always in standby mode or archive_mode=off.    

If we have .ready file when archive_mode = off, the cluster (or the
original of the copy cluster) should have been running in archive = on
or always. That is, archive_mode has been changed. But anyway that
discussion would not be in much relevance.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

10 April 2020, 02:00:31

By the way, I haven't noticed that Cc: didn't contain -hackers.  Added
it.


At Thu, 9 Apr 2020 11:35:12 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> On Thu, 09 Apr 2020 11:26:57 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> [...]
> > > > At Thu, 2 Apr 2020 15:49:15 +0200, Jehan-Guillaume de Rorthais
> > > > <jgdr@dalibo.com> wrote in   
> > > > > > Ok, so our *current* consensus seems the followings. Right?
> > > > > > 
> > > > > > - If archive_mode=off, any WAL files with .ready files are removed in
> > > > > >     crash recovery, archive recoery and standby mode.    
> > > > > 
> > > > > yes    
> > > > 
> > > > If archive_mode = off no WAL files are marked as ".ready".  
> > > 
> > > Sure, on the primary side.
> > > 
> > > What if you build a standby from a backup with archive_mode=on with
> > > some .ready files in there?   
> > 
> > Well. Backup doesn't have nothing in archive_status directory if it is
> > taken by pg_basebackup. If the backup is created other way, it can
> > have some (as Fujii-san mentioned).  Master with archive_mode != off
> > and standby with archive_mode=always should archive WAL files that are
> > not marked .done, but standby with archive_mode == on should not. The
> > commit intended that
> 
> Unless I'm wrong, the commit avoids creating .ready files on standby when a WAL
> has neither .done or .ready status file.

Right.

> > but the mistake here is it thinks that inRecovery represents whether it is
> > running as a standby or not, but actually it is true on primary during crash
> > recovery.
> 
> Indeed.
> 
> > On the other hand, with the patch, standby with archive_mode=on
> > wrongly archives WAL files during crash recovery.
> 
> "without the patch" you mean? You are talking about 78ea8b5daab, right?

No. I menat the v4 patch in [1].

[1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost

Prior to appllying the patch (that is the commit 78ea..),
XLogArchiveCheckDone() correctly returns true (= allow to remove it)
for the same condition.

The proposed patch does the folloing thing.

if (!XLogArchivingActive() ||
    recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
    return true;

It doesn't return for the condition "recoveryState=CRASH_RECOVERING
and archive_mode = on". Then the WAL files is mitakenly marked as
".ready" if not marked yet.

By the way, the code seems not following the convention a bit
here. Let the inserting code be in the same style to the existing code
around.

+    if ( ! XLogArchivingActive() )

I think we don't put the  spaces within the parentheses above.

| ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY

The first two and the last one are in different style. *I* prefer them
(if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

> > What we should check there is, as the commit was intended, not whether
> > it is under crash or archive recovery, but whether it is running as
> > primary or standby.
> 
> Yes.
> 
> > > > If it is "always", WAL files that are to be archived are
> > > > marked as ".ready".  Finally, the condition reduces to:
> > > > 
> > > > If archiver is running, archive ".ready" files. Otherwise ignore
> > > > ".ready" and just remove WAL files after use.  
> > > > >     
> > > > > > That is, WAL files with .ready files are removed when either
> > > > > > archive_mode!=always in standby mode or archive_mode=off.    
> > > > > 
> > > > > sounds fine to me.    
> > > > 
> > > > That situation implies that archive_mode has been changed.  
> > > 
> > > Why? archive_mode may have been "always" on the primary when eg. a snapshot
> > > has been created.  
> > 
> > .ready files are created only when archive_mode != off.
> 
> Yes, on a primary, they are created when archive_mode > off. On standby, when
> archive_mode=always. If a primary had archive_mode=always, a standby created
> from its backup will still have archive_mode=always, with no changes.
> Maybe I miss your point, sorry.

Sorry, it was ambiguous.

> That is, WAL files with .ready files are removed when either
> archive_mode!=always in standby mode or archive_mode=off.    

If we have .ready file when archive_mode = off, the cluster (or the
original of the copy cluster) should have been running in archive = on
or always. That is, archive_mode has been changed. But anyway that
discussion would not be in much relevance.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

10 April 2020, 02:14:54

At Thu, 9 Apr 2020 18:46:22 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in 
> On Wed, 8 Apr 2020 13:58:30 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> [...]
> > > May I ask why this new test is added to 011_crash_recovery.pl which is
> > > aimed at testing crash and redo, while we have 002_archiving.pl that
> > > is dedicated to archiving in a more general manner?  
> > 
> > I thought it was a better place because the test happen during crash recovery.
> > 
> > In the meantime, while working on other tests related to $SUBJECT and the
> > current consensus, I was wondering if a new file would be a better place
> > anyway.
> 
> So, 002_archiving.pl deals more with testing recovering on hot standby side
> than archiving. Maybe it could be renamed?

I have the same feeling with Michael.  The test that archives are
created correctly seems to fit the file.  It would be unintentionally
that the file is not exercising archiving so much.

> While discussing this, I created a new file to add some more tests about WAL
> archiving and how recovery deal with them. Please, find the patch in
> attachment. I'll be able to move them elsewhere later, depending on the
> conclusions of this discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

13 April 2020, 07:14:14

On Fri, Apr 10, 2020 at 11:14:54AM +0900, Kyotaro Horiguchi wrote:
> I have the same feeling with Michael.  The test that archives are
> created correctly seems to fit the file.  It would be unintentionally
> that the file is not exercising archiving so much.

I have been finally looking at this thread and the latest patch set,
sorry for the late reply.

    XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
-   XLogCtl->SharedRecoveryInProgress = true;
    XLogCtl->SharedHotStandbyActive = false;
    XLogCtl->SharedPromoteIsTriggered = false;
    XLogCtl->WalWriterSleeping = false;
[...]
+   switch (ControlFile->state)
+   {
+       case DB_SHUTDOWNED:
+       case DB_SHUTDOWNED_IN_RECOVERY:
+           XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+           break;
+       default:
+           XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
+   }
It seems to me that the initial value of SharedRecoveryState should be
CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
even if starting cleanly, and the flag would be reset correctly at the
end to NOT_IN_RECOVERY.

+typedef enum RecoveryState
+{
+   NOT_IN_RECOVERY = 0, /* currently in production */
+   CRASH_RECOVERING,    /* recovering from a crash */
+   ARCHIVE_RECOVERING   /* recovering archives as requested */
+} RecoveryState;
I also have some issues with the name of those variables, here is an
idea for the three states:
- RECOVERY_STATE_CRASH
- RECOVERT_STATE_ARCHIVE
- RECOVERY_STATE_NONE
I would recommend to use the same suffix for those variables to ease
grepping.

 /*
- * Local copy of SharedRecoveryInProgress variable. True actually means "not
- * known, need to check the shared state".
+ * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
+ * True actually means "not known, need to check the shared state".
  */
A double negation sounds wrong to me.  And actually this variable is
false when the shared state is set to NOT_IN_RECOVERY, and true when
the state is either CRASH_RECOVERING or ARCHIVE_RECOVERING because it
means that recovery is running, be it archive recovery or crash
recovery, so the comment is wrong.

+   /* The file is always deletable if archive_mode is "off". */
+   if ( ! XLogArchivingActive() )
+       return true;
[...]
+   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
        return true;
Incorrect indentation.

                UpdateControlFile();
+
+               SpinLockAcquire(&XLogCtl->info_lck);
+               XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
+               SpinLockRelease(&XLogCtl->info_lck);
+
                LWLockRelease(ControlFileLock);
Here, the shared flag is updated while holding ControlFileLock to
ensure a consistent state of things within shared memory, so it would
be nice to add a comment about that.

+RecoveryState
+GetRecoveryState(void)
+{
+   volatile XLogCtlData *xlogctl = XLogCtl;
+
+   return xlogctl->SharedRecoveryState;
+}
Er, you need to acquire info_lck to look at the state here, no?

    /*
-    * The file is always deletable if archive_mode is "off".  On standbys
-    * archiving is disabled if archive_mode is "on", and enabled with
-    * "always".  On a primary, archiving is enabled if archive_mode is "on"
-    * or "always".
+    * On standbys, the file is deletable if archive_mode is not
+    * "always".
     */
It would be good to mention that while in crash recovery, files are
not considered as deletable.

I agree that having a separate .pl file for the tests of this thread
is just cleaner.  Here are more comments about these.

+# temporary fail archive_command for futur tests
+$node->safe_psql('postgres', q{
+   ALTER SYSTEM SET archive_command TO 'false';
+   SELECT pg_reload_conf();
+});
That's likely portable on Windows even if you skip the tests there,
and I am not sure that it is a good idea to rely on it being in PATH.
Depending on the system, the path of the command is also likely going
to be different.  As here the goal is to prevent the archiver to do
its work, why not relying on the configuration where archive_mode is
set but archive_command is not?  This would cause the archiver to be a
no-op process, and .ready files will remain around.  You could then
replace the lookup of pg_stat_archiver with poll_query_until() and a
query that makes use of pg_stat_file() to make sure that the .ready
exists when needed.

+ ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
+   "WAL still ready to archive in archive_status");
It would be good to mention in the description the check applies to a
primary.

+# test recovery without archive_mode=always does not keep .ready WALs
+$standby1 = get_new_node('standby');
+$standby1->init_from_backup($node, 'backup', has_restoring => 1);
+$standby1_data = $standby1->data_dir;
+$standby1->start;
+$standby1->safe_psql('postgres', q{CHECKPOINT});
For readability archive_mode = on should be added to the configuration
file?  Okay, this is inherited from the primary, still that would
avoid any issues with this code is refactored in some way.

"WAL waiting to be archived in backup removed with archive_mode=on
  on standby" );
That should be "WAL segment" or "WAL file", but not WAL.

Regarding the tests on a standby, it seems to me that the following
is necessary:
1) Test that with archive_mode = on, segments are not marked with
.ready.
2) Test that with archive_mode = always, segments are marked with
.ready during archive recovery.
3) Test that with archive_mode = always, segments are not removed
during crash recovery.
I can see tests for 1) and 2), but not 3).  Could you add a
stop('immediate')+start() for $standby2 at the end of
020_archive_status.pl and check that the .ready file is still there
after crash recovery?  The end of the tests actually relies on the
fact that archive_command is set to "false" when the cold backup is
taken, before resetting it.  I think that it would be cleaner to
enforce the configuration you want to test before starting each
standby.  It becomes easier to understand the flow of the test for the
reader.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

14 April 2020, 16:03:19

On Mon, 13 Apr 2020 16:14:14 +0900
Michael Paquier <michael@paquier.xyz> wrote:
[...]
>     XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
> -   XLogCtl->SharedRecoveryInProgress = true;
>     XLogCtl->SharedHotStandbyActive = false;
>     XLogCtl->SharedPromoteIsTriggered = false;
>     XLogCtl->WalWriterSleeping = false;
> [...]
> +   switch (ControlFile->state)
> +   {
> +       case DB_SHUTDOWNED:
> +       case DB_SHUTDOWNED_IN_RECOVERY:
> +           XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
> +           break;
> +       default:
> +           XLogCtl->SharedRecoveryState = CRASH_RECOVERING;
> +   }
> It seems to me that the initial value of SharedRecoveryState should be
> CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
> even if starting cleanly, and the flag would be reset correctly at the
> end to NOT_IN_RECOVERY.

Previous version of the patch was setting CRASH_RECOVERING. Fujii-san reported
(the 4 Apr 2020 02:49:50 +0900) that it could be useful to expose a better value
until relevant code is reached in StartupXLOG() so GetRecoveryState() returns
a safer value for futur use.

As I answered upthread, I'm not sure who would need this information before the
WAL machinery is up though. Note that ARCHIVE_RECOVERING and CRASH_RECOVERING
are compatible with the previous behavior.

Maybe the solution would be to init with CRASH_RECOVERING and add some comment
in GetRecoveryState() to warn the state is "enforced" after the XLOG machinery
is started and is init'ed to RECOVERING in the meantime?

I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
to GetRecoveryState().

> +typedef enum RecoveryState
> +{
> +   NOT_IN_RECOVERY = 0, /* currently in production */
> +   CRASH_RECOVERING,    /* recovering from a crash */
> +   ARCHIVE_RECOVERING   /* recovering archives as requested */
> +} RecoveryState;
> I also have some issues with the name of those variables, here is an
> idea for the three states:
> - RECOVERY_STATE_CRASH
> - RECOVERT_STATE_ARCHIVE
> - RECOVERY_STATE_NONE
> I would recommend to use the same suffix for those variables to ease
> grepping.

Sounds really good to me. Thanks!

>  /*
> - * Local copy of SharedRecoveryInProgress variable. True actually means "not
> - * known, need to check the shared state".
> + * This is false when SharedRecoveryState is not ARCHIVE_RECOVERING.
> + * True actually means "not known, need to check the shared state".
>   */
> A double negation sounds wrong to me.  And actually this variable is
> false when the shared state is set to NOT_IN_RECOVERY, and true when
> the state is either CRASH_RECOVERING or ARCHIVE_RECOVERING because it
> means that recovery is running, be it archive recovery or crash
> recovery, so the comment is wrong.

Indeed, sorry. Fixed.

> +   /* The file is always deletable if archive_mode is "off". */
> +   if ( ! XLogArchivingActive() )
> +       return true;
> [...]
> +   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
>         return true;
> Incorrect indentation.

Is it the spaces as reported by Horiguchi-san? I removed them in latest patch.

>                 UpdateControlFile();
> +
> +               SpinLockAcquire(&XLogCtl->info_lck);
> +               XLogCtl->SharedRecoveryState = ARCHIVE_RECOVERING;
> +               SpinLockRelease(&XLogCtl->info_lck);
> +
>                 LWLockRelease(ControlFileLock);
> Here, the shared flag is updated while holding ControlFileLock to
> ensure a consistent state of things within shared memory, so it would
> be nice to add a comment about that.

Indeed. the original code had no such comment and I asked myself the same
question. Added.

> +RecoveryState
> +GetRecoveryState(void)
> +{
> +   volatile XLogCtlData *xlogctl = XLogCtl;
> +
> +   return xlogctl->SharedRecoveryState;
> +}
> Er, you need to acquire info_lck to look at the state here, no?

Yes, fixed.

>     /*
> -    * The file is always deletable if archive_mode is "off".  On standbys
> -    * archiving is disabled if archive_mode is "on", and enabled with
> -    * "always".  On a primary, archiving is enabled if archive_mode is "on"
> -    * or "always".
> +    * On standbys, the file is deletable if archive_mode is not
> +    * "always".
>      */
> It would be good to mention that while in crash recovery, files are
> not considered as deletable.

Well, in fact, I am still wondering about this. I was hesitant to add a
shortcut like:

  /* no WAL segment cleanup during crash recovery */
  if (recoveryState == RECOVERT_STATE_CRASH)
      return false;

But, what if for example we crashed for lack of disk space during intensive
wirtes? During crash recovery, any WAL marked as .done could be removed and
allow the system to start again and maybe make even further WAL cleanup by
archiving some more WAL without competing with previous high write ratio.

When we recover a primary, this behavior seems conform with any value of
archive_mode, even if it has been changed after crash and before starting it.
On a standby, we might create .ready files, but they will be removed during the
first restartpoint if needed.

> I agree that having a separate .pl file for the tests of this thread
> is just cleaner.  Here are more comments about these.
> 
> +# temporary fail archive_command for futur tests
> +$node->safe_psql('postgres', q{
> +   ALTER SYSTEM SET archive_command TO 'false';
> +   SELECT pg_reload_conf();
> +});
> That's likely portable on Windows even if you skip the tests there,
> and I am not sure that it is a good idea to rely on it being in PATH.
> Depending on the system, the path of the command is also likely going
> to be different.  As here the goal is to prevent the archiver to do
> its work, why not relying on the configuration where archive_mode is
> set but archive_command is not?  This would cause the archiver to be a
> no-op process, and .ready files will remain around.  You could then
> replace the lookup of pg_stat_archiver with poll_query_until() and a
> query that makes use of pg_stat_file() to make sure that the .ready
> exists when needed.

I needed a failure so I can test pg_stat_archiver reports it as well. In my
mind, using "false" would either trigger a failure because false returns 1 or...
a failure because the command is not found. In either way, the result is the
same.

Using poll_query_until+pg_stat_file, is a good idea, but not enough as
archiver reports a failure some moment after the .ready signal file appears.

> + ok( -f "$node_data/pg_wal/archive_status/000000010000000000000001.ready",
> +   "WAL still ready to archive in archive_status");
> It would be good to mention in the description the check applies to a
> primary.

done

> +# test recovery without archive_mode=always does not keep .ready WALs
> +$standby1 = get_new_node('standby');
> +$standby1->init_from_backup($node, 'backup', has_restoring => 1);
> +$standby1_data = $standby1->data_dir;
> +$standby1->start;
> +$standby1->safe_psql('postgres', q{CHECKPOINT});
> For readability archive_mode = on should be added to the configuration
> file?  Okay, this is inherited from the primary, still that would
> avoid any issues with this code is refactored in some way.

added

> "WAL waiting to be archived in backup removed with archive_mode=on
>   on
> standby" ); That should be "WAL segment" or "WAL file", but not WAL.

updated everywhere.

> Regarding the tests on a standby, it seems to me that the following
> is necessary:
> 1) Test that with archive_mode = on, segments are not marked with
> .ready.
> 2) Test that with archive_mode = always, segments are marked with
> .ready during archive recovery.
> 3) Test that with archive_mode = always, segments are not removed
> during crash recovery.
> I can see tests for 1) and 2),

Not really. The current tests does not check that segments created *after* the
backup are marked or not with .ready file on standby. I added these tests plus
some various other ones.

> but not 3).  Could you add a
> stop('immediate')+start() for $standby2 at the end of
> 020_archive_status.pl and check that the .ready file is still there
> after crash recovery?

done.

> The end of the tests actually relies on the
> fact that archive_command is set to "false" when the cold backup is
> taken, before resetting it.  I think that it would be cleaner to
> enforce the configuration you want to test before starting each
> standby.  It becomes easier to understand the flow of the test for the
> reader.

done as well.

Thank you for your review!

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

14 April 2020, 16:09:23

On Fri, 10 Apr 2020 11:00:31 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]
> > > but the mistake here is it thinks that inRecovery represents whether it is
> > > running as a standby or not, but actually it is true on primary during
> > > crash recovery.  
> > 
> > Indeed.
> >   
> > > On the other hand, with the patch, standby with archive_mode=on
> > > wrongly archives WAL files during crash recovery.  
> > 
> > "without the patch" you mean? You are talking about 78ea8b5daab, right?  
> 
> No. I menat the v4 patch in [1].
> 
> [1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost
> 
> Prior to appllying the patch (that is the commit 78ea..),
> XLogArchiveCheckDone() correctly returns true (= allow to remove it)
> for the same condition.
> 
> The proposed patch does the folloing thing.
> 
> if (!XLogArchivingActive() ||
>     recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
>     return true;
> 
> It doesn't return for the condition "recoveryState=CRASH_RECOVERING
> and archive_mode = on". Then the WAL files is mitakenly marked as
> ".ready" if not marked yet.

Indeed.

But .ready files are then deleted during the first restartpoint. I'm not sure
how to fix this behavior without making the code too complex.

This is discussed in my last answer to Michael few minutes ago as well.

> By the way, the code seems not following the convention a bit
> here. Let the inserting code be in the same style to the existing code
> around.
> 
> +    if ( ! XLogArchivingActive() )
> 
> I think we don't put the  spaces within the parentheses above.

Indeed, This is fixed in patch v5 sent few minutes ago.

> | ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY
> 
> The first two and the last one are in different style. *I* prefer them
> (if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

I like Michael's proposal. See v5 of the patch.

Thank you for your review!

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

14 April 2020, 16:09:23

On Fri, 10 Apr 2020 11:00:31 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
[...]
> > > but the mistake here is it thinks that inRecovery represents whether it is
> > > running as a standby or not, but actually it is true on primary during
> > > crash recovery.  
> > 
> > Indeed.
> >   
> > > On the other hand, with the patch, standby with archive_mode=on
> > > wrongly archives WAL files during crash recovery.  
> > 
> > "without the patch" you mean? You are talking about 78ea8b5daab, right?  
> 
> No. I menat the v4 patch in [1].
> 
> [1] https://www.postgresql.org/message-id/20200407171736.61906608%40firost
> 
> Prior to appllying the patch (that is the commit 78ea..),
> XLogArchiveCheckDone() correctly returns true (= allow to remove it)
> for the same condition.
> 
> The proposed patch does the folloing thing.
> 
> if (!XLogArchivingActive() ||
>     recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways())
>     return true;
> 
> It doesn't return for the condition "recoveryState=CRASH_RECOVERING
> and archive_mode = on". Then the WAL files is mitakenly marked as
> ".ready" if not marked yet.

Indeed.

But .ready files are then deleted during the first restartpoint. I'm not sure
how to fix this behavior without making the code too complex.

This is discussed in my last answer to Michael few minutes ago as well.

> By the way, the code seems not following the convention a bit
> here. Let the inserting code be in the same style to the existing code
> around.
> 
> +    if ( ! XLogArchivingActive() )
> 
> I think we don't put the  spaces within the parentheses above.

Indeed, This is fixed in patch v5 sent few minutes ago.

> | ARCHIVE_RECOVERING/CRASH_RECOVERING/NOT_IN_RECOVERY
> 
> The first two and the last one are in different style. *I* prefer them
> (if we use it) be "IN_ARCHIVE_RECOVERY/IN_CRASH_RECOVERY/NOT_IN_RECOVERY".

I like Michael's proposal. See v5 of the patch.

Thank you for your review!

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

16 April 2020, 08:11:00

On Tue, Apr 14, 2020 at 06:03:19PM +0200, Jehan-Guillaume de Rorthais wrote:

Thanks for the new version.

> On Mon, 13 Apr 2020 16:14:14 +0900 Michael Paquier <michael@paquier.xyz> wrote:
>> It seems to me that the initial value of SharedRecoveryState should be
>> CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
>> even if starting cleanly, and the flag would be reset correctly at the
>> end to NOT_IN_RECOVERY.
>
> Previous version of the patch was setting CRASH_RECOVERING. Fujii-san reported
> (the 4 Apr 2020 02:49:50 +0900) that it could be useful to expose a better value
> until relevant code is reached in StartupXLOG() so GetRecoveryState() returns
> a safer value for futur use.
>
> As I answered upthread, I'm not sure who would need this information before the
> WAL machinery is up though. Note that ARCHIVE_RECOVERING and CRASH_RECOVERING
> are compatible with the previous behavior.

I am not sure either if we need this information until the startup
process is up, but even if we need it I'd rather keep the code
consistent with the past practice, which was that
SharedRecoveryInProgress got set to true, the equivalent of crash
recovery as that's more generic than the archive recovery switch.

> Maybe the solution would be to init with CRASH_RECOVERING and add some comment
> in GetRecoveryState() to warn the state is "enforced" after the XLOG machinery
> is started and is init'ed to RECOVERING in the meantime?
>
> I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
> to GetRecoveryState().

Not sure that the comment is worth it.  Your description of the state
looks enough, and the code is clear that we have just an initial
shared memory state in this case.

>> +   /* The file is always deletable if archive_mode is "off". */
>> +   if ( ! XLogArchivingActive() )
>> +       return true;
>> [...]
>> +   if ( recoveryState == ARCHIVE_RECOVERING && !XLogArchivingAlways() )
>>         return true;
>> Incorrect indentation.
>
> Is it the spaces as reported by Horiguchi-san? I removed them in latest patch.

Yes.

>> It would be good to mention that while in crash recovery, files are
>> not considered as deletable.
>
> Well, in fact, I am still wondering about this. I was hesitant to add a
> shortcut like:
>
>   /* no WAL segment cleanup during crash recovery */
>   if (recoveryState == RECOVERT_STATE_CRASH)
>       return false;
>
> But, what if for example we crashed for lack of disk space during intensive
> writes?  During crash recovery, any WAL marked as .done could be removed and
> allow the system to start again and maybe make even further WAL cleanup by
> archiving some more WAL without competing with previous high write ratio.
>
> When we recover a primary, this behavior seems conform with any value of
> archive_mode, even if it has been changed after crash and before starting it.
> On a standby, we might create .ready files, but they will be removed during the
> first restartpoint if needed.

I guess that you mean .ready and not .done here?  Removing .done files
does not matter as the segments related to them are already gone.
Even with that, why should we need to make the backend smarter about
the removal of .ready files during crash recovery.  It seems to me
that we should keep them, and an operator could always come by himself
and do some manual cleanup to free some space in the pg_wal partition.

> I needed a failure so I can test pg_stat_archiver reports it as well. In my
> mind, using "false" would either trigger a failure because false returns 1 or...
> a failure because the command is not found. In either way, the result is the
> same.
>
> Using poll_query_until+pg_stat_file, is a good idea, but not enough as
> archiver reports a failure some moment after the .ready signal file appears.

Using an empty string makes the test more portable, but while I looked
at it I have found out that it leads to delays in the archiver except
if you force the generation of more segments in the test, causing the
logic to get more complicated with the manipulation of the .ready and
.done files.  And I was then finding myself to add an
archive_timeout..  Anyway, this reduced the readability of the test so
I am pretty much giving up on this idea.

>> The end of the tests actually relies on the
>> fact that archive_command is set to "false" when the cold backup is
>> taken, before resetting it.  I think that it would be cleaner to
>> enforce the configuration you want to test before starting each
>> standby.  It becomes easier to understand the flow of the test for the
>> reader.
>
> done as well.

I have put my hands on the code, and attached is a cleaned up
version for the backend part.  Below are some notes.

+ * RECOVERT_STATE_ARCHIVE is set for archive recovery or for a
standby.
Typo here that actually comes from my previous email, and that you
blindly copy-pasted, repeated five times in the tree actually.

+   RecoveryState   recoveryState = GetRecoveryState();
+
+   /* The file is always deletable if archive_mode is "off". */
+   if (!XLogArchivingActive())
+       return true;
There is no point to call GetRecoveryState() is archive_mode = off.

+ * There's two different reasons to recover WAL: when standby mode is
requested
+ * or after a crash to catchup with consistency.
Nit: s/There's/There are/.  Anyway, I don't see much point in keeping
this comment as the description of each state value should be enough,
so I have removed it.

I am currently in the middle of reviewing the test and there are a
couple of things that can be improved but I lack of time today, so
I'll continue tomorrow on it.  There is no need to send two separate
patches by the way as the code paths touched are different.
--
Michael

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

16 April 2020, 22:07:39

On Thu, 16 Apr 2020 17:11:00 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Tue, Apr 14, 2020 at 06:03:19PM +0200, Jehan-Guillaume de Rorthais wrote:
> 
> Thanks for the new version.

Thank you for v6.

> > On Mon, 13 Apr 2020 16:14:14 +0900 Michael Paquier <michael@paquier.xyz>
> > wrote:  
> >> It seems to me that the initial value of SharedRecoveryState should be
> >> CRASH_RECOVERING all the time no?  StartupXLOG() is a code path taken
> >> even if starting cleanly, and the flag would be reset correctly at the
> >> end to NOT_IN_RECOVERY.  
> > 
> > Previous version of the patch was setting CRASH_RECOVERING. Fujii-san
> > reported (the 4 Apr 2020 02:49:50 +0900) that it could be useful to expose
> > a better value until relevant code is reached in StartupXLOG() so
> > GetRecoveryState() returns a safer value for futur use.
> >
> > As I answered upthread, I'm not sure who would need this information before
> > the WAL machinery is up though. Note that ARCHIVE_RECOVERING and
> > CRASH_RECOVERING are compatible with the previous behavior.  
> 
> I am not sure either if we need this information until the startup
> process is up, but even if we need it I'd rather keep the code
> consistent with the past practice, which was that
> SharedRecoveryInProgress got set to true, the equivalent of crash
> recovery as that's more generic than the archive recovery switch.

OK.

> > Maybe the solution would be to init with CRASH_RECOVERING and add some
> > comment in GetRecoveryState() to warn the state is "enforced" after the
> > XLOG machinery is started and is init'ed to RECOVERING in the meantime?
> > 
> > I initialized it to CRASH_RECOVERING in the new v5 patch and added a comment
> > to GetRecoveryState().  
> 
> Not sure that the comment is worth it.  Your description of the state
> looks enough, and the code is clear that we have just an initial
> shared memory state in this case.

OK. Will remove later after your review of the tests.

> >> It would be good to mention that while in crash recovery, files are
> >> not considered as deletable.  
> > 
> > Well, in fact, I am still wondering about this. I was hesitant to add a
> > shortcut like:
> > 
> >   /* no WAL segment cleanup during crash recovery */
> >   if (recoveryState == RECOVERT_STATE_CRASH)
> >       return false;
> > 
> > But, what if for example we crashed for lack of disk space during intensive
> > writes?  During crash recovery, any WAL marked as .done could be removed and
> > allow the system to start again and maybe make even further WAL cleanup by
> > archiving some more WAL without competing with previous high write ratio.
> >
> > When we recover a primary, this behavior seems conform with any value of
> > archive_mode, even if it has been changed after crash and before starting
> > it. On a standby, we might create .ready files, but they will be removed
> > during the first restartpoint if needed.  
> 
> I guess that you mean .ready and not .done here?

No, I meant .done.

> Removing .done files does not matter as the segments related to them are
> already gone.

Not necessarily. There's a time windows between the moment the archiver set
the .done file and when the checkpointer removes the associated WAL file.
So, after a PANIC because of lack of space, WAL associated with .done files but
not removed yet will be removed during the crash recovery.

> Even with that, why should we need to make the backend smarter about
> the removal of .ready files during crash recovery.  It seems to me
> that we should keep them, and an operator could always come by himself
> and do some manual cleanup to free some space in the pg_wal partition.

We are agree on this.

> > I needed a failure so I can test pg_stat_archiver reports it as well. In my
> > mind, using "false" would either trigger a failure because false returns 1
> > or... a failure because the command is not found. In either way, the result
> > is the same.
> > 
> > Using poll_query_until+pg_stat_file, is a good idea, but not enough as
> > archiver reports a failure some moment after the .ready signal file
> > appears.  
> 
> Using an empty string makes the test more portable, but while I looked
> at it I have found out that it leads to delays in the archiver except
> if you force the generation of more segments in the test, causing the
> logic to get more complicated with the manipulation of the .ready and
> .done files.  And I was then finding myself to add an
> archive_timeout..  Anyway, this reduced the readability of the test so
> I am pretty much giving up on this idea.

Unless I'm wrong, the empty string does not raise an error in pg_stat_archiver,
and I wanted to add a test on this as well.

> >> The end of the tests actually relies on the
> >> fact that archive_command is set to "false" when the cold backup is
> >> taken, before resetting it.  I think that it would be cleaner to
> >> enforce the configuration you want to test before starting each
> >> standby.  It becomes easier to understand the flow of the test for the
> >> reader.  
> > 
> > done as well.  
> 
> I have put my hands on the code, and attached is a cleaned up
> version for the backend part.  Below are some notes.
> 
> + * RECOVERT_STATE_ARCHIVE is set for archive recovery or for a
> standby.
> Typo here that actually comes from my previous email, and that you
> blindly copy-pasted, repeated five times in the tree actually.

Oops...yes, even in a comment with RECOVERT_STATE_NONE :/
Sorry.

> +   RecoveryState   recoveryState = GetRecoveryState();
> +
> +   /* The file is always deletable if archive_mode is "off". */
> +   if (!XLogArchivingActive())
> +       return true;
> There is no point to call GetRecoveryState() is archive_mode = off.

good catch!

> + * There's two different reasons to recover WAL: when standby mode is
> requested
> + * or after a crash to catchup with consistency.
> Nit: s/There's/There are/.  Anyway, I don't see much point in keeping
> this comment as the description of each state value should be enough,
> so I have removed it.

OK

> I am currently in the middle of reviewing the test and there are a
> couple of things that can be improved but I lack of time today, so
> I'll continue tomorrow on it.  There is no need to send two separate
> patches by the way as the code paths touched are different.

OK.

Thanks for your review! Let me know if you want me to add/change/fix some tests.

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

17 April 2020, 06:50:43

On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz> wrote:
>> Removing .done files does not matter as the segments related to them are
>> already gone.
>
> Not necessarily. There's a time windows between the moment the archiver set
> the .done file and when the checkpointer removes the associated WAL file.
> So, after a PANIC because of lack of space, WAL associated with .done files but
> not removed yet will be removed during the crash recovery.

Not sure that it is something that matters for this thread though, so
if necessary I think that it could be be discussed separately.

>> Even with that, why should we need to make the backend smarter about
>> the removal of .ready files during crash recovery.  It seems to me
>> that we should keep them, and an operator could always come by himself
>> and do some manual cleanup to free some space in the pg_wal partition.
>
> We are agree on this.

Okay.

> Unless I'm wrong, the empty string does not raise an error in pg_stat_archiver,
> and I wanted to add a test on this as well.

Exactly, it won't raise an error.  Instead I switched to use a poll
query with pg_stat_file() and .ready files, but this has proved to
delay the test considerably if we did not create more segments.  And
your approach has the merit to be more simple with only two segments
manipulated for the whole test.  So I have tried first my idea,
noticed the mess it introduced, and just kept your approach.

> Thanks for your review! Let me know if you want me to add/change/fix some tests.

Thanks, I have worked more on the test, refactoring pieces related to
the segment names, adjusting some comments and fixing some of the
logic.  Note that you introduced something incorrect at the creation
of $standby2 as you have been updating postgresql.conf.auto for
$standby1.

I have noticed an extra issue while looking at the backend pieces
today: at the beginning of the REDO loop we forgot one place where
SharedRecoveryState *has* to be updated to a correct state (around
the comment "Update pg_control to show that we are..." in xlog.c) as
the startup process may decide to switch the control file state to
DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
update the new shared flag at this early stage.  It did not matter
before because SharedRecoveryInProgress would be only "true" for both,
but that's not the case anymore as we need to make the difference
between crash recovery and archive recovery in the new flag.  There is
no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
as the initial shared memory state is RECOVERY_STATE_CRASH, but
updating the flag makes the code more consistent IMO so I updated it
anyway in the attached.

I have the feeling that I need to work a bit more on this patch, but
my impression is that we are getting to something committable here.

Thoughts?
--
Michael

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

17 April 2020, 13:33:04

On Fri, 17 Apr 2020 15:50:43 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz>
> > wrote:  
> >> Removing .done files does not matter as the segments related to them are
> >> already gone.  
> > 
> > Not necessarily. There's a time windows between the moment the archiver set
> > the .done file and when the checkpointer removes the associated WAL file.
> > So, after a PANIC because of lack of space, WAL associated with .done files
> > but not removed yet will be removed during the crash recovery.  
> 
> Not sure that it is something that matters for this thread though, so
> if necessary I think that it could be be discussed separately.

OK. However, unless I'm wrong, what I am describing as an desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the curent behavior. I vote for keeping it this way.

[...]
> > Unless I'm wrong, the empty string does not raise an error in
> > pg_stat_archiver, and I wanted to add a test on this as well.  
> 
> Exactly, it won't raise an error.  Instead I switched to use a poll
> query with pg_stat_file() and .ready files, but this has proved to
> delay the test considerably if we did not create more segments.  And
> your approach has the merit to be more simple with only two segments
> manipulated for the whole test.  So I have tried first my idea,
> noticed the mess it introduced, and just kept your approach.

Maybe we could use something more common for all plateform? Eg.:

  archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

> > Thanks for your review! Let me know if you want me to add/change/fix some
> > tests.  
> 
> Thanks, I have worked more on the test, refactoring pieces related to
> the segment names, adjusting some comments and fixing some of the
> logic.  Note that you introduced something incorrect at the creation
> of $standby2 as you have been updating postgresql.conf.auto for
> $standby1.

erf, last minute quick edit with lack of review on my side :(

> I have noticed an extra issue while looking at the backend pieces
> today: at the beginning of the REDO loop we forgot one place where
> SharedRecoveryState *has* to be updated to a correct state (around
> the comment "Update pg_control to show that we are..." in xlog.c) as
> the startup process may decide to switch the control file state to
> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
> update the new shared flag at this early stage.  It did not matter
> before because SharedRecoveryInProgress would be only "true" for both,
> but that's not the case anymore as we need to make the difference
> between crash recovery and archive recovery in the new flag.  There is
> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
> as the initial shared memory state is RECOVERY_STATE_CRASH, but
> updating the flag makes the code more consistent IMCRASHO so I updated it
> anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

18 April 2020, 09:26:11

On Fri, Apr 17, 2020 at 03:33:04PM +0200, Jehan-Guillaume de Rorthais wrote:
> On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael@paquier.xyz> wrote:
>> Not sure that it is something that matters for this thread though, so
>> if necessary I think that it could be discussed separately.
>
> OK. However, unless I'm wrong, what I am describing as a desired behavior
> is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
> v8 should return false during crash recovery no matter the archive_mode setup,
> or if we keep the current behavior. I vote for keeping it this way.

I would rather avoid that now, as we don't check explicitely for crash
recovery in this code path.  And for the purpose of this patch it is
fine to stick with the extra check on a standby with
(RECOVERY_STATE_ARCHIVE && archive_mode = always).

> Maybe we could use something more common for all plateform? Eg.:
>
>   archive_command='this command does not exist'
>
> At least, we would have the same error everywhere, as far as it could matter...

Yeah.  We could try to do with "false" as command anyway, and see what
the buildfarm thinks.  As the test is skipped on Windows, I would
assume that it does not matter much anyway.  Let's see what others
think about this piece.  I don't have plans to touch again this patch
until likely the middle of next week.

>> Thanks, I have worked more on the test, refactoring pieces related to
>> the segment names, adjusting some comments and fixing some of the
>> logic.  Note that you introduced something incorrect at the creation
>> of $standby2 as you have been updating postgresql.conf.auto for
>> $standby1.
>
> erf, last minute quick edit with lack of review on my side :(

No problem.  It happens.

>> I have noticed an extra issue while looking at the backend pieces
>> today: at the beginning of the REDO loop we forgot one place where
>> SharedRecoveryState *has* to be updated to a correct state (around
>> the comment "Update pg_control to show that we are..." in xlog.c) as
>> the startup process may decide to switch the control file state to
>> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
>> update the new shared flag at this early stage.  It did not matter
>> before because SharedRecoveryInProgress would be only "true" for both,
>> but that's not the case anymore as we need to make the difference
>> between crash recovery and archive recovery in the new flag.  There is
>> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
>> as the initial shared memory state is RECOVERY_STATE_CRASH, but
>> updating the flag makes the code more consistent IMCRASHO so I updated it
>> anyway in the attached.
>
> Grmbl...I had this logic the other way around: init with
> RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
> removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
> based on ControlFile->state.
>
> Sorry, I forgot it after discussing the init value in v5 :(

Indeed.  The extra initialization was part of v4, and got removed as
of v5.  Still, it seems to me that this part was not complete without
updating the shared memory field correctly at the beginning of the
REDO processing as the last version of the patch does.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

20 April 2020, 07:02:31

At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Apr 17, 2020 at 03:33:04PM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael@paquier.xyz> wrote:
> >> Not sure that it is something that matters for this thread though, so
> >> if necessary I think that it could be discussed separately.
> > 
> > OK. However, unless I'm wrong, what I am describing as a desired behavior
> > is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
> > v8 should return false during crash recovery no matter the archive_mode setup,
> > or if we keep the current behavior. I vote for keeping it this way.
> 
> I would rather avoid that now, as we don't check explicitely for crash
> recovery in this code path.  And for the purpose of this patch it is
> fine to stick with the extra check on a standby with
> (RECOVERY_STATE_ARCHIVE && archive_mode = always).

The commit 78ea8b5daa intends that WAL segments are properly removed
on standby with archive_mode=on by not marking .ready.  The v7
actually let such segments be marked .ready, but they are finally
removed after entering archive recovery.  It preserves the patch's
intention in that perspective.  (I'd rather prefer to distinguish
"ArchiveRecoveryRequested" somehow but it would be more complex and it
is not the agreement on this thread.)

As the result, +1 to what v7 is doing and discussing on earlier
removal of such WAL segments separately if needed.


> > Maybe we could use something more common for all plateform? Eg.:
> > 
> >   archive_command='this command does not exist'
> > 
> > At least, we would have the same error everywhere, as far as it could matter...
> 
> Yeah.  We could try to do with "false" as command anyway, and see what
> the buildfarm thinks.  As the test is skipped on Windows, I would
> assume that it does not matter much anyway.  Let's see what others
> think about this piece.  I don't have plans to touch again this patch
> until likely the middle of next week.

Couldn't we use "/" as a globally-results-in-failure command?  But
that doesn't increment failed_count.  The reason is pgarch_archiveXLog
exits with FATAL for "is a directory" error.  The comment asserts that
we exit with FATAL for SIGINT or SIGQUIT and if so it is enough to
check only exit-by-signal case.  The following fix worked.

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 01ffd6513c..def6a68063 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
          * "command not found" type of error.  If we overreact it's no big
          * deal, the postmaster will just start the archiver again.
          */
-        int            lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
+        int            lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;
 
         if (WIFEXITED(rc))
         {

I didn't tested it on Windows (I somehow broke my repo and it's too
slow to clone.) but system("/") returned 1 and I think that result
increments the counter.

> >> Thanks, I have worked more on the test, refactoring pieces related to
> >> the segment names, adjusting some comments and fixing some of the
> >> logic.  Note that you introduced something incorrect at the creation
> >> of $standby2 as you have been updating postgresql.conf.auto for
> >> $standby1.
> > 
> > erf, last minute quick edit with lack of review on my side :(
> 
> No problem.  It happens.
> 
> >> I have noticed an extra issue while looking at the backend pieces
> >> today: at the beginning of the REDO loop we forgot one place where
> >> SharedRecoveryState *has* to be updated to a correct state (around
> >> the comment "Update pg_control to show that we are..." in xlog.c) as
> >> the startup process may decide to switch the control file state to
> >> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
> >> update the new shared flag at this early stage.  It did not matter
> >> before because SharedRecoveryInProgress would be only "true" for both,
> >> but that's not the case anymore as we need to make the difference
> >> between crash recovery and archive recovery in the new flag.  There is
> >> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
> >> as the initial shared memory state is RECOVERY_STATE_CRASH, but
> >> updating the flag makes the code more consistent IMCRASHO so I updated it
> >> anyway in the attached.
> > 
> > Grmbl...I had this logic the other way around: init with
> > RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
> > removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
> > based on ControlFile->state.
> > 
> > Sorry, I forgot it after discussing the init value in v5 :(
> 
> Indeed.  The extra initialization was part of v4, and got removed as
> of v5.  Still, it seems to me that this part was not complete without
> updating the shared memory field correctly at the beginning of the
> REDO processing as the last version of the patch does.

I may not be following the discussion, but I think it is reasonable
that SharedRecoveryState is initialized as CRASH then moves to ARCHIVE
as needed and finished by NONE.  That transition also stables
RecoveryInProgress().


Other minor comments:

+    RECOVERY_STATE_NONE            /* currently in production */

I think it would be better be RECOVERY_STATE_DONE.

By the way I noticed that RecoveryState is exactly a subset of
DBState.  And changes of SharedRecoveryState happens side-by-side with
ControlFileData->state in most places.  Coundn't we just usee
ControlFile->state instead of SharedRecoveryState?


By the way I found a typo.

+# Recovery tests for the achiving with a standby partially check
s/achiving/archiving/

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

20 April 2020, 07:34:44

On Mon, Apr 20, 2020 at 04:02:31PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 18 Apr 2020 18:26:11 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> As the result, +1 to what v7 is doing and discussing on earlier
> removal of such WAL segments separately if needed.

Thanks for the extra review.

>> Yeah.  We could try to do with "false" as command anyway, and see what
>> the buildfarm thinks.  As the test is skipped on Windows, I would
>> assume that it does not matter much anyway.  Let's see what others
>> think about this piece.  I don't have plans to touch again this patch
>> until likely the middle of next week.
>
> Couldn't we use "/" as a globally-results-in-failure command?  But
> that doesn't increment failed_count.  The reason is pgarch_archiveXLog
> exits with FATAL for "is a directory" error.  The comment asserts that
> we exit with FATAL for SIGINT or SIGQUIT and if so it is enough to
> check only exit-by-signal case.  The following fix worked.

Yeah, I was working on this stuff today and I noticed this problem.  I
was just going to send an email on the matter with a more portable
patch and you also just beat me to it with this one :)

So yes, using "false" may be a bad idea because we cannot rely on the
case where the command does not exist in an environment in this test.
After more testing, I have been hit hard about the fact that the
archiver exits immediately if an archive command cannot be found
(errcode = 127), and it does not report this failure back to
pg_stat_archiver, which would cause the test to wait until the timeout
of poll_query_until() kills the test.  There is however an extra
method not mentioned yet on this thread: we know that cp/copy is
portable enough per the buildfarm, so let's use a copy command that we
know *will* fail.  A simple way of doing this is a command where the
origin file does not exist.

> --- a/src/backend/postmaster/pgarch.c
> +++ b/src/backend/postmaster/pgarch.c
> @@ -595,7 +595,7 @@ pgarch_archiveXlog(char *xlog)
>           * "command not found" type of error.  If we overreact it's no big
>           * deal, the postmaster will just start the archiver again.
>           */
> -        int            lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;
> +        int            lev = wait_result_is_any_signal(rc, false) ? FATAL : LOG;
>
>          if (WIFEXITED(rc))
>          {
>
> I didn't tested it on Windows (I somehow broke my repo and it's too
> slow to clone.) but system("/") returned 1 and I think that result
> increments the counter.

No, this would be a behavior change, which is not acceptable in my
view.  (By the way, just nuke your full repo if it does not work
anymore on Windows, this method works).

>> Indeed.  The extra initialization was part of v4, and got removed as
>> of v5.  Still, it seems to me that this part was not complete without
>> updating the shared memory field correctly at the beginning of the
>> REDO processing as the last version of the patch does.
>
> I may not be following the discussion, but I think it is reasonable
> that SharedRecoveryState is initialized as CRASH then moves to ARCHIVE
> as needed and finished by NONE.  That transition also stables
> RecoveryInProgress().

Thought as well about that over the weekend, and that's still the best
option to me.

> I think it would be better be RECOVERY_STATE_DONE.

I like this suggestion better than the original in v7.

> By the way I noticed that RecoveryState is exactly a subset of
> DBState.  And changes of SharedRecoveryState happens side-by-side with
> ControlFileData->state in most places.  Coundn't we just usee
> ControlFile->state instead of SharedRecoveryState?

I actually found confusing to use the same thing, because then the
reader would thing that SharedRecoveryState could be set to more
values but we don't want that.

> By the way I found a typo.
>
> +# Recovery tests for the achiving with a standby partially check
> s/achiving/archiving/

Thanks, fixed.

Attached is an updated patch, where I tweaked more comments.

Jehan-Guillaume, who is your colleague who found originally about this
problem?  We should credit him in the commit message.
--
Michael

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

20 April 2020, 12:22:35

On Mon, 20 Apr 2020 16:34:44 +0900
Michael Paquier <michael@paquier.xyz> wrote:
[...]
> > By the way I noticed that RecoveryState is exactly a subset of
> > DBState.  And changes of SharedRecoveryState happens side-by-side with
> > ControlFileData->state in most places.  Coundn't we just usee
> > ControlFile->state instead of SharedRecoveryState?
>
> I actually found confusing to use the same thing, because then the
> reader would thing that SharedRecoveryState could be set to more
> values but we don't want that.

I thought about this while studying various possible fix.

https://www.postgresql.org/message-id/flat/20200401181735.11100908%40firost#6192afba4e4549b8d9bac03168bad46b

The problem is that we would have to read the controldata file each time we
wonder if a segment should be archived/removed. Moreover, the controldata
file might not be in sync quickly enough with the real state for some other
code path or futur needs.

[...]
> Attached is an updated patch, where I tweaked more comments.
>
> Jehan-Guillaume, who is your colleague who found originally about this
> problem?  We should credit him in the commit message.

Indeed, Benoît Lobréau reported this behavior to me.

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

21 April 2020, 02:15:01

On Mon, Apr 20, 2020 at 02:22:35PM +0200, Jehan-Guillaume de Rorthais wrote:
> The problem is that we would have to read the controldata file each time we
> wonder if a segment should be archived/removed. Moreover, the controldata
> file might not be in sync quickly enough with the real state for some other
> code path or futur needs.

I don't think that this is what Horiguchi-san meant here.  What I got
from his previous message would be to be to copy the shared value from
the control file when necessary, and have the shared state use only a
subset of the existing values of DBState, aka:
- DB_IN_CRASH_RECOVERY
- DB_IN_ARCHIVE_RECOVERY
- DB_IN_PRODUCTION
Still, that sounds wrong to me because then somebody would be tempted
to change the shared value thinking that things like DB_SHUTDOWNING,
DB_SHUTDOWNED_* or DB_STARTUP are valid but we don't want that here.
Note that there may be a case for DB_STARTUP to be used in
XLOGShmemInit(), but I'd rather let the code use the safest default,
DB_IN_CRASH_RECOVERY to control that we won't remove .ready files by
default until the startup process sees fit to do the actual switch
depending on the checkpoint record lookup, if archive recovery was
actually requested, etc.

> Indeed, Benoît Lobréau reported this behavior to me.

Noted.  Thanks for the information.  I don't think that I have ever
met Benoît in person, do I?  Tell him that I owe him one beer or a
beverage of his choice when we meet IRL, and that he had better use
this message-id to make me keep my promise :)
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

21 April 2020, 03:09:25

At Tue, 21 Apr 2020 11:15:01 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Mon, Apr 20, 2020 at 02:22:35PM +0200, Jehan-Guillaume de Rorthais wrote:
> > The problem is that we would have to read the controldata file each time we
> > wonder if a segment should be archived/removed. Moreover, the controldata
> > file might not be in sync quickly enough with the real state for some other
> > code path or futur needs.
>
> I don't think that this is what Horiguchi-san meant here.  What I got
> from his previous message would be to be to copy the shared value from
> the control file when necessary, and have the shared state use only a
> subset of the existing values of DBState, aka:
> - DB_IN_CRASH_RECOVERY
> - DB_IN_ARCHIVE_RECOVERY
> - DB_IN_PRODUCTION

First I thought as above, but I thought that we could use
ControlFile->state itself in this case, by regarding the symbols less
than DB_IN_CRASH_RECOVERY as RECOVERY_STATE_CRASH.  I don't think
there's no problem if the update to DB_IN_ARCHIVE_RECOVERY reaches
checkpointer with some delay.

> Still, that sounds wrong to me because then somebody would be tempted
> to change the shared value thinking that things like DB_SHUTDOWNING,
> DB_SHUTDOWNED_* or DB_STARTUP are valid but we don't want that here.

That is not an issue if we just use DBState to know whether we have
started archive recovery.

> Note that there may be a case for DB_STARTUP to be used in
> XLOGShmemInit(), but I'd rather let the code use the safest default,
> DB_IN_CRASH_RECOVERY to control that we won't remove .ready files by
> default until the startup process sees fit to do the actual switch
> depending on the checkpoint record lookup, if archive recovery was
> actually requested, etc.

I'm not sure I read this correctly. But I think I agree to this.

+    if (!XLogArchivingAlways() &&
+        GetRecoveryState() == RECOVERY_STATE_ARCHIVE)

Is rewritten as

+    if (!XLogArchivingAlways() &&
+        GetDBState() > DB_IN_CRASH_RECOVERY)

FWIW, what annoyed me is there are three variables that are quite
similar but has different domains, ControlFile->state,
XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
mind there were two, but three seems a bit too many to me.

But it may be different issue.

> > Indeed, Benoît Lobréau reported this behavior to me.
>
> Noted.  Thanks for the information.  I don't think that I have ever
> met Benoît in person, do I?  Tell him that I owe him one beer or a
> beverage of his choice when we meet IRL, and that he had better use
> this message-id to make me keep my promise :)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

21 April 2020, 04:57:39

On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:
> +    if (!XLogArchivingAlways() &&
> +        GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
>
> Is rewritten as
>
> +    if (!XLogArchivingAlways() &&
> +        GetDBState() > DB_IN_CRASH_RECOVERY)
>
> FWIW, what annoyed me is there are three variables that are quite
> similar but has different domains, ControlFile->state,
> XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
> mind there were two, but three seems a bit too many to me.

That's actually the pattern I would avoid for clarity.  There is no
need to add more dependencies to the entries of DBState for the sake
of this patch, and this smells like a trap if more values are added to
it in an order that does not match what we have been assuming in the
context of this thread.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

21 April 2020, 06:08:17

At Tue, 21 Apr 2020 13:57:39 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:
> > +    if (!XLogArchivingAlways() &&
> > +        GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
> > 
> > Is rewritten as 
> > 
> > +    if (!XLogArchivingAlways() &&
> > +        GetDBState() > DB_IN_CRASH_RECOVERY)
> > 
> > FWIW, what annoyed me is there are three variables that are quite
> > similar but has different domains, ControlFile->state,
> > XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
> > mind there were two, but three seems a bit too many to me.
> 
> That's actually the pattern I would avoid for clarity.  There is no
> need to add more dependencies to the entries of DBState for the sake
> of this patch, and this smells like a trap if more values are added to
> it in an order that does not match what we have been assuming in the
> context of this thread.

Yes. Anywaay that would be another issue, if it is an issue.

I'm fine with the current state.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

21 April 2020, 22:00:22

On Tue, 21 Apr 2020 15:08:17 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Tue, 21 Apr 2020 13:57:39 +0900, Michael Paquier <michael@paquier.xyz>
> wrote in 
> > On Tue, Apr 21, 2020 at 12:09:25PM +0900, Kyotaro Horiguchi wrote:  
> > > +    if (!XLogArchivingAlways() &&
> > > +        GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
> > > 
> > > Is rewritten as 
> > > 
> > > +    if (!XLogArchivingAlways() &&
> > > +        GetDBState() > DB_IN_CRASH_RECOVERY)
> > > 
> > > FWIW, what annoyed me is there are three variables that are quite
> > > similar but has different domains, ControlFile->state,
> > > XLogCtl->SharedRecoveryState, and LocalRecoveryInProgress. I didn't
> > > mind there were two, but three seems a bit too many to me.  

In fact, my original goal [1] was to avoid adding another shared boolean related
to the same topic. So I understand your feeling.

I played with this idea again, based on your argument that there's no problem
if the update to DB_IN_ARCHIVE_RECOVERY reaches checkpointer with some delay.

The other point I feel bad with is to open and check the controlfile again and
again for each segment to archive, even on running production instance.
It's not that it would be heavy, but it feels overkill to fetch this information
that should be available more easily.

That leads me an idea where we would keep the ControlFile data up-to-date in
shared memory. There's a few duplicates between ControlFile and XLogCtl, so
maybe it could make the code a little simpler at some other places than just
fixing $SUBJECT using DBState? I'm not sure of the implications and impacts
though. This seems way bigger than the current fix and with many traps on the
way. Maybe we could discuss this in another thread if you think it deserves it?

Regards,

[1]
https://www.postgresql.org/message-id/flat/20200403182625.0fccc6fd%40firost#28a756094a4b1f3dd24927fb6311927d

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

21 April 2020, 22:41:21

Hello,

I did another round of review of v8.

- LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+ LocalRecoveryInProgress = (xlogctl->SharedRecoveryState !=
  RECOVERY_STATE_DONE);

Do we need to acquire info_lck to look at the state here, as we do in
GetRecoveryState()? Why is it missing from previous code where
SharedRecoveryInProgress was protected by info_lck as well?

Plus, the new line length overflow the 80-column, but I'm not sure where to
break this line.

+if ($Config{osname} eq 'MSWin32')
+{
+
+    # some Windows Perls at least don't like IPC::Run's start/kill_kill
regime.
+    plan skip_all => "Test fails on Windows perl";
+}

In fact, this was inherited from 011_crash_recovery.pl where I originally
added some tests. As 020_archive_status.pl doesn't use IPC::Run, the comment is
wrong. But I wonder if this whole block is really needed. Unfortunately I can't
test on MSWin32 :/

On Tue, 21 Apr 2020 11:15:01 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> > Indeed, Benoît Lobréau reported this behavior to me.
>
> Noted.  Thanks for the information.  I don't think that I have ever
> met Benoît in person, do I?

I don't think so.

> Tell him that I owe him one beer or a beverage of his choice when we meet
> IRL, and that he had better use this message-id to make me keep my promise :)

I told him (but I'm sure he was reading anyway :)).

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

21 April 2020, 23:07:52

On Wed, Apr 22, 2020 at 12:00:22AM +0200, Jehan-Guillaume de Rorthais wrote:
> That leads me an idea where we would keep the ControlFile data up-to-date in
> shared memory. There's a few duplicates between ControlFile and XLogCtl, so
> maybe it could make the code a little simpler at some other places than just
> fixing $SUBJECT using DBState? I'm not sure of the implications and impacts
> though. This seems way bigger than the current fix and with many traps on the
> way. Maybe we could discuss this in another thread if you think it deserves it?

It seems to me that this could have wider applications than just the
recovery state, no?  I would recommend to keep this discussion on a
separate thread to give more visibility to the topic.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

22 April 2020, 01:19:35

On Wed, Apr 22, 2020 at 12:41:21AM +0200, Jehan-Guillaume de Rorthais wrote:
> Do we need to acquire info_lck to look at the state here, as we do in
> GetRecoveryState()? Why is it missing from previous code where
> SharedRecoveryInProgress was protected by info_lck as well?

Please see 1a3d104.

> Plus, the new line length overflow the 80-column, but I'm not sure where to
> break this line.

pgindent has been run on v8, and it did not complain.

> In fact, this was inherited from 011_crash_recovery.pl where I originally
> added some tests. As 020_archive_status.pl doesn't use IPC::Run, the comment is
> wrong. But I wonder if this whole block is really needed. Unfortunately I can't
> test on MSWin32 :/

You are right here.  The restriction can be removed, and I have
checked that the test from v8 is able to pass on my Windows dev VM.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

22 April 2020, 07:32:23

On Wed, Apr 22, 2020 at 10:19:35AM +0900, Michael Paquier wrote:
> You are right here.  The restriction can be removed, and I have
> checked that the test from v8 is able to pass on my Windows dev VM.

Attached are versions for each branch down to 9.5.  While working on
the backpatch, I have not found major conflicts except one thing:
up to 10, Postgres does WAL segment recycling after two completed
checkpoints, and the 8th test of the script relies on the behavior of
11~ of one completed checkpoint (first .ready file present in the cold
backup but removed removed from $standby1).  I have taken the simplest
approach to fix the test by checking that the .ready file actually
exists, while the rest of the test remains the same.

It is worth noting that for 9.5 and 9.6 the test had compatibility
issues with the renaming of pg_xlog to pg_wal, including paths and
functions.  The calls to poll_query_until() also needed tweaks, but
I got the tests to work.
--
Michael

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

22 April 2020, 14:14:20

On Wed, 22 Apr 2020 10:19:35 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Wed, Apr 22, 2020 at 12:41:21AM +0200, Jehan-Guillaume de Rorthais wrote:
> > Do we need to acquire info_lck to look at the state here, as we do in
> > GetRecoveryState()? Why is it missing from previous code where
> > SharedRecoveryInProgress was protected by info_lck as well?  
> 
> Please see 1a3d104.

Understood. Interesting. Thank you.

> > Plus, the new line length overflow the 80-column, but I'm not sure where to
> > break this line.  
> 
> pgindent has been run on v8, and it did not complain.

OK.

> > In fact, this was inherited from 011_crash_recovery.pl where I originally
> > added some tests. As 020_archive_status.pl doesn't use IPC::Run, the
> > comment is wrong. But I wonder if this whole block is really needed.
> > Unfortunately I can't test on MSWin32 :/  
> 
> You are right here.  The restriction can be removed, and I have
> checked that the test from v8 is able to pass on my Windows dev VM.

Thanks!

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

22 April 2020, 15:58:24

On Wed, 22 Apr 2020 16:32:23 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Wed, Apr 22, 2020 at 10:19:35AM +0900, Michael Paquier wrote:
> > You are right here.  The restriction can be removed, and I have
> > checked that the test from v8 is able to pass on my Windows dev VM.  
> 
> Attached are versions for each branch down to 9.5.  While working on
> the backpatch, I have not found major conflicts except one thing:
> up to 10, Postgres does WAL segment recycling after two completed
> checkpoints, and the 8th test of the script relies on the behavior of
> 11~ of one completed checkpoint (first .ready file present in the cold
> backup but removed removed from $standby1).  I have taken the simplest
> approach to fix the test by checking that the .ready file actually
> exists, while the rest of the test remains the same.

This test seems useless to me. It should either be removed or patched to test
the signal has been removed after a second restartpoint.

Please, find in attachment a patch for 9.6 implementing this. If it seems
reasonable to you, I can create the backpatch to 9.5.

> It is worth noting that for 9.5 and 9.6 the test had compatibility
> issues with the renaming of pg_xlog to pg_wal, including paths and
> functions.  The calls to poll_query_until() also needed tweaks, but
> I got the tests to work.

Thanks for the backpatching work!

Regards,

Attachment

0001-v9-Fix-WAL-retention-during-production-crash-recovery.patch

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

22 April 2020, 16:17:17

On Wed, 22 Apr 2020 17:58:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> On Wed, 22 Apr 2020 16:32:23 +0900
> Michael Paquier <michael@paquier.xyz> wrote:
> 
>  [...]  
>  [...]  
>  [...]  
> 
> This test seems useless to me. It should either be removed or patched to test
> the signal has been removed after a second restartpoint.
> 
> Please, find in attachment a patch for 9.6 implementing this. If it seems
> reasonable to you, I can create the backpatch to 9.5.

I found an extra useless line of code in v9 patch. Please, find in
attachment v10. Sorry for this.

Regards,

Attachment

0001-v10-Fix-WAL-retention-during-production-crash-recovery.patch

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

22 April 2020, 23:46:18

On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais wrote:
> I found an extra useless line of code in v9 patch. Please, find in
> attachment v10. Sorry for this.

Thanks for helping here, your changes make sense.  This looks mostly
fine to me except that part:
+$standby1->poll_query_until('postgres',
+   qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
+  or die "Timed out while waiting for xlog replay";
Here we should check if $primary_lsn is at least
pg_last_xlog_replay_location().  Checking for an equality may stuck
the test if more WAL gets replayed.  For example you could have a
concurrent autovacuum generating WAL.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

23 April 2020, 05:05:46

FWIW, the test for 10- looks fine, but I have one qustion.

+     'archive success reported in pg_stat_archiver for WAL segment $segment_name_

This seems intending to show an actual segment name in the message,
but it is really shown as "... WAL segment $segment_name_1". Is that
intended?

At Thu, 23 Apr 2020 08:46:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais wrote:
> > I found an extra useless line of code in v9 patch. Please, find in
> > attachment v10. Sorry for this.
> 
> Thanks for helping here, your changes make sense.  This looks mostly
> fine to me except that part:
> +$standby1->poll_query_until('postgres',
> +   qq{ SELECT pg_xlog_location_diff('$primary_lsn', pg_last_xlog_replay_location()) = 0 })
> +  or die "Timed out while waiting for xlog replay";
> Here we should check if $primary_lsn is at least
> pg_last_xlog_replay_location().  Checking for an equality may stuck
> the test if more WAL gets replayed.  For example you could have a
> concurrent autovacuum generating WAL.

Autovacuum is turned off in this case, but anyway other kinds of WAL
records can be generated.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

23 April 2020, 16:59:53

On Thu, 23 Apr 2020 14:05:46 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> FWIW, the test for 10- looks fine, but I have one qustion.
> 
> +     'archive success reported in pg_stat_archiver for WAL segment
> $segment_name_
> 
> This seems intending to show an actual segment name in the message,
> but it is really shown as "... WAL segment $segment_name_1". Is that
> intended?

Good catch. Fixed in v11. Thank you.

> At Thu, 23 Apr 2020 08:46:18 +0900, Michael Paquier <michael@paquier.xyz>
> wrote in 
> > On Wed, Apr 22, 2020 at 06:17:17PM +0200, Jehan-Guillaume de Rorthais
> > wrote:  
> > > I found an extra useless line of code in v9 patch. Please, find in
> > > attachment v10. Sorry for this.  
> > 
> > Thanks for helping here, your changes make sense.  This looks mostly
> > fine to me except that part:
> > +$standby1->poll_query_until('postgres',
> > +   qq{ SELECT pg_xlog_location_diff('$primary_lsn',
> > pg_last_xlog_replay_location()) = 0 })
> > +  or die "Timed out while waiting for xlog replay";
> > Here we should check if $primary_lsn is at least
> > pg_last_xlog_replay_location().  Checking for an equality may stuck
> > the test if more WAL gets replayed.  For example you could have a
> > concurrent autovacuum generating WAL.  
> 
> Autovacuum is turned off in this case, but anyway other kinds of WAL
> records can be generated.

make sense. Fixed in v11.

Please, find in v11 for version 9.5, 9.6 and 10.

Regards,

Attachment

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

23 April 2020, 23:54:02

On Thu, Apr 23, 2020 at 06:59:53PM +0200, Jehan-Guillaume de Rorthais wrote:
> Please, find in v11 for version 9.5, 9.6 and 10.

I have worked more on that using your v11, tweaked few comments in the
new test parts for 9.5~10, and applied the whole.  Thanks all for your
work.  I am keeping now an eye on the buildfarm.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Andres Freund

Date:

24 April 2020, 01:48:56

Hi,

On 2020-04-24 08:54:02 +0900, Michael Paquier wrote:
> On Thu, Apr 23, 2020 at 06:59:53PM +0200, Jehan-Guillaume de Rorthais wrote:
> > Please, find in v11 for version 9.5, 9.6 and 10.
>
> I have worked more on that using your v11, tweaked few comments in the
> new test parts for 9.5~10, and applied the whole.  Thanks all for your
> work.  I am keeping now an eye on the buildfarm.

I am confused by this commit. You added shared memory to differentiate
between crash recovery and standby mode/archive recovery, correct? You
write:

>    This commit fixes the regression by tracking in shared memory if a live
>    cluster is either in crash recovery or archive recovery as the handling
>    of WAL segments ready to be archived is different in both cases (those
>    WAL segments should not be removed during crash recovery), and by using
>    this new shared memory state to decide if a segment can be recycled or
>    not.

But don't we pretty much know this already from the state of the system?
During crash recovery there's nothing running RemoveOldXLogFiles() but
the startup process. Right? And in DB_IN_ARCHIVE_RECOVERY it should only
happen as part of restartpoints (i.e. the checkpointer).

Did you add the new shared state to avoid deducing things from the
"environment"? If so, it should really be mentioned in the commit
message & code. Because:

>    Previously, it was not possible to know if a cluster was in crash
>    recovery or archive recovery as the shared state was able to track only
>    if recovery was happening or not, leading to the problem.

really doesn't make that obvious.

Greetings,

Andres Freund

Re: [BUG] non archived WAL removed during production crash recovery

From

Tom Lane

Date:

24 April 2020, 02:21:15

Michael Paquier <michael@paquier.xyz> writes:
> I have worked more on that using your v11, tweaked few comments in the
> new test parts for 9.5~10, and applied the whole.  Thanks all for your
> work.  I am keeping now an eye on the buildfarm.

Looks like the news is not good :-(

I see that my own florican is one of the failing critters, though
it failed only on HEAD which seems odd.  Any suggestions what to
look for?

            regards, tom lane

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

24 April 2020, 03:43:51

On Thu, Apr 23, 2020 at 10:21:15PM -0400, Tom Lane wrote:
> Looks like the news is not good :-(

Yes, I was looking at that for the last couple of hours, and just
pushed something to put back the buildfarm to a green state for now
(based on the first results things seem stable now) by removing the
defective subset of tests.

> I see that my own florican is one of the failing critters, though
> it failed only on HEAD which seems odd.  Any suggestions what to
> look for?

The issue comes from the parts of the test where we expect some .ready
files to exist (or not) after triggering a restartpoint to force some
segments to be recycled.  And looking more at it, I suspect that the
issue is actually that we don't make sure in the test that the
standbys started have replayed up to the segment switch record
triggered on the primary (the one within generate_series(10,20)), and
then the follow-up restart point does not actually recycle the
segments we expect to recycle.  That's more likely going to be a
problem on slower machines as the window gets wider between the moment
the standbys reach their consistency point and the moment the switch
record is replayed.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

24 April 2020, 07:05:10

On Thu, Apr 23, 2020 at 06:48:56PM -0700, Andres Freund wrote:
> But don't we pretty much know this already from the state of the system?
> During crash recovery there's nothing running RemoveOldXLogFiles() but
> the startup process. Right? And in DB_IN_ARCHIVE_RECOVERY it should only
> happen as part of restartpoints (i.e. the checkpointer).
>
> Did you add the new shared state to avoid deducing things from the
> "environment"? If so, it should really be mentioned in the commit
> message & code. Because:

Hmm.  Sorry, I see your point.  The key of the logic here is from
XLogArchiveCheckDone() which could be called from other processes than
the startup process.  There is one code path at the end of a base
backup for backup history files where not using a shared state would
be a problem.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

24 April 2020, 13:03:00

On Fri, 24 Apr 2020 12:43:51 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Apr 23, 2020 at 10:21:15PM -0400, Tom Lane wrote:
> > Looks like the news is not good :-(  
> 
> Yes, I was looking at that for the last couple of hours, and just
> pushed something to put back the buildfarm to a green state for now
> (based on the first results things seem stable now) by removing the
> defective subset of tests.
> 
> > I see that my own florican is one of the failing critters, though
> > it failed only on HEAD which seems odd.  Any suggestions what to
> > look for?  
> 
> The issue comes from the parts of the test where we expect some .ready
> files to exist (or not) after triggering a restartpoint to force some
> segments to be recycled.  And looking more at it, I suspect that the
> issue is actually that we don't make sure in the test that the
> standbys started have replayed up to the segment switch record
> triggered on the primary (the one within generate_series(10,20)), and
> then the follow-up restart point does not actually recycle the
> segments we expect to recycle.  That's more likely going to be a
> problem on slower machines as the window gets wider between the moment
> the standbys reach their consistency point and the moment the switch
> record is replayed.

Indeed.

In regard with your fix, as we don't know if the standby caught up with the
latest available record, there's really no point to keep this test either:

  # Recovery with archive_mode=on should not create .ready files.
  # Note that this segment did not exist in the backup.
  ok( !-f "$standby1_data/$segment_path_2_ready",
     ".ready file for WAL segment $segment_name_2 not created on standby
      when archive_mode=on on standby" );

I agree the three tests could be removed as they were not covering the bug we
were chasing. However, they might still be useful to detect futur non expected
behavior changes. If you agree with this, please, find in attachment a patch
proposal against HEAD that recreate these three tests **after** a waiting loop
on both standby1 and standby2. This waiting loop is inspired from the tests in
9.5 -> 10.

Regards,

Attachment

wait-for-wal-replay.patch

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

27 April 2020, 07:49:45

On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:
> I agree the three tests could be removed as they were not covering the bug we
> were chasing. However, they might still be useful to detect futur non expected
> behavior changes. If you agree with this, please, find in attachment a patch
> proposal against HEAD that recreate these three tests **after** a waiting loop
> on both standby1 and standby2. This waiting loop is inspired from the tests in
> 9.5 -> 10.

FWIW, I would prefer keeping all three tests as well.

So..  I have spent more time on this problem and mereswin here is a
very good sample because it failed all three tests:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53

For standby2, we get this failure:
ok 11 - .ready file for WAL segment 000000010000000000000001 existing
  in backup is kept with archive_mode=always on standby
not ok 12 - .ready file for WAL segment 000000010000000000000002
  created with archive_mode=always on standby

Then, looking at 020_archive_status_standby2.log, we have the
following logs:
2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
statement: CHECKPOINT
[...]
2020-04-24 02:08:32.303 PDT [9821:7] LOG:  restored log file
"000000010000000000000002" from archive

In this case, the test forced a checkpoint to test the segment
recycling *before* the extra restored segment we'd like to work on was
actually restored.  So it looks like my initial feeling about the
timing issue was right, and I am also able to reproduce the original
set of failures by adding a manual sleep to delay restores of
segments, like that for example:
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
    if (recoveryRestoreCommand == NULL ||
    strcmp(recoveryRestoreCommand, "") == 0)
            goto not_available;

+   pg_usleep(10 * 1000000); /* 10s */
+
    /*

With your patch the problem does not show up anymore even with the
delay added, so I would like to apply what you have sent and add back
those tests.  For now, I would just patch HEAD though as that's not
worth the risk of destabilizing stable branches in the buildfarm.

>  $primary->poll_query_until('postgres',
>      q{SELECT archived_count FROM pg_stat_archiver}, '1')
> -  or die "Timed out while waiting for archiving to finish";
> +    or die "Timed out while waiting for archiving to finish";

Some noise in the patch.  This may have come from some unfinished
business with pgindent.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Kyotaro Horiguchi

Date:

27 April 2020, 09:21:07

At Mon, 27 Apr 2020 16:49:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:
> > I agree the three tests could be removed as they were not covering the bug we
> > were chasing. However, they might still be useful to detect futur non expected
> > behavior changes. If you agree with this, please, find in attachment a patch
> > proposal against HEAD that recreate these three tests **after** a waiting loop
> > on both standby1 and standby2. This waiting loop is inspired from the tests in
> > 9.5 -> 10.
> 
> FWIW, I would prefer keeping all three tests as well.
> 
> So..  I have spent more time on this problem and mereswin here is a
> very good sample because it failed all three tests:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53
> 
> For standby2, we get this failure:
> ok 11 - .ready file for WAL segment 000000010000000000000001 existing
>   in backup is kept with archive_mode=always on standby
> not ok 12 - .ready file for WAL segment 000000010000000000000002
>   created with archive_mode=always on standby
> 
> Then, looking at 020_archive_status_standby2.log, we have the
> following logs:
> 2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
> statement: CHECKPOINT
> [...]
> 2020-04-24 02:08:32.303 PDT [9821:7] LOG:  restored log file
> "000000010000000000000002" from archive
> 
> In this case, the test forced a checkpoint to test the segment
> recycling *before* the extra restored segment we'd like to work on was
> actually restored.  So it looks like my initial feeling about the
> timing issue was right, and I am also able to reproduce the original
> set of failures by adding a manual sleep to delay restores of
> segments, like that for example:
> --- a/src/backend/access/transam/xlogarchive.c
> +++ b/src/backend/access/transam/xlogarchive.c
> @@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
>     if (recoveryRestoreCommand == NULL ||
>     strcmp(recoveryRestoreCommand, "") == 0)
>             goto not_available;
> 
> +   pg_usleep(10 * 1000000); /* 10s */
> +
>     /*
> 
> With your patch the problem does not show up anymore even with the
> delay added, so I would like to apply what you have sent and add back
> those tests.  For now, I would just patch HEAD though as that's not
> worth the risk of destabilizing stable branches in the buildfarm.

Agreed to the diagnosis and the fix. The fix reliably cause a restart
point then the restart point manipulats the status files the right way
before the CHECKPOINT command resturns, in the both cases.

If I would add something to the fix, the following line may need a
comment.

+# Wait for the checkpoint record is replayed so that the following
+# CHECKPOINT causes a restart point reliably.
|+$standby1->poll_query_until('postgres',
|+    qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 }

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

27 April 2020, 10:42:47

On Mon, 27 Apr 2020 18:21:07 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 27 Apr 2020 16:49:45 +0900, Michael Paquier <michael@paquier.xyz>
> wrote in 
[...]
> > With your patch the problem does not show up anymore even with the
> > delay added, so I would like to apply what you have sent and add back
> > those tests.  For now, I would just patch HEAD though as that's not
> > worth the risk of destabilizing stable branches in the buildfarm.

Good for me. Thanks!

> Agreed to the diagnosis and the fix. The fix reliably cause a restart
> point then the restart point manipulats the status files the right way
> before the CHECKPOINT command resturns, in the both cases.
> 
> If I would add something to the fix, the following line may need a
> comment.
> 
> +# Wait for the checkpoint record is replayed so that the following
> +# CHECKPOINT causes a restart point reliably.
> |+$standby1->poll_query_until('postgres',
> |+    qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(),
> '$primary_lsn') >= 0 }

Agree.

Regards,

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

27 April 2020, 23:01:38

On Mon, Apr 27, 2020 at 06:21:07PM +0900, Kyotaro Horiguchi wrote:
> Agreed to the diagnosis and the fix. The fix reliably cause a restart
> point then the restart point manipulats the status files the right way
> before the CHECKPOINT command resturns, in the both cases.

Thanks for checking!

> If I would add something to the fix, the following line may need a
> comment.
>
> +# Wait for the checkpoint record is replayed so that the following
> +# CHECKPOINT causes a restart point reliably.
> |+$standby1->poll_query_until('postgres',
> |+    qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 }

Makes sense, added a comment and applied to HEAD.  I have also
improved the comment around the split with pg_switch_wal(), and
actually simplified the test to use as wait point the return value
from the function.
--
Michael

Attachment

signature.asc

Re: [BUG] non archived WAL removed during production crash recovery

From

Jehan-Guillaume de Rorthais

Date:

27 April 2020, 23:58:47

On Tue, 28 Apr 2020 08:01:38 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Mon, Apr 27, 2020 at 06:21:07PM +0900, Kyotaro Horiguchi wrote:
> > Agreed to the diagnosis and the fix. The fix reliably cause a restart
> > point then the restart point manipulats the status files the right way
> > before the CHECKPOINT command resturns, in the both cases.  
> 
> Thanks for checking!
> 
> > If I would add something to the fix, the following line may need a
> > comment.
> > 
> > +# Wait for the checkpoint record is replayed so that the following
> > +# CHECKPOINT causes a restart point reliably.
> > |+$standby1->poll_query_until('postgres',
> > |+    qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(),
> > '$primary_lsn') >= 0 }  
> 
> Makes sense, added a comment and applied to HEAD.  I have also
> improved the comment around the split with pg_switch_wal(), and
> actually simplified the test to use as wait point the return value
> from the function.

Thank you guys for your help, reviews and commits!

Re: [BUG] non archived WAL removed during production crash recovery

From

Michael Paquier

Date:

28 April 2020, 00:41:06

On Tue, Apr 28, 2020 at 01:58:47AM +0200, Jehan-Guillaume de Rorthais wrote:
> Thank you guys for your help, reviews and commits!

The buildfarm does not complain after the latest commit, meaning that
we are good here.  Thanks for your efforts.
--
Michael

Attachment

signature.asc