Re: [BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs

From Jehan-Guillaume de Rorthais
Subject Re: [BUG] non archived WAL removed during production crash recovery
Date
Msg-id 20200417153304.01a226cc@firost
Whole thread Raw
In response to Re: [BUG] non archived WAL removed during production crash recovery  (Michael Paquier <michael@paquier.xyz>)
Responses Re: [BUG] non archived WAL removed during production crash recovery
List pgsql-bugs
On Fri, 17 Apr 2020 15:50:43 +0900
Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote:
> > On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz>
> > wrote:  
> >> Removing .done files does not matter as the segments related to them are
> >> already gone.  
> > 
> > Not necessarily. There's a time windows between the moment the archiver set
> > the .done file and when the checkpointer removes the associated WAL file.
> > So, after a PANIC because of lack of space, WAL associated with .done files
> > but not removed yet will be removed during the crash recovery.  
> 
> Not sure that it is something that matters for this thread though, so
> if necessary I think that it could be be discussed separately.

OK. However, unless I'm wrong, what I am describing as an desired behavior
is the current behavior of XLogArchiveCheckDone. So, we might want to decide if
v8 should return false during crash recovery no matter the archive_mode setup,
or if we keep the curent behavior. I vote for keeping it this way.

[...]
> > Unless I'm wrong, the empty string does not raise an error in
> > pg_stat_archiver, and I wanted to add a test on this as well.  
> 
> Exactly, it won't raise an error.  Instead I switched to use a poll
> query with pg_stat_file() and .ready files, but this has proved to
> delay the test considerably if we did not create more segments.  And
> your approach has the merit to be more simple with only two segments
> manipulated for the whole test.  So I have tried first my idea,
> noticed the mess it introduced, and just kept your approach.

Maybe we could use something more common for all plateform? Eg.:

  archive_command='this command does not exist'

At least, we would have the same error everywhere, as far as it could matter...

> > Thanks for your review! Let me know if you want me to add/change/fix some
> > tests.  
> 
> Thanks, I have worked more on the test, refactoring pieces related to
> the segment names, adjusting some comments and fixing some of the
> logic.  Note that you introduced something incorrect at the creation
> of $standby2 as you have been updating postgresql.conf.auto for
> $standby1.

erf, last minute quick edit with lack of review on my side :(

> I have noticed an extra issue while looking at the backend pieces
> today: at the beginning of the REDO loop we forgot one place where
> SharedRecoveryState *has* to be updated to a correct state (around
> the comment "Update pg_control to show that we are..." in xlog.c) as
> the startup process may decide to switch the control file state to
> DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to
> update the new shared flag at this early stage.  It did not matter
> before because SharedRecoveryInProgress would be only "true" for both,
> but that's not the case anymore as we need to make the difference
> between crash recovery and archive recovery in the new flag.  There is
> no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH
> as the initial shared memory state is RECOVERY_STATE_CRASH, but
> updating the flag makes the code more consistent IMCRASHO so I updated it
> anyway in the attached.

Grmbl...I had this logic the other way around: init with
RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I
removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH
based on ControlFile->state.

Sorry, I forgot it after discussing the init value in v5 :(

Regards,



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #16374: I can't directly change owner from my created database to my created user.
Next
From: Tom Lane
Date:
Subject: Re: BUG #16373: Behavior of Temporary table creation