Re: [BUG] non archived WAL removed during production crash recovery - Mailing list pgsql-bugs
From | Jehan-Guillaume de Rorthais |
---|---|
Subject | Re: [BUG] non archived WAL removed during production crash recovery |
Date | |
Msg-id | 20200417153304.01a226cc@firost Whole thread Raw |
In response to | Re: [BUG] non archived WAL removed during production crash recovery (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: [BUG] non archived WAL removed during production crash recovery
|
List | pgsql-bugs |
On Fri, 17 Apr 2020 15:50:43 +0900 Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Apr 17, 2020 at 12:07:39AM +0200, Jehan-Guillaume de Rorthais wrote: > > On Thu, 16 Apr 2020 17:11:00 +0900 Michael Paquier <michael@paquier.xyz> > > wrote: > >> Removing .done files does not matter as the segments related to them are > >> already gone. > > > > Not necessarily. There's a time windows between the moment the archiver set > > the .done file and when the checkpointer removes the associated WAL file. > > So, after a PANIC because of lack of space, WAL associated with .done files > > but not removed yet will be removed during the crash recovery. > > Not sure that it is something that matters for this thread though, so > if necessary I think that it could be be discussed separately. OK. However, unless I'm wrong, what I am describing as an desired behavior is the current behavior of XLogArchiveCheckDone. So, we might want to decide if v8 should return false during crash recovery no matter the archive_mode setup, or if we keep the curent behavior. I vote for keeping it this way. [...] > > Unless I'm wrong, the empty string does not raise an error in > > pg_stat_archiver, and I wanted to add a test on this as well. > > Exactly, it won't raise an error. Instead I switched to use a poll > query with pg_stat_file() and .ready files, but this has proved to > delay the test considerably if we did not create more segments. And > your approach has the merit to be more simple with only two segments > manipulated for the whole test. So I have tried first my idea, > noticed the mess it introduced, and just kept your approach. Maybe we could use something more common for all plateform? Eg.: archive_command='this command does not exist' At least, we would have the same error everywhere, as far as it could matter... > > Thanks for your review! Let me know if you want me to add/change/fix some > > tests. > > Thanks, I have worked more on the test, refactoring pieces related to > the segment names, adjusting some comments and fixing some of the > logic. Note that you introduced something incorrect at the creation > of $standby2 as you have been updating postgresql.conf.auto for > $standby1. erf, last minute quick edit with lack of review on my side :( > I have noticed an extra issue while looking at the backend pieces > today: at the beginning of the REDO loop we forgot one place where > SharedRecoveryState *has* to be updated to a correct state (around > the comment "Update pg_control to show that we are..." in xlog.c) as > the startup process may decide to switch the control file state to > DB_IN_ARCHIVE_RECOVERY or DB_IN_CRASH_RECOVERY, but we forgot to > update the new shared flag at this early stage. It did not matter > before because SharedRecoveryInProgress would be only "true" for both, > but that's not the case anymore as we need to make the difference > between crash recovery and archive recovery in the new flag. There is > no actual need to update SharedRecoveryState to RECOVERY_STATE_CRASH > as the initial shared memory state is RECOVERY_STATE_CRASH, but > updating the flag makes the code more consistent IMCRASHO so I updated it > anyway in the attached. Grmbl...I had this logic the other way around: init with RECOVERY_STATE_RECOVERY and set to CRASH in this exact if/then/else block... I removed it in v4 when setting XLogCtl->SharedRecoveryState to RECOVERY or CRASH based on ControlFile->state. Sorry, I forgot it after discussing the init value in v5 :( Regards,
pgsql-bugs by date: