Re: recovery modules - Mailing list pgsql-hackers
From | Nathan Bossart |
---|---|
Subject | Re: recovery modules |
Date | |
Msg-id | 20230123214428.GA572995@nathanxps13 Whole thread Raw |
In response to | Re: recovery modules (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: recovery modules
|
List | pgsql-hackers |
On Mon, Jan 23, 2023 at 11:44:57AM +0900, Michael Paquier wrote: > Thanks for the rebase. Thanks for the detailed review. > The final state of the documentation is as follows: > 51. Archive and Recovery Modules > 51.1. Archive Module Initialization Functions > 51.2. Archive Module Callbacks > 51.3. Recovery Module Initialization Functions > 51.4. Recovery Module Callbacks > > I am not completely sure whether this grouping is the best thing to > do. Wouldn't it be better to move that into two different > sub-sections instead? One layout suggestion: > 51. WAL Modules > 51.1. Archive Modules > 51.1.1. Archive Module Initialization Functions > 51.1.2. Archive Module Callbacks > 51.2. Recovery Modules > 51.2.1. Recovery Module Initialization Functions > 51.2.2. Recovery Module Callbacks > > Putting both of them in the same section sounds like a good idea per > the symmetry that one would likely have between the code paths of > archiving and recovery, so as they share some common knowledge. > > This kinds of comes back to the previous suggestion to rename > basic_archive to something like wal_modules, that covers both > archiving and recovery. I does not seem like this would overlap with > RMGRs, which is much more internal anyway. I updated the documentation as you suggested, and I renamed basic_archive to basic_wal_module. > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("must specify restore_command when standby mode is not enabled"))); > + errmsg("must specify restore_command or a restore_library that defines " > + "a restore callback when standby mode is not enabled"))); > This is long. Shouldn't this be split into an errdetail() to list all > the options at hand? Should the errmsg() be something like "recovery parameters misconfigured"? > - if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0') > - ereport(ERROR, > - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("both archive_command and archive_library set"), > - errdetail("Only one of archive_command, archive_library may be set."))); > + CheckMutuallyExclusiveGUCs(XLogArchiveLibrary, "archive_library", > + XLogArchiveCommand, "archive_command"); > > The introduction of this routine could be a patch on its own, as it > impacts the archiving path. I moved this to a separate patch. > - Startup reloading, as of StartupRereadConfig(). This code could > involve a WAL receiver restart depending on a change in the slot > change or in primary_conninfo, and force at the same time an ERROR > because of conflicting recovery library and command configuration. > This one should be safe because pendingWalRcvRestart would just be > considered later on by the startup process itself while waiting for > WAL to become available. Still this could deserve a comment? Even if > there is a misconfiguration, a reload on a standby would enforce a > FATAL in the startup process, taking down the whole server. Do you think the parameter checks should go before the WAL receiver restart logic? > - Checkpointer initialization, as of CheckpointerMain(). A > configuration failure in this code path, aka server startup, causes > the server to loop infinitely on FATAL with the misconfiguration > showing up all the time.. This is a problem. Perhaps this is a reason to move the parameter check in CheckpointerMain() to after the sigsetjmp() block. This should avoid full server restarts. Only the checkpointer process would loop with the ERROR. > - Last comes the checkpointer GUC reloading, as of > HandleCheckpointerInterrupts(), with a second problem. This > introduces a failure path where ConfigReloadPending is processed at > the same time as ShutdownRequestPending based on the way it is coded, > interacting with what would be a normal shutdown in some cases? And > actually, if you enforce a misconfiguration on reload, the > checkpointer reports an error but it does not enforce a process > restart, hence it keeps around the new, incorrect, configuration while > waiting for a new checkpoint to happen once restore_library and > archive_cleanup_command are set. This could lead to surprises, IMO. > Upgrading to a FATAL in this code path triggers an infinite loop, like > the startup path. If we move the parameter check in CheckpointerMain() as described above, the checkpointer should be unable to proceed with an incorrect configuration. For the normal shutdown part, do you think the ShutdownRequestPending block should be moved to before the ConfigReloadPending block in HandleCheckpointerInterrupts()? > If the archive_cleanup_command ballback of a restore library triggers > a FATAL, it seems to me that it would continuously trigger a server > restart, actually. Perhaps that's something to document, in > comparison to the safe fallbacks of the shell command where we don't > force an ERROR to give priority to the stability of the checkpointer. I'm not sure it's worth documenting that ereport(FATAL, ...) in the checkpointer process will cause a server restart. In most cases, an extension author would use ERROR, which, if we make the aforementioned changes, would at most cause the checkpointer to effectively restart. This is similar to archive modules where an ERROR causes only the archiver process to restart. Also, we document that recovery libraries are loaded in the startup and checkpointer processes, so IMO it should be relatively apparent that doing something like FATAL or proc_exit() is bad. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: