Re: recovery modules - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: recovery modules
Date
Msg-id 20230123214428.GA572995@nathanxps13
Whole thread Raw
In response to Re: recovery modules  (Michael Paquier <michael@paquier.xyz>)
Responses Re: recovery modules
List pgsql-hackers
On Mon, Jan 23, 2023 at 11:44:57AM +0900, Michael Paquier wrote:
> Thanks for the rebase.

Thanks for the detailed review.

> The final state of the documentation is as follows:
> 51. Archive and Recovery Modules
>     51.1. Archive Module Initialization Functions
>     51.2. Archive Module Callbacks
>     51.3. Recovery Module Initialization Functions
>     51.4. Recovery Module Callbacks
> 
> I am not completely sure whether this grouping is the best thing to
> do.  Wouldn't it be better to move that into two different
> sub-sections instead?  One layout suggestion:
> 51. WAL Modules
>   51.1. Archive Modules
>     51.1.1. Archive Module Initialization Functions
>     51.1.2. Archive Module Callbacks
>   51.2. Recovery Modules
>     51.2.1. Recovery Module Initialization Functions
>     51.2.2. Recovery Module Callbacks
> 
> Putting both of them in the same section sounds like a good idea per
> the symmetry that one would likely have between the code paths of
> archiving and recovery, so as they share some common knowledge.
> 
> This kinds of comes back to the previous suggestion to rename
> basic_archive to something like wal_modules, that covers both
> archiving and recovery.  I does not seem like this would overlap with
> RMGRs, which is much more internal anyway.

I updated the documentation as you suggested, and I renamed basic_archive
to basic_wal_module.

>      (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> -     errmsg("must specify restore_command when standby mode is not enabled")));
> +     errmsg("must specify restore_command or a restore_library that defines "
> +            "a restore callback when standby mode is not enabled")));
> This is long.  Shouldn't this be split into an errdetail() to list all
> the options at hand?

Should the errmsg() be something like "recovery parameters misconfigured"?

> -   if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
> -       ereport(ERROR,
> -               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> -                errmsg("both archive_command and archive_library set"),
> -                errdetail("Only one of archive_command, archive_library may be set.")));
> +   CheckMutuallyExclusiveGUCs(XLogArchiveLibrary, "archive_library",
> +                              XLogArchiveCommand, "archive_command");
> 
> The introduction of this routine could be a patch on its own, as it
> impacts the archiving path.

I moved this to a separate patch.

> - Startup reloading, as of StartupRereadConfig().  This code could
> involve a WAL receiver restart depending on a change in the slot
> change or in primary_conninfo, and force at the same time an ERROR
> because of conflicting recovery library and command configuration.
> This one should be safe because pendingWalRcvRestart would just be
> considered later on by the startup process itself while waiting for
> WAL to become available.  Still this could deserve a comment?  Even if
> there is a misconfiguration, a reload on a standby would enforce a
> FATAL in the startup process, taking down the whole server.

Do you think the parameter checks should go before the WAL receiver restart
logic?

> - Checkpointer initialization, as of CheckpointerMain().  A
> configuration failure in this code path, aka server startup, causes
> the server to loop infinitely on FATAL with the misconfiguration
> showing up all the time..  This is a problem.

Perhaps this is a reason to move the parameter check in CheckpointerMain()
to after the sigsetjmp() block.  This should avoid full server restarts.
Only the checkpointer process would loop with the ERROR.

> - Last comes the checkpointer GUC reloading, as of
> HandleCheckpointerInterrupts(), with a second problem.  This
> introduces a failure path where ConfigReloadPending is processed at
> the same time as ShutdownRequestPending based on the way it is coded,
> interacting with what would be a normal shutdown in some cases?  And
> actually, if you enforce a misconfiguration on reload, the
> checkpointer reports an error but it does not enforce a process
> restart, hence it keeps around the new, incorrect, configuration while
> waiting for a new checkpoint to happen once restore_library and
> archive_cleanup_command are set.  This could lead to surprises, IMO.
> Upgrading to a FATAL in this code path triggers an infinite loop, like
> the startup path.

If we move the parameter check in CheckpointerMain() as described above,
the checkpointer should be unable to proceed with an incorrect
configuration.  For the normal shutdown part, do you think the
ShutdownRequestPending block should be moved to before the
ConfigReloadPending block in HandleCheckpointerInterrupts()?

> If the archive_cleanup_command ballback of a restore library triggers
> a FATAL, it seems to me that it would continuously trigger a server
> restart, actually.  Perhaps that's something to document, in
> comparison to the safe fallbacks of the shell command where we don't
> force an ERROR to give priority to the stability of the checkpointer.

I'm not sure it's worth documenting that ereport(FATAL, ...) in the
checkpointer process will cause a server restart.  In most cases, an
extension author would use ERROR, which, if we make the aforementioned
changes, would at most cause the checkpointer to effectively restart.  This
is similar to archive modules where an ERROR causes only the archiver
process to restart.  Also, we document that recovery libraries are loaded
in the startup and checkpointer processes, so IMO it should be relatively
apparent that doing something like FATAL or proc_exit() is bad.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: Non-decimal integer literals
Next
From: Corey Huinker
Date:
Subject: Re: Add SHELL_EXIT_CODE to psql