Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Weird failure with latches in curculio on v15
Date
Msg-id Y979NR3U5VnWrTwB@paquier.xyz
Whole thread Raw
In response to Re: Weird failure with latches in curculio on v15  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Weird failure with latches in curculio on v15
Re: Weird failure with latches in curculio on v15
List pgsql-hackers
On Sat, Feb 04, 2023 at 10:03:54AM -0800, Nathan Bossart wrote:
> Okay.  Michael, why don't we revert the shell_restore stuff for now?  Once
> the archive modules interface changes and the fix for this
> SIGTERM-during-system() problem are in, I will work through this feedback
> and give recovery modules another try.  I'm still hoping to have recovery
> modules ready in time for the v16 feature freeze.

Yes, at this stage a revert of the refactoring with shell_restore.c is
the best path forward.

From the discussion, I got the following things on top of my mind, for
reference:
- Should we include archive_cleanup_command into the recovery modules
at all?  We've discussed offloading that from the checkpointer, and it
makes the failure handling trickier when it comes to unexpected GUC
configurations, for one.  The same may actually apply to
restore_end_command.  Though it is done in the startup process now,
there may be an argument to offload that somewhere else based on the
timing of the end-of-recovery checkpoint.  My opinion on this stuff is
that only including restore_command in the modules would make most
users I know of happy enough as it removes the overhead of the command
invocation from the startup process, if able to replay things fast
enough so as the restore command is the bottleneck.
restore_end_command would be simple enough, but if there is a wish to
redesign the startup process to offload it somewhere else, then the
recovery module makes backward-compatibility concerns harder to think
about in the long-term.
- Do we need to reconsider the assumptions of the startup
process where SIGTERM enforces an immediate shutdown while running
system() for the restore command?  For example, the difference of
behavior when a restore_command uses a system sleep() that does not
react on signals from what I recall?
- Fixing the original issue of this thread may finish by impacting
what you are trying to do in this area, so fixing the original issue
first sounds  like a pre-requirement to me at the end because it may
impact the final design of the modules and their callbacks.  (I have
not looked at all the arguments raised about what to do with ~15,
still it does not look like we have a clear picture here yet.)
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: File descriptors in exec'd subprocesses
Next
From: Michael Paquier
Date:
Subject: Re: undersized unions