On Thu, Jan 26, 2023 at 09:40:58PM -0800, Nathan Bossart wrote:
> On Wed, Jan 25, 2023 at 04:34:21PM +0900, Michael Paquier wrote:
>> The loop part is annoying.. I've never been a fan of adding this
>> cross-value checks for the archiver modules in the first place, and it
>> would make things much simpler in the checkpointer if we need to think
>> about that as we want these values to be reloadable. Perhaps this
>> could just be an exception where we just give priority on one over the
>> other archive_cleanup_command? The startup process has a well-defined
>> sequence after a failure, while the checkpointer is designed to remain
>> robust.
>
> Yeah, there are some problems here. If we ERROR, we'll just bounce back to
> the sigsetjmp() block once a second, and we'll never pick up configuration
> reloads, shutdown signals, etc. If we FATAL, we'll just rapidly restart
> over and over. Given the dicussion about misconfigured archiving
> parameters [0], I doubt folks will be okay with giving priority to one or
> the other.
>
> I'm currently thinking that the checkpointer should set a flag and clear
> the recovery callbacks when a misconfiguration is detected. Anytime the
> checkpointer tries to use the archive-cleanup callback, a WARNING would be
> emitted. This is similar to an approach I proposed for archiving
> misconfigurations (that we didn't proceed with) [1]. Given the
> aformentioned problems, this approach might be more suitable for the
> checkpointer than it is for the archiver.
The more I think about this, the more I wonder whether we really need to
include archive_cleanup_command and recovery_end_command in recovery
modules. Another weird thing with the checkpointer is that the
restore_library will stay loaded long after recovery is finished, and it'll
be loaded regardless of whether recovery is required in the first place.
Of course, that typically won't cause any problems, and we could wait until
we need to do archive cleanup to load the library (and call its shutdown
callback when recovery is finished), but this strikes me as potentially
more complexity than the feature is worth. Perhaps we should just focus on
covering the restore_command functionality for now and add the rest later.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com