Thread: allow specifying action when standby encounters incompatible parameter settings

allow specifying action when standby encounters incompatible parameter settings

From
Nathan Bossart
Date:
Hi hackers,

As of 15251c0, when a standby encounters an incompatible parameter change,
it pauses replay so that read traffic can continue while the administrator
fixes the parameters.  Once the server is restarted, replay can continue.
Before this change, such incompatible parameter changes caused the standby
to immediately shut down.

I noticed that there was some suggestion in the thread associated with
15251c0 [0] for making this behavior configurable, but there didn't seem to
be much interest at the time.  I am interested in allowing administrators
to specify the behavior before 15251c0 (i.e., immediately shut down the
standby when an incompatible parameter change is detected).  The use-case I
have in mind is when an administrator has automation in place for adjusting
these parameters and would like to avoid stopping replay any longer than
necessary.  FWIW this is what we do in RDS.

I've attached a patch that adds a new GUC where users can specify the
action to take when an incompatible parameter change is detected on a
standby.  For now, there are just two options: 'pause' and 'shutdown'.
This new GUC is largely modeled after recovery_target_action.

I initially set out to see if it was possible to automatically adjust these
parameters on a standby, but that is considerably more difficult.  It isn't
enough to just hook into the restart_after_crash functionality since it
doesn't go back far enough in the postmaster logic.  IIUC we'd need to
reload preloaded libraries (which there is presently no support for),
recalculate MaxBackends, etc.  Another option I considered was to
automatically adjust the parameters during startup so that you just need to
restart the server.  However, we need to know for sure that the server is
going to be a hot standby, and I don't believe we have that information
where such GUC changes would need to occur (I could be wrong about this).
Anyway, for now I'm just proposing the modest change described above, but
I'd welcome any discussion about improving matters further in this area.

[0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: allow specifying action when standby encounters incompatible parameter settings

From
Kyotaro Horiguchi
Date:
At Wed, 13 Apr 2022 14:35:21 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> Hi hackers,
> 
> As of 15251c0, when a standby encounters an incompatible parameter change,
> it pauses replay so that read traffic can continue while the administrator
> fixes the parameters.  Once the server is restarted, replay can continue.
> Before this change, such incompatible parameter changes caused the standby
> to immediately shut down.
> 
> I noticed that there was some suggestion in the thread associated with
> 15251c0 [0] for making this behavior configurable, but there didn't seem to
> be much interest at the time.  I am interested in allowing administrators
> to specify the behavior before 15251c0 (i.e., immediately shut down the
> standby when an incompatible parameter change is detected).  The use-case I
> have in mind is when an administrator has automation in place for adjusting
> these parameters and would like to avoid stopping replay any longer than
> necessary.  FWIW this is what we do in RDS.
> 
> I've attached a patch that adds a new GUC where users can specify the
> action to take when an incompatible parameter change is detected on a
> standby.  For now, there are just two options: 'pause' and 'shutdown'.
> This new GUC is largely modeled after recovery_target_action.

The overall direction of going to shutdown without needing user
interaction seems fine.  I think the same can be done by
timeout. That is, we provide a GUC named like
insufficient_standby_setting_shutdown_timeout (mmm. too long..), then
recovery sits down for the duration then shuts down. -1 means the
current behavior, 0 means what this patch is going to
introduce. However I don't see a concrete use case of the timeout.

> I initially set out to see if it was possible to automatically adjust these
> parameters on a standby, but that is considerably more difficult.  It isn't
> enough to just hook into the restart_after_crash functionality since it
> doesn't go back far enough in the postmaster logic.  IIUC we'd need to
> reload preloaded libraries (which there is presently no support for),
> recalculate MaxBackends, etc.  Another option I considered was to

Sure.

> automatically adjust the parameters during startup so that you just need to
> restart the server.  However, we need to know for sure that the server is
> going to be a hot standby, and I don't believe we have that information
> where such GUC changes would need to occur (I could be wrong about this).

Conldn't we use AlterSystemSetConfigFile for this purpose in
CheckRequiredParameterValues?

> Anyway, for now I'm just proposing the modest change described above, but
> I'd welcome any discussion about improving matters further in this area.
> 
> [0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com

Is the reason for the enum the extensibility to add a new choice like
"auto-adjust"?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Thanks for taking a look!

On Thu, Apr 14, 2022 at 11:36:11AM +0900, Kyotaro Horiguchi wrote:
> At Wed, 13 Apr 2022 14:35:21 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> I initially set out to see if it was possible to automatically adjust these
>> parameters on a standby, but that is considerably more difficult.  It isn't
>> enough to just hook into the restart_after_crash functionality since it
>> doesn't go back far enough in the postmaster logic.  IIUC we'd need to
>> reload preloaded libraries (which there is presently no support for),
>> recalculate MaxBackends, etc.  Another option I considered was to
> 
> Sure.
> 
>> automatically adjust the parameters during startup so that you just need to
>> restart the server.  However, we need to know for sure that the server is
>> going to be a hot standby, and I don't believe we have that information
>> where such GUC changes would need to occur (I could be wrong about this).
> 
> Conldn't we use AlterSystemSetConfigFile for this purpose in
> CheckRequiredParameterValues?

That's an interesting idea.  When an incompatible parameter change is
detected, the server would automatically run the equivalent of an ALTER
SYSTEM SET command before shutting down or pausing replay.  I might draft
up a proof-of-concept to see what this looks like and how well it works.

>> Anyway, for now I'm just proposing the modest change described above, but
>> I'd welcome any discussion about improving matters further in this area.
>> 
>> [0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com
> 
> Is the reason for the enum the extensibility to add a new choice like
> "auto-adjust"?

Yes.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hello

The patch applied and tested fine. I think for this kind of exception, it really is up to the administrator how he/she
shouldproceed to resolve depending on his/her business application. Leaving things configurable by the user is
generallya nice and modest change. I also like that you leave the parameters as enum entry so it is possible to extend
otherpossible actions such as automatically adjust to match the new value. 
 


---------
Cary Huang
HighGo Software Canada
On Fri, Apr 29, 2022 at 06:35:52PM +0000, Cary Huang wrote:
> The patch applied and tested fine. I think for this kind of exception, it really is up to the administrator how
he/sheshould proceed to resolve depending on his/her business application. Leaving things configurable by the user is
generallya nice and modest change. I also like that you leave the parameters as enum entry so it is possible to extend
otherpossible actions such as automatically adjust to match the new value. 
 

Thanks for reviewing!  Do you think this patch can be marked as
ready-for-committer?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Wed, 13 Apr 2022 at 22:35, Nathan Bossart <nathandbossart@gmail.com> wrote:

> As of 15251c0, when a standby encounters an incompatible parameter change,
> it pauses replay so that read traffic can continue while the administrator
> fixes the parameters.  Once the server is restarted, replay can continue.
> Before this change, such incompatible parameter changes caused the standby
> to immediately shut down.
>
> I noticed that there was some suggestion in the thread associated with
> 15251c0 [0] for making this behavior configurable, but there didn't seem to
> be much interest at the time.  I am interested in allowing administrators
> to specify the behavior before 15251c0 (i.e., immediately shut down the
> standby when an incompatible parameter change is detected).  The use-case I
> have in mind is when an administrator has automation in place for adjusting
> these parameters and would like to avoid stopping replay any longer than
> necessary.  FWIW this is what we do in RDS.

I don't understand why you need this patch at all.

Since you have automation, you can use that layer to automatically
restart all standbys at once, if you choose, without any help or
hindrance from PostgreSQL.

But I really don't want you to do that, since that results in loss of
availability of the service. I'd like you to try a little harder and
automate this in a way that allows the service to continue with some
standbys available while others restart.

All this feature does is allow you to implement things in a lazy way
that causes a loss of availability for users. I don't think that is of
benefit to PostgreSQL users, so -1 from me, on this patch (only),
sorry about that.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Thanks for chiming in.

On Thu, Jun 23, 2022 at 04:19:45PM +0100, Simon Riggs wrote:
> I don't understand why you need this patch at all.
> 
> Since you have automation, you can use that layer to automatically
> restart all standbys at once, if you choose, without any help or
> hindrance from PostgreSQL.
> 
> But I really don't want you to do that, since that results in loss of
> availability of the service. I'd like you to try a little harder and
> automate this in a way that allows the service to continue with some
> standbys available while others restart.
> 
> All this feature does is allow you to implement things in a lazy way
> that causes a loss of availability for users. I don't think that is of
> benefit to PostgreSQL users, so -1 from me, on this patch (only),
> sorry about that.

Overall, this is intended for users that care more about keeping WAL replay
caught up than a temporary loss of availability due to a restart.  Without
this, I'd need to detect that WAL replay has paused due to insufficient
parameters and restart Postgres.  If І can configure Postgres to
automatically shut down in these scenarios, my automation can skip right to
adjusting the parameters and starting Postgres up.  Of course, if you care
more about availability, you'd keep this parameter set to the default
(pause) and restart on your own schedule.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Thu, 23 Jun 2022 at 18:45, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Thanks for chiming in.
>
> On Thu, Jun 23, 2022 at 04:19:45PM +0100, Simon Riggs wrote:
> > I don't understand why you need this patch at all.
> >
> > Since you have automation, you can use that layer to automatically
> > restart all standbys at once, if you choose, without any help or
> > hindrance from PostgreSQL.
> >
> > But I really don't want you to do that, since that results in loss of
> > availability of the service. I'd like you to try a little harder and
> > automate this in a way that allows the service to continue with some
> > standbys available while others restart.
> >
> > All this feature does is allow you to implement things in a lazy way
> > that causes a loss of availability for users. I don't think that is of
> > benefit to PostgreSQL users, so -1 from me, on this patch (only),
> > sorry about that.
>
> Overall, this is intended for users that care more about keeping WAL replay
> caught up than a temporary loss of availability due to a restart.  Without
> this, I'd need to detect that WAL replay has paused due to insufficient
> parameters and restart Postgres. If І can configure Postgres to
> automatically shut down in these scenarios, my automation can skip right to
> adjusting the parameters and starting Postgres up.  Of course, if you care
> more about availability, you'd keep this parameter set to the default
> (pause) and restart on your own schedule.

There are a few choices of how we can deal with this situation
1. Make the change blindly and then pick up the pieces afterwards
2. Check the configuration before changes are made, and make the
changes in the right order

This patch and the above argument assumes that you must do (1), but
you could easily do (2).

i.e. If you know that changing specific parameters might affect
availability, why not query those parameter values on all servers
first and check whether the change will affect availability, before
you allow the changes? why rely on PostgreSQL to pick up the pieces
because the orchestration code doesn't (yet) make configuration sanity
checks?

This patch would undo a very important change - to keep servers
available by default and go back to the old behavior for a huge fleet
of Postgres databases. The old behavior of shutdown-on-change caused
catastrophe many times for users and in one case brought down a rather
large and important service provider, whose CTO explained to me quite
clearly how stupid he thought that old behavior was. So I will not
easily agree to allowing it to be put back into PostgreSQL, simply to
avoid adding a small amount of easy code into an orchestration layer
that could and should implement documented best practice.

I am otherwise very appreciative of your insightful contributions,
just not this specific one.

Happy to discuss how we might introduce new parameters/behavior to
reduce the list of parameters that need to be kept in lock-step.

--
Simon Riggs                http://www.EnterpriseDB.com/



On Fri, Jun 24, 2022 at 11:42:29AM +0100, Simon Riggs wrote:
> This patch would undo a very important change - to keep servers
> available by default and go back to the old behavior for a huge fleet
> of Postgres databases. The old behavior of shutdown-on-change caused
> catastrophe many times for users and in one case brought down a rather
> large and important service provider, whose CTO explained to me quite
> clearly how stupid he thought that old behavior was. So I will not
> easily agree to allowing it to be put back into PostgreSQL, simply to
> avoid adding a small amount of easy code into an orchestration layer
> that could and should implement documented best practice.
> 
> I am otherwise very appreciative of your insightful contributions,
> just not this specific one.

Given this feedback, I intend to mark the associated commitfest entry as
Withdrawn at the conclusion of the current commitfest.

> Happy to discuss how we might introduce new parameters/behavior to
> reduce the list of parameters that need to be kept in lock-step.

Me, too.  I don't have anything concrete to propose at the moment.  I
thought Horiguchi-san's idea of automatically running ALTER SYSTEM was
intriguing, but I have not explored that in depth.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Mon, Jul 18, 2022 at 03:17:10PM -0700, Nathan Bossart wrote:
> Given this feedback, I intend to mark the associated commitfest entry as
> Withdrawn at the conclusion of the current commitfest.

Done.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com