Thread: allow specifying action when standby encounters incompatible parameter settings
allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
Hi hackers, As of 15251c0, when a standby encounters an incompatible parameter change, it pauses replay so that read traffic can continue while the administrator fixes the parameters. Once the server is restarted, replay can continue. Before this change, such incompatible parameter changes caused the standby to immediately shut down. I noticed that there was some suggestion in the thread associated with 15251c0 [0] for making this behavior configurable, but there didn't seem to be much interest at the time. I am interested in allowing administrators to specify the behavior before 15251c0 (i.e., immediately shut down the standby when an incompatible parameter change is detected). The use-case I have in mind is when an administrator has automation in place for adjusting these parameters and would like to avoid stopping replay any longer than necessary. FWIW this is what we do in RDS. I've attached a patch that adds a new GUC where users can specify the action to take when an incompatible parameter change is detected on a standby. For now, there are just two options: 'pause' and 'shutdown'. This new GUC is largely modeled after recovery_target_action. I initially set out to see if it was possible to automatically adjust these parameters on a standby, but that is considerably more difficult. It isn't enough to just hook into the restart_after_crash functionality since it doesn't go back far enough in the postmaster logic. IIUC we'd need to reload preloaded libraries (which there is presently no support for), recalculate MaxBackends, etc. Another option I considered was to automatically adjust the parameters during startup so that you just need to restart the server. However, we need to know for sure that the server is going to be a hot standby, and I don't believe we have that information where such GUC changes would need to occur (I could be wrong about this). Anyway, for now I'm just proposing the modest change described above, but I'd welcome any discussion about improving matters further in this area. [0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Re: allow specifying action when standby encounters incompatible parameter settings
From
Kyotaro Horiguchi
Date:
At Wed, 13 Apr 2022 14:35:21 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > Hi hackers, > > As of 15251c0, when a standby encounters an incompatible parameter change, > it pauses replay so that read traffic can continue while the administrator > fixes the parameters. Once the server is restarted, replay can continue. > Before this change, such incompatible parameter changes caused the standby > to immediately shut down. > > I noticed that there was some suggestion in the thread associated with > 15251c0 [0] for making this behavior configurable, but there didn't seem to > be much interest at the time. I am interested in allowing administrators > to specify the behavior before 15251c0 (i.e., immediately shut down the > standby when an incompatible parameter change is detected). The use-case I > have in mind is when an administrator has automation in place for adjusting > these parameters and would like to avoid stopping replay any longer than > necessary. FWIW this is what we do in RDS. > > I've attached a patch that adds a new GUC where users can specify the > action to take when an incompatible parameter change is detected on a > standby. For now, there are just two options: 'pause' and 'shutdown'. > This new GUC is largely modeled after recovery_target_action. The overall direction of going to shutdown without needing user interaction seems fine. I think the same can be done by timeout. That is, we provide a GUC named like insufficient_standby_setting_shutdown_timeout (mmm. too long..), then recovery sits down for the duration then shuts down. -1 means the current behavior, 0 means what this patch is going to introduce. However I don't see a concrete use case of the timeout. > I initially set out to see if it was possible to automatically adjust these > parameters on a standby, but that is considerably more difficult. It isn't > enough to just hook into the restart_after_crash functionality since it > doesn't go back far enough in the postmaster logic. IIUC we'd need to > reload preloaded libraries (which there is presently no support for), > recalculate MaxBackends, etc. Another option I considered was to Sure. > automatically adjust the parameters during startup so that you just need to > restart the server. However, we need to know for sure that the server is > going to be a hot standby, and I don't believe we have that information > where such GUC changes would need to occur (I could be wrong about this). Conldn't we use AlterSystemSetConfigFile for this purpose in CheckRequiredParameterValues? > Anyway, for now I'm just proposing the modest change described above, but > I'd welcome any discussion about improving matters further in this area. > > [0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com Is the reason for the enum the extensibility to add a new choice like "auto-adjust"? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
Thanks for taking a look! On Thu, Apr 14, 2022 at 11:36:11AM +0900, Kyotaro Horiguchi wrote: > At Wed, 13 Apr 2022 14:35:21 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in >> I initially set out to see if it was possible to automatically adjust these >> parameters on a standby, but that is considerably more difficult. It isn't >> enough to just hook into the restart_after_crash functionality since it >> doesn't go back far enough in the postmaster logic. IIUC we'd need to >> reload preloaded libraries (which there is presently no support for), >> recalculate MaxBackends, etc. Another option I considered was to > > Sure. > >> automatically adjust the parameters during startup so that you just need to >> restart the server. However, we need to know for sure that the server is >> going to be a hot standby, and I don't believe we have that information >> where such GUC changes would need to occur (I could be wrong about this). > > Conldn't we use AlterSystemSetConfigFile for this purpose in > CheckRequiredParameterValues? That's an interesting idea. When an incompatible parameter change is detected, the server would automatically run the equivalent of an ALTER SYSTEM SET command before shutting down or pausing replay. I might draft up a proof-of-concept to see what this looks like and how well it works. >> Anyway, for now I'm just proposing the modest change described above, but >> I'd welcome any discussion about improving matters further in this area. >> >> [0] https://postgr.es/m/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b%402ndquadrant.com > > Is the reason for the enum the extensibility to add a new choice like > "auto-adjust"? Yes. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: allow specifying action when standby encounters incompatible parameter settings
From
Cary Huang
Date:
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Hello The patch applied and tested fine. I think for this kind of exception, it really is up to the administrator how he/she shouldproceed to resolve depending on his/her business application. Leaving things configurable by the user is generallya nice and modest change. I also like that you leave the parameters as enum entry so it is possible to extend otherpossible actions such as automatically adjust to match the new value. --------- Cary Huang HighGo Software Canada
Re: allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
On Fri, Apr 29, 2022 at 06:35:52PM +0000, Cary Huang wrote: > The patch applied and tested fine. I think for this kind of exception, it really is up to the administrator how he/sheshould proceed to resolve depending on his/her business application. Leaving things configurable by the user is generallya nice and modest change. I also like that you leave the parameters as enum entry so it is possible to extend otherpossible actions such as automatically adjust to match the new value. Thanks for reviewing! Do you think this patch can be marked as ready-for-committer? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: allow specifying action when standby encounters incompatible parameter settings
From
Simon Riggs
Date:
On Wed, 13 Apr 2022 at 22:35, Nathan Bossart <nathandbossart@gmail.com> wrote: > As of 15251c0, when a standby encounters an incompatible parameter change, > it pauses replay so that read traffic can continue while the administrator > fixes the parameters. Once the server is restarted, replay can continue. > Before this change, such incompatible parameter changes caused the standby > to immediately shut down. > > I noticed that there was some suggestion in the thread associated with > 15251c0 [0] for making this behavior configurable, but there didn't seem to > be much interest at the time. I am interested in allowing administrators > to specify the behavior before 15251c0 (i.e., immediately shut down the > standby when an incompatible parameter change is detected). The use-case I > have in mind is when an administrator has automation in place for adjusting > these parameters and would like to avoid stopping replay any longer than > necessary. FWIW this is what we do in RDS. I don't understand why you need this patch at all. Since you have automation, you can use that layer to automatically restart all standbys at once, if you choose, without any help or hindrance from PostgreSQL. But I really don't want you to do that, since that results in loss of availability of the service. I'd like you to try a little harder and automate this in a way that allows the service to continue with some standbys available while others restart. All this feature does is allow you to implement things in a lazy way that causes a loss of availability for users. I don't think that is of benefit to PostgreSQL users, so -1 from me, on this patch (only), sorry about that. -- Simon Riggs http://www.EnterpriseDB.com/
Re: allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
Thanks for chiming in. On Thu, Jun 23, 2022 at 04:19:45PM +0100, Simon Riggs wrote: > I don't understand why you need this patch at all. > > Since you have automation, you can use that layer to automatically > restart all standbys at once, if you choose, without any help or > hindrance from PostgreSQL. > > But I really don't want you to do that, since that results in loss of > availability of the service. I'd like you to try a little harder and > automate this in a way that allows the service to continue with some > standbys available while others restart. > > All this feature does is allow you to implement things in a lazy way > that causes a loss of availability for users. I don't think that is of > benefit to PostgreSQL users, so -1 from me, on this patch (only), > sorry about that. Overall, this is intended for users that care more about keeping WAL replay caught up than a temporary loss of availability due to a restart. Without this, I'd need to detect that WAL replay has paused due to insufficient parameters and restart Postgres. If І can configure Postgres to automatically shut down in these scenarios, my automation can skip right to adjusting the parameters and starting Postgres up. Of course, if you care more about availability, you'd keep this parameter set to the default (pause) and restart on your own schedule. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: allow specifying action when standby encounters incompatible parameter settings
From
Simon Riggs
Date:
On Thu, 23 Jun 2022 at 18:45, Nathan Bossart <nathandbossart@gmail.com> wrote: > > Thanks for chiming in. > > On Thu, Jun 23, 2022 at 04:19:45PM +0100, Simon Riggs wrote: > > I don't understand why you need this patch at all. > > > > Since you have automation, you can use that layer to automatically > > restart all standbys at once, if you choose, without any help or > > hindrance from PostgreSQL. > > > > But I really don't want you to do that, since that results in loss of > > availability of the service. I'd like you to try a little harder and > > automate this in a way that allows the service to continue with some > > standbys available while others restart. > > > > All this feature does is allow you to implement things in a lazy way > > that causes a loss of availability for users. I don't think that is of > > benefit to PostgreSQL users, so -1 from me, on this patch (only), > > sorry about that. > > Overall, this is intended for users that care more about keeping WAL replay > caught up than a temporary loss of availability due to a restart. Without > this, I'd need to detect that WAL replay has paused due to insufficient > parameters and restart Postgres. If І can configure Postgres to > automatically shut down in these scenarios, my automation can skip right to > adjusting the parameters and starting Postgres up. Of course, if you care > more about availability, you'd keep this parameter set to the default > (pause) and restart on your own schedule. There are a few choices of how we can deal with this situation 1. Make the change blindly and then pick up the pieces afterwards 2. Check the configuration before changes are made, and make the changes in the right order This patch and the above argument assumes that you must do (1), but you could easily do (2). i.e. If you know that changing specific parameters might affect availability, why not query those parameter values on all servers first and check whether the change will affect availability, before you allow the changes? why rely on PostgreSQL to pick up the pieces because the orchestration code doesn't (yet) make configuration sanity checks? This patch would undo a very important change - to keep servers available by default and go back to the old behavior for a huge fleet of Postgres databases. The old behavior of shutdown-on-change caused catastrophe many times for users and in one case brought down a rather large and important service provider, whose CTO explained to me quite clearly how stupid he thought that old behavior was. So I will not easily agree to allowing it to be put back into PostgreSQL, simply to avoid adding a small amount of easy code into an orchestration layer that could and should implement documented best practice. I am otherwise very appreciative of your insightful contributions, just not this specific one. Happy to discuss how we might introduce new parameters/behavior to reduce the list of parameters that need to be kept in lock-step. -- Simon Riggs http://www.EnterpriseDB.com/
Re: allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
On Fri, Jun 24, 2022 at 11:42:29AM +0100, Simon Riggs wrote: > This patch would undo a very important change - to keep servers > available by default and go back to the old behavior for a huge fleet > of Postgres databases. The old behavior of shutdown-on-change caused > catastrophe many times for users and in one case brought down a rather > large and important service provider, whose CTO explained to me quite > clearly how stupid he thought that old behavior was. So I will not > easily agree to allowing it to be put back into PostgreSQL, simply to > avoid adding a small amount of easy code into an orchestration layer > that could and should implement documented best practice. > > I am otherwise very appreciative of your insightful contributions, > just not this specific one. Given this feedback, I intend to mark the associated commitfest entry as Withdrawn at the conclusion of the current commitfest. > Happy to discuss how we might introduce new parameters/behavior to > reduce the list of parameters that need to be kept in lock-step. Me, too. I don't have anything concrete to propose at the moment. I thought Horiguchi-san's idea of automatically running ALTER SYSTEM was intriguing, but I have not explored that in depth. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: allow specifying action when standby encounters incompatible parameter settings
From
Nathan Bossart
Date:
On Mon, Jul 18, 2022 at 03:17:10PM -0700, Nathan Bossart wrote: > Given this feedback, I intend to mark the associated commitfest entry as > Withdrawn at the conclusion of the current commitfest. Done. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com