Thread: [HACKERS] error handling in RegisterBackgroundWorker
If RegisterBackgroundWorker() (the non-dynamic kind that is only loadable from shared_preload_libraries) fails to register the worker, it writes a log message and proceeds, ignoring the registration request.  I think that is a mistake; it should be a hard error.  The only way in practice to fix the problem is to change shared_preload_libraries or max_worker_processes, both of which require a restart anyway, so proceeding without the worker is not useful.

Perhaps this kind of worker has not been widely used in practice, but we now have the logical replication launcher registering that way, and the auto-prewarm patch also proposes to add one.  If you run out of worker slots before the launcher is supposed to start, it just logs a message and doesn't start.  That seems likely to confuse users.

Attached is a proposed patch (0002) to change the log level to ERROR.  The other patch (0001) adds some additional error context to the log/error message that you get.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> If RegisterBackgroundWorker() (the non-dynamic kind that is only
> loadable from shared_preload_libraries) fails to register the worker, it
> writes a log message and proceeds, ignoring the registration request.  I
> think that is a mistake, it should be a hard error.  The only way in
> practice to fix the problem is to change shared_preload_libraries or
> max_worker_processes, both requiring a restart anyway, so proceeding
> without the worker is not useful.

I guess the question is whether people will prefer to have the database start up and be missing the worker, or to have it not start.  As you point out, the former is likely to result in an eventual restart, but the latter may lead to a longer period of downtime RIGHT NOW.  People tend to really hate things that make the database not start, so I'm not sure what's best here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/15/17 12:11, Robert Haas wrote:
> On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> If RegisterBackgroundWorker() (the non-dynamic kind that is only
>> loadable from shared_preload_libraries) fails to register the worker, it
>> writes a log message and proceeds, ignoring the registration request.  I
>> think that is a mistake, it should be a hard error.  The only way in
>> practice to fix the problem is to change shared_preload_libraries or
>> max_worker_processes, both requiring a restart anyway, so proceeding
>> without the worker is not useful.
>
> I guess the question is whether people will prefer to have the
> database start up and be missing the worker, or to have it not start.
> As you point out, the former is likely to result in an eventual
> restart, but the latter may lead to a longer period of downtime RIGHT
> NOW.  People tend to really hate things that make the database not
> start, so I'm not sure what's best here.

Any other thoughts on this?  Seems like a potential usability issue.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 24, 2017 at 1:20 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2/15/17 12:11, Robert Haas wrote:
>> On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> If RegisterBackgroundWorker() (the non-dynamic kind that is only
>>> loadable from shared_preload_libraries) fails to register the worker, it
>>> writes a log message and proceeds, ignoring the registration request.  I
>>> think that is a mistake, it should be a hard error.  The only way in
>>> practice to fix the problem is to change shared_preload_libraries or
>>> max_worker_processes, both requiring a restart anyway, so proceeding
>>> without the worker is not useful.
>>
>> I guess the question is whether people will prefer to have the
>> database start up and be missing the worker, or to have it not start.
>> As you point out, the former is likely to result in an eventual
>> restart, but the latter may lead to a longer period of downtime RIGHT
>> NOW.  People tend to really hate things that make the database not
>> start, so I'm not sure what's best here.
>
> Any other thoughts on this?  Seems like a potential usability issue.

What if we just let the user choose what they want with a new switch in bgw_flags, but keep LOG the default?  One behavior and the other look both sensible to me.

@@ -824,7 +824,8 @@ RegisterBackgroundWorker(BackgroundWorker *worker)
                                 "Up to %d background workers can be registered with the current settings.",
                                 max_worker_processes, max_worker_processes),
-                errhint("Consider increasing the configuration parameter \"max_worker_processes\".")));
+                errhint("Consider increasing the configuration parameter \"max_worker_processes\"."),
+                errcontext("registration of background worker \"%s\"", worker->bgw_name)));
                return;
        }

No issues with this bit in 0001.
-- 
Michael
On 3/24/17 02:33, Michael Paquier wrote:
> What if we just let the user choose what they want with a new switch
> in bgw_flags, but keep LOG the default?  One behavior and the other
> look both sensible to me.

How specifically would we do that?  And what user would choose the behavior "start this background worker but don't worry if it doesn't work"?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> How specifically would we do that?  And what user would choose the
> behavior "start this background worker but don't worry if it doesn't work"?

Well, if the background worker is auto-prewarm, you'd probably rather have the database start than get unhappy about auto-prewarm failing.  If the background worker is your logical replication launcher it's a bit more serious, but if you have no subscriptions or they're not that critical, maybe you don't care.  If the background worker is in charge of telling your failover solution that this node is up, then starting without it is entirely pointless.

I would be inclined to leave this alone for now and revisit it for a future release.  I don't feel confident that we really know what the right thing to do is here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> How specifically would we do that?  And what user would choose the
>> behavior "start this background worker but don't worry if it doesn't work"?
>
> Well, if the background worker is auto-prewarm, you'd probably rather
> have the database start rather than get unhappy about auto-prewarm
> failing.  If the background worker is your logical replication
> launcher it's a bit more serious, but if you have no subscriptions or
> they're not that critical, maybe you don't care.  If the background
> worker is in charge of telling your failover solution that this node
> is up, then starting without it is entirely pointless.
>
> I would be inclined to leave this alone for now and revisit it for a
> future release.  I don't feel confident that we really know what the
> right thing to do is here.

I think the common case is for modules to be critical: you may not care about it for auto-prewarm, but that seems like a special case.  I would put it the other way around: having the load fail is a serious problem unless specifically configured not to be.  I'd do as Peter suggests, and perhaps allow the current behavior optionally.  In hindsight, the current behavior seems like a mistake.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 29, 2017 at 04:58:40PM -0300, Alvaro Herrera wrote:
> Robert Haas wrote:
>> On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> How specifically would we do that?  And what user would choose the
>>> behavior "start this background worker but don't worry if it doesn't work"?
>>
>> Well, if the background worker is auto-prewarm, you'd probably rather
>> have the database start rather than get unhappy about auto-prewarm
>> failing.  If the background worker is your logical replication
>> launcher it's a bit more serious, but if you have no subscriptions or
>> they're not that critical, maybe you don't care.  If the background
>> worker is in charge of telling your failover solution that this node
>> is up, then starting without it is entirely pointless.
>>
>> I would be inclined to leave this alone for now and revisit it for a
>> future release.  I don't feel confident that we really know what the
>> right thing to do is here.
>
> I think the common case is for modules to be critical: you may not care
> about it for auto-prewarm, but that seems like a special case.  I would
> put it the other way around: having the load fail is a serious problem
> unless specifically configured not to be.  I'd do as Peter suggests, and
> perhaps allow the current behavior optionally.  In hindsight, the
> current behavior seems like a mistake.

Agreed.  There are times when starting up degraded beats failing to start, particularly when the failing component has complicated dependencies.  In this case, as detailed upthread, this can only fail in response to basic misconfiguration.  It's not the kind of thing that will spontaneously fail after a minor upgrade, for example.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter, since you committed the patch believed to have created it, you own this open item.  If some other commit is more relevant or if this does not belong as a v10 open item, please let us know.  Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message.  Include a date for your subsequent status update.  Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10.  Consequently, I will appreciate your efforts toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 4/9/17 22:40, Noah Misch wrote:
> Agreed.  There are times when starting up degraded beats failing to start,
> particularly when the failing component has complicated dependencies.  In this
> case, as detailed upthread, this can only fail in response to basic
> misconfiguration.  It's not the kind of thing that will spontaneously fail
> after a minor upgrade, for example.

If history had been different, we could have implemented, say, autovacuum or walreceiver on the background worker framework.  I think unifying some of that might actually be a future project.  Would it be OK if these processes just logged a warning and didn't start if there was a misconfiguration?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
> On 4/9/17 22:40, Noah Misch wrote:
>> Agreed.  There are times when starting up degraded beats failing to start,
>> particularly when the failing component has complicated dependencies.  In this
>> case, as detailed upthread, this can only fail in response to basic
>> misconfiguration.  It's not the kind of thing that will spontaneously fail
>> after a minor upgrade, for example.
>
> If history had been different, we could have implemented, say,
> autovacuum or walreceiver on the background worker framework.  I think
> unifying some of that might actually be a future project.  Would it be
> OK if these processes just logged a warning and didn't start if there
> was a misconfiguration?

No.  I can't think of any background worker so unimportant that I'd want the warning only.  Certainly, then, the ones you list are far too important.
Noah Misch <noah@leadboat.com> writes:
> On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
>> If history had been different, we could have implemented, say,
>> autovacuum or walreceiver on the background worker framework.  I think
>> unifying some of that might actually be a future project.  Would it be
>> OK if these processes just logged a warning and didn't start if there
>> was a misconfiguration?

> No.  I can't think of any background worker so unimportant that I'd want the
> warning only.  Certainly, then, the ones you list are far too important.

Well, I dunno.  We allow the system to start without a functioning stats collector (cf what happens if we fail to create a working loopback socket).  But lack of stats will effectively disable autovacuum.  So at the very least we have some inconsistent decisions here.

Personally I'd err on the side of "starting up degraded is better than not starting at all".  Or maybe we should invent a GUC to let DBAs express their preference on that?

			regards, tom lane
On 4/10/17 23:22, Tom Lane wrote:
> Personally I'd err on the side of "starting up degraded is better than
> not starting at all".  Or maybe we should invent a GUC to let DBAs
> express their preference on that?

If we defaulted allow_degraded to yes, then users wouldn't find that setting until they do start up degraded and want to fix things, in which case they could just fix the settings that caused the degraded startup in the first place.

If we defaulted to no, then I don't think any user would go in and change it.  "Sure, I'll allow degraded startup.  That sounds useful."

I think there is no clear agreement here, and no historically consistent behavior.  I'm prepared to let it go and cross it off the list of open items.  I believe we should keep thinking about it, but it's not something that has to hold up beta.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 4/10/17 23:22, Tom Lane wrote:
>> Personally I'd err on the side of "starting up degraded is better than
>> not starting at all".  Or maybe we should invent a GUC to let DBAs
>> express their preference on that?

> If we defaulted allow_degraded to yes, then users wouldn't find that
> setting until they do start up degraded and want to fix things, in which
> case they could just fix the settings that caused the degraded startup
> in the first place.

> If we defaulted to no, then I don't think any user would go in and
> change it.  "Sure, I'll allow degraded startup.  That sounds useful."

Well, they would change it when their server failed to start and they needed to start it rather than just rebuild from backups.  I'd be fine with defaulting it off.  I just don't want "can't make a loopback socket" to be equivalent to "you're screwed and you'll never see your data again".

> I think there is no clear agreement here, and no historically consistent
> behavior.  I'm prepared to let it go and cross it off the list of open
> items.  I believe we should keep thinking about it, but it's not
> something that has to hold up beta.

Agreed, this doesn't seem like a must-fix-for-beta consideration.

			regards, tom lane
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 4/10/17 23:22, Tom Lane wrote:
>>> Personally I'd err on the side of "starting up degraded is better than
>>> not starting at all".  Or maybe we should invent a GUC to let DBAs
>>> express their preference on that?
>
>> If we defaulted allow_degraded to yes, then users wouldn't find that
>> setting until they do start up degraded and want to fix things, in which
>> case they could just fix the settings that caused the degraded startup
>> in the first place.
>
>> If we defaulted to no, then I don't think any user would go in and
>> change it.  "Sure, I'll allow degraded startup.  That sounds useful."
>
> Well, they would change it when their server failed to start and they
> needed to start it rather than just rebuild from backups.  I'd be fine
> with defaulting it off.  I just don't want "can't make a loopback socket"
> to be equivalent to "you're screwed and you'll never see your data again".
A potential middle ground is to start, but then only allow superuser connections.  At least then, if the configuration problem is sitting in postgresql.auto.conf, the superuser can issue ALTER SYSTEM to fix it, and can be reassured that worst case they can at least see their data.

If that were a soft setting they could also run a function to enable normal access if they so choose.  In effect it's a "default to off" mode with an easy way to force the system to become live, but without a GUC (so they couldn't make the decision permanent... which seems like a feature).
David J.
On 4/11/17 11:47, David G. Johnston wrote:
> A potential middle-ground is to start, but then only allow superuser
> connections.

Then you might as well start and allow all connections.  I don't see what this buys.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 4/11/17 11:47, David G. Johnston wrote:
>> A potential middle-ground is to start, but then only allow superuser
>> connections.
>
> Then you might as well start and allow all connections.  I don't see
> what this buys.
If "leave it offline until it gets fixed" is on the table, then there is some underlying reason that we'd not want application (or replication) users connecting to the database while it is in a degraded state.  Even if one accepts that premise, that doesn't mean an administrator shouldn't be allowed to log in and do ad hoc work; the goal of the prevention is to disallow programmed external actors that assume/require that these background worker processes are active from connecting while they are not.  This middle ground accommodates that goal in a precise manner.

I don't have an opinion as to which extreme is better, so in the absence of agreement I'm in favor of "put control in the hands of the administrator"; this just provides a slightly more usable environment for the administrator to operate within.
David J.
On Tue, Apr 11, 2017 at 11:33:34AM -0400, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> I think there is no clear agreement here, and no historically consistent
>> behavior.  I'm prepared to let it go and cross it off the list of open
>> items.  I believe we should keep thinking about it, but it's not
>> something that has to hold up beta.
>
> Agreed, this doesn't seem like a must-fix-for-beta consideration.

Definitely not a beta blocker, agreed.  Would it be okay to release v10 final with the logical replication launcher soft-failing this way?
On Mon, Apr 10, 2017 at 11:22:38PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
>>> If history had been different, we could have implemented, say,
>>> autovacuum or walreceiver on the background worker framework.  I think
>>> unifying some of that might actually be a future project.  Would it be
>>> OK if these processes just logged a warning and didn't start if there
>>> was a misconfiguration?
>
>> No.  I can't think of any background worker so unimportant that I'd want the
>> warning only.  Certainly, then, the ones you list are far too important.
>
> Well, I dunno.  We allow the system to start without a functioning stats
> collector (cf what happens if we fail to create a working loopback
> socket).  But lack of stats will effectively disable autovacuum.  So at
> the very least we have some inconsistent decisions here.

Yep.  I mentioned "complicated dependencies" as a factor, and that's relevant to the stats collector.  While creating a loopback network socket is not highly complicated, it does fail in the field, and the user owning PostgreSQL may lack the means to fix it.  RegisterBackgroundWorker() failure, on the other hand, implies the DBA changed a PGC_POSTMASTER setting like shared_preload_libraries or max_logical_replication_workers without raising max_worker_processes.  Best to get unmistakable feedback and immediately raise max_worker_processes or roll back the causative GUC change.

> Personally I'd err on the side of "starting up degraded is better than
> not starting at all".  Or maybe we should invent a GUC to let DBAs
> express their preference on that?

Maybe.  I share Peter's doubts.  Also, not all degradation is equal, and one user may cherish worker A alone while another user cherishes worker B alone.  Still, maybe.
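Concretely, the failure mode Noah describes comes from a configuration change like the following illustrative postgresql.conf fragment (pg_prewarm is just an example of a library that registers a static worker; the values shown are not recommendations):

```
# Each shared_preload_libraries entry that registers a static worker,
# plus built-in users of the framework (e.g. the logical replication
# launcher), consumes a slot from max_worker_processes.
shared_preload_libraries = 'pg_prewarm'
max_worker_processes = 8    # raise alongside worker-registering changes,
                            # or registration can silently fail at startup
```

Both settings are PGC_POSTMASTER, so fixing an exhausted slot count always requires a restart, which is the basis of the argument that a soft failure buys nothing.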
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Tue, Apr 11, 2017 at 11:10 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 4/11/17 11:47, David G. Johnston wrote:
>>> A potential middle-ground is to start, but then only allow superuser
>>> connections.
>> Then you might as well start and allow all connections.  I don't see
>> what this buys.

> If "leave it offline until it gets fixed" is on the table then there is
> some underlying reason that we'd not want application (or replication)
> users connecting to the database while it is in a degraded state.  Even if
> one accepts that premise that doesn't mean that an administrator shouldn't
> be allowed to login and do ad-hoc stuff; the goal of the prevention is to
> disallow programmed external actors that assume/require that these
> background worker processes are active from connecting while they are not.
> This middle-ground accommodates that goal in a precise manner.

Only if you assume that those external scripts aren't connecting as superuser.  Unfortunately, that assumption is probably widely violated, especially in the sort of less-well-run installations that are most at risk for the kind of problem we're speculating about here.

> I don't have an opinion as to which extreme is better so in the absence I'm
> in favor of "put control in the hands of the administrator" - this just
> provides a slightly more usable environment for the administrator to
> operate within.

Yeah, "usable environment" is key here.  A point I meant to bring up earlier today is that we do already have a solution for degraded operation, ie, run a standalone backend.  But that's so awful, from any standpoint including UI friendliness (no psql, let alone pgadmin or other GUI tools), performance (no bgwriter, walwriter, stats collector, or autovacuum), or reliability (no automatic checkpoints, never mind replication), that nobody in their right mind would choose to use it until their back was very hard against the wall.

So what we're really discussing here is intermediate operating modes where you have at least some of those facilities.  I doubt there are black-and-white answers, but I still believe in the likely usefulness of a mode where we start as much of that stuff as we can.

			regards, tom lane
On Tue, Apr 11, 2017 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Apr 11, 2017 at 11:33:34AM -0400, Tom Lane wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>>> I think there is no clear agreement here, and no historically consistent
>>> behavior.  I'm prepared to let it go and cross it off the list of open
>>> items.  I believe we should keep thinking about it, but it's not
>>> something that has to hold up beta.
>>
>> Agreed, this doesn't seem like a must-fix-for-beta consideration.
>
> Definitely not a beta blocker, agreed.  Would it be okay to release v10 final
> with the logical replication launcher soft-failing this way?

I'm not really in favor of making a behavior change here, so it would be fine with me.  Obviously people who think the behavior should be changed are more likely to disagree with that idea.

Maybe in the long run we should have a command-line option (not a GUC) that's like...

postgres --soldier-on-valiantly

...and then when that's not supplied we can die, but when it is supplied we persist in spite of all failures that are in any way recoverable.  However, I think that's really a new development effort, not a cleanup item for v10.  I share Tom's concern to the effect that single-user mode is a really awful way to try to recover from failures, so we should avoid decisions that force people into that recovery mode more often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company