Thread: [HACKERS] error handling in RegisterBackgroundWorker

[HACKERS] error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
If RegisterBackgroundWorker() (the non-dynamic kind that is only
loadable from shared_preload_libraries) fails to register the worker, it
writes a log message and proceeds, ignoring the registration request.  I
think that is a mistake, it should be a hard error.  The only way in
practice to fix the problem is to change shared_preload_libraries or
max_worker_processes, both requiring a restart anyway, so proceeding
without the worker is not useful.

Perhaps this kind of worker has not been widely used in practice, but we
now have the logical replication launcher registering that way, and the
auto-prewarm patch also proposes to add one.  If you run out of worker
slots before the launcher is supposed to start, it just logs a message
and doesn't start.  That seems prone to confuse.

Attached is a proposed patch (0002) to change the log level to ERROR.
The other patch (0001) gives some additional error context for the
log/error message that you get.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Robert Haas
Date:
On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> If RegisterBackgroundWorker() (the non-dynamic kind that is only
> loadable from shared_preload_libraries) fails to register the worker, it
> writes a log message and proceeds, ignoring the registration request.  I
> think that is a mistake, it should be a hard error.  The only way in
> practice to fix the problem is to change shared_preload_libraries or
> max_worker_processes, both requiring a restart anyway, so proceeding
> without the worker is not useful.

I guess the question is whether people will prefer to have the
database start up and be missing the worker, or to have it not start.
As you point out, the former is likely to result in an eventual
restart, but the latter may lead to a longer period of downtime RIGHT
NOW.  People tend to really hate things that make the database not
start, so I'm not sure what's best here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
On 2/15/17 12:11, Robert Haas wrote:
> On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> If RegisterBackgroundWorker() (the non-dynamic kind that is only
>> loadable from shared_preload_libraries) fails to register the worker, it
>> writes a log message and proceeds, ignoring the registration request.  I
>> think that is a mistake, it should be a hard error.  The only way in
>> practice to fix the problem is to change shared_preload_libraries or
>> max_worker_processes, both requiring a restart anyway, so proceeding
>> without the worker is not useful.
> 
> I guess the question is whether people will prefer to have the
> database start up and be missing the worker, or to have it not start.
> As you point out, the former is likely to result in an eventual
> restart, but the latter may lead to a longer period of downtime RIGHT
> NOW.  People tend to really hate things that make the database not
> start, so I'm not sure what's best here.

Any other thoughts on this?  Seems like a potential usability issue.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Michael Paquier
Date:
On Fri, Mar 24, 2017 at 1:20 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2/15/17 12:11, Robert Haas wrote:
>> On Wed, Feb 15, 2017 at 11:30 AM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> If RegisterBackgroundWorker() (the non-dynamic kind that is only
>>> loadable from shared_preload_libraries) fails to register the worker, it
>>> writes a log message and proceeds, ignoring the registration request.  I
>>> think that is a mistake, it should be a hard error.  The only way in
>>> practice to fix the problem is to change shared_preload_libraries or
>>> max_worker_processes, both requiring a restart anyway, so proceeding
>>> without the worker is not useful.
>>
>> I guess the question is whether people will prefer to have the
>> database start up and be missing the worker, or to have it not start.
>> As you point out, the former is likely to result in an eventual
>> restart, but the latter may lead to a longer period of downtime RIGHT
>> NOW.  People tend to really hate things that make the database not
>> start, so I'm not sure what's best here.
>
> Any other thoughts on this?  Seems like a potential usability issue.

What if we just let the user choose what they want with a new switch
in bgw_flags, but keep LOG the default? One behavior and the other
look both sensible to me.

@@ -824,7 +824,8 @@ RegisterBackgroundWorker(BackgroundWorker *worker)                                  "Up to %d
backgroundworkers can be
 
registered with the current settings.",                                  max_worker_processes,
       max_worker_processes),
 
-                 errhint("Consider increasing the configuration
parameter \"max_worker_processes\".")));
+                 errhint("Consider increasing the configuration
parameter \"max_worker_processes\"."),
+                 errcontext("registration of background worker
\"%s\"", worker->bgw_name)));        return;    }
No issues with this bit in 0001.
-- 
Michael



Re: error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
On 3/24/17 02:33, Michael Paquier wrote:
> What if we just let the user choose what they want with a new switch
> in bgw_flags, but keep LOG the default? One behavior and the other
> look both sensible to me.

How specifically would we do that?  And what user would choose the
behavior "start this background worker but don't worry if it doesn't work"?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: error handling in RegisterBackgroundWorker

From
Robert Haas
Date:
On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> How specifically would we do that?  And what user would choose the
> behavior "start this background worker but don't worry if it doesn't work"?

Well, if the background worker is auto-prewarm, you'd probably rather
have the database start rather than get unhappy about auto-prewarm
failing.  If the background worker is your logical replication
launcher it's a bit more serious, but if you have no subscriptions or
they're not that critical, maybe you don't care.  If the background
worker is in charge of telling your failover solution that this node
is up, then starting without it is entirely pointless.

I would be inclined to leave this alone for now and revisit it for a
future release.  I don't feel confident that we really know what the
right thing to do is here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: error handling in RegisterBackgroundWorker

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > How specifically would we do that?  And what user would choose the
> > behavior "start this background worker but don't worry if it doesn't work"?
> 
> Well, if the background worker is auto-prewarm, you'd probably rather
> have the database start rather than get unhappy about auto-prewarm
> failing.  If the background worker is your logical replication
> launcher it's a bit more serious, but if you have no subscriptions or
> they're not that critical, maybe you don't care.  If the background
> worker is in charge of telling your failover solution that this node
> is up, then starting without it is entirely pointless.
> 
> I would be inclined to leave this alone for now and revisit it for a
> future release.  I don't feel confident that we really know what the
> right thing to do is here.

I think the common case is for modules to be critical: you may not care
about it for auto-prewarm, but that seems like a special case.  I would
put it the other way around: having the load fail is a serious problem
unless specifically configured not to be.  I'd do as Peter suggests, and
perhaps allow the current behavior optionally.  In hindsight, the
current behavior seems like a mistake.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Noah Misch
Date:
On Wed, Mar 29, 2017 at 04:58:40PM -0300, Alvaro Herrera wrote:
> Robert Haas wrote:
> > On Wed, Mar 29, 2017 at 2:10 PM, Peter Eisentraut
> > <peter.eisentraut@2ndquadrant.com> wrote:
> > > How specifically would we do that?  And what user would choose the
> > > behavior "start this background worker but don't worry if it doesn't work"?
> > 
> > Well, if the background worker is auto-prewarm, you'd probably rather
> > have the database start rather than get unhappy about auto-prewarm
> > failing.  If the background worker is your logical replication
> > launcher it's a bit more serious, but if you have no subscriptions or
> > they're not that critical, maybe you don't care.  If the background
> > worker is in charge of telling your failover solution that this node
> > is up, then starting without it is entirely pointless.
> > 
> > I would be inclined to leave this alone for now and revisit it for a
> > future release.  I don't feel confident that we really know what the
> > right thing to do is here.
> 
> I think the common case is for modules to be critical: you may not care
> about it for auto-prewarm, but that seems like a special case.  I would
> put it the other way around: having the load fail is a serious problem
> unless specifically configured not to be.  I'd do as Peter suggests, and
> perhaps allow the current behavior optionally.  In hindsight, the
> current behavior seems like a mistake.

Agreed.  There are times when starting up degraded beats failing to start,
particularly when the failing component has complicated dependencies.  In this
case, as detailed upthread, this can only fail in response to basic
misconfiguration.  It's not the kind of thing that will spontaneously fail
after a minor upgrade, for example.


[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
On 4/9/17 22:40, Noah Misch wrote:
> Agreed.  There are times when starting up degraded beats failing to start,
> particularly when the failing component has complicated dependencies.  In this
> case, as detailed upthread, this can only fail in response to basic
> misconfiguration.  It's not the kind of thing that will spontaneously fail
> after a minor upgrade, for example.

If history had been different, we could have implemented, say,
autovacuum or walreceiver on the background worker framework.  I think
unifying some of that might actually be a future project.  Would it be
OK if these processes just logged a warning and didn't start if there
was a misconfiguration?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Noah Misch
Date:
On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
> On 4/9/17 22:40, Noah Misch wrote:
> > Agreed.  There are times when starting up degraded beats failing to start,
> > particularly when the failing component has complicated dependencies.  In this
> > case, as detailed upthread, this can only fail in response to basic
> > misconfiguration.  It's not the kind of thing that will spontaneously fail
> > after a minor upgrade, for example.
> 
> If history had been different, we could have implemented, say,
> autovacuum or walreceiver on the background worker framework.  I think
> unifying some of that might actually be a future project.  Would it be
> OK if these processes just logged a warning and didn't start if there
> was a misconfiguration?

No.  I can't think of any background worker so unimportant that I'd want the
warning only.  Certainly, then, the ones you list are far too important.



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
>> If history had been different, we could have implemented, say,
>> autovacuum or walreceiver on the background worker framework.  I think
>> unifying some of that might actually be a future project.  Would it be
>> OK if these processes just logged a warning and didn't start if there
>> was a misconfiguration?

> No.  I can't think of any background worker so unimportant that I'd want the
> warning only.  Certainly, then, the ones you list are far too important.

Well, I dunno.  We allow the system to start without a functioning stats
collector (cf what happens if we fail to create a working loopback
socket).  But lack of stats will effectively disable autovacuum.  So at
the very least we have some inconsistent decisions here.

Personally I'd err on the side of "starting up degraded is better than
not starting at all".  Or maybe we should invent a GUC to let DBAs
express their preference on that?
        regards, tom lane



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
On 4/10/17 23:22, Tom Lane wrote:
> Personally I'd err on the side of "starting up degraded is better than
> not starting at all".  Or maybe we should invent a GUC to let DBAs
> express their preference on that?

If we defaulted allow_degraded to yes, then users wouldn't find that
setting until they do start up degraded and want to fix things, in which
case they could just fix the settings that caused the degraded startup
in the first place.

If we defaulted to no, then I don't think any user would go in and
change it.  "Sure, I'll allow degraded startup.  That sounds useful."

I think there is no clear agreement here, and no historically consistent
behavior.  I'm prepared to let it go and cross it off the list of open
items.  I believe we should keep thinking about it, but it's not
something that has to hold up beta.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 4/10/17 23:22, Tom Lane wrote:
>> Personally I'd err on the side of "starting up degraded is better than
>> not starting at all".  Or maybe we should invent a GUC to let DBAs
>> express their preference on that?

> If we defaulted allow_degraded to yes, then users wouldn't find that
> setting until they do start up degraded and want to fix things, in which
> case they could just fix the settings that caused the degraded startup
> in the first place.

> If we defaulted to no, then I don't think any user would go in and
> change it.  "Sure, I'll allow degraded startup.  That sounds useful."

Well, they would change it when their server failed to start and they
needed to start it rather than just rebuild from backups.  I'd be fine
with defaulting it off.  I just don't want "can't make a loopback socket"
to be equivalent to "you're screwed and you'll never see your data again".

> I think there is no clear agreement here, and no historically consistent
> behavior.  I'm prepared to let it go and cross it off the list of open
> items.  I believe we should keep thinking about it, but it's not
> something that has to hold up beta.

Agreed, this doesn't seem like a must-fix-for-beta consideration.
        regards, tom lane



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
"David G. Johnston"
Date:
On Tue, Apr 11, 2017 at 8:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 4/10/17 23:22, Tom Lane wrote:
>> Personally I'd err on the side of "starting up degraded is better than
>> not starting at all".  Or maybe we should invent a GUC to let DBAs
>> express their preference on that?

> If we defaulted allow_degraded to yes, then users wouldn't find that
> setting until they do start up degraded and want to fix things, in which
> case they could just fix the settings that caused the degraded startup
> in the first place.

> If we defaulted to no, then I don't think any user would go in and
> change it.  "Sure, I'll allow degraded startup.  That sounds useful."

Well, they would change it when their server failed to start and they
needed to start it rather than just rebuild from backups.  I'd be fine
with defaulting it off.  I just don't want "can't make a loopback socket"
to be equivalent to "you're screwed and you'll never see your data again".

​A potential middle-ground is to start, but then only allow superuser connections.  At least then if the configuration problem is sitting postgresql.conf.auto the superuser can issue ALTER SYSETM to fix it; and can be reassured that worse case they can at least see their data.

If that was a soft setting they could also run a function to enable normal access if they so choose.  In effect its a "default to off" mode with an easy way to force the system to become live - but without a GUC (so they couldn't make the decision permanent...which seems like a feature)

David J.

Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Peter Eisentraut
Date:
On 4/11/17 11:47, David G. Johnston wrote:
> ​A potential middle-ground is to start, but then only allow superuser
> connections.

Then you might as well start and allow all connections.  I don't see
what this buys.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
"David G. Johnston"
Date:
On Tue, Apr 11, 2017 at 11:10 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 4/11/17 11:47, David G. Johnston wrote:
> ​A potential middle-ground is to start, but then only allow superuser
> connections.

Then you might as well start and allow all connections.  I don't see
what this buys.


​If "leave it offline until it gets fixed" is on the table then there is some underlying reason that we'd not want application (or replication) users connecting to the database while it is in a degraded state.  Even if one accepts that premise that doesn't mean that an administrator shouldn't be allowed to login and do ad-hoc stuff; the goal of the prevention is to disallow programmed external actors that assume/require that these background worker processes are active from connecting while they are not.  This middle-ground accommodates that goal​ in a precise manner.

I don't have an opinion as to which extreme is better so in the absence I'm in favor of "put control in the hands of the administrator" - this just provides a slightly more usable environment for the administrator to operate within.

David J.

Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Noah Misch
Date:
On Tue, Apr 11, 2017 at 11:33:34AM -0400, Tom Lane wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > I think there is no clear agreement here, and no historically consistent
> > behavior.  I'm prepared to let it go and cross it off the list of open
> > items.  I believe we should keep thinking about it, but it's not
> > something that has to hold up beta.
> 
> Agreed, this doesn't seem like a must-fix-for-beta consideration.

Definitely not a beta blocker, agreed.  Would it be okay to release v10 final
with the logical replication launcher soft-failing this way?



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Noah Misch
Date:
On Mon, Apr 10, 2017 at 11:22:38PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Apr 10, 2017 at 09:36:59PM -0400, Peter Eisentraut wrote:
> >> If history had been different, we could have implemented, say,
> >> autovacuum or walreceiver on the background worker framework.  I think
> >> unifying some of that might actually be a future project.  Would it be
> >> OK if these processes just logged a warning and didn't start if there
> >> was a misconfiguration?
> 
> > No.  I can't think of any background worker so unimportant that I'd want the
> > warning only.  Certainly, then, the ones you list are far too important.
> 
> Well, I dunno.  We allow the system to start without a functioning stats
> collector (cf what happens if we fail to create a working loopback
> socket).  But lack of stats will effectively disable autovacuum.  So at
> the very least we have some inconsistent decisions here.

Yep.  I mentioned "complicated dependencies" as a factor, and that's relevant
to the stats collector.  While creating a loopback network socket is not
highly complicated, it does fail in the field, and the user owning PostgreSQL
may lack the means to fix it.  RegisterBackgroundWorker() failure, on the
other hand, implies the DBA changed a PGC_POSTMASTER setting like
shared_preload_libraries or max_logical_replication_workers without raising
max_worker_processes.  Best to get unmistakable feedback and immediately raise
max_worker_processes or rollback the causative GUC change.

> Personally I'd err on the side of "starting up degraded is better than
> not starting at all".  Or maybe we should invent a GUC to let DBAs
> express their preference on that?

Maybe.  I share Peter's doubts.  Also, not all degradation is equal, and one
user may cherish worker A alone while another user cherishes worker B alone.
Still, maybe.



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Tue, Apr 11, 2017 at 11:10 AM, Peter Eisentraut <
> peter.eisentraut@2ndquadrant.com> wrote:
>> On 4/11/17 11:47, David G. Johnston wrote:
>>> ​A potential middle-ground is to start, but then only allow superuser
>>> connections.

>> Then you might as well start and allow all connections.  I don't see
>> what this buys.

> If "leave it offline until it gets fixed" is on the table then there is
> some underlying reason that we'd not want application (or replication)
> users connecting to the database while it is in a degraded state.  Even if
> one accepts that premise that doesn't mean that an administrator shouldn't
> be allowed to login and do ad-hoc stuff; the goal of the prevention is to
> disallow programmed external actors that assume/require that these
> background worker processes are active from connecting while they are not.
> This middle-ground accommodates that goal in a precise manner.

Only if you assume that those external scripts aren't connecting as
superuser.  Unfortunately, that assumption is probably widely violated,
especially in the sort of less-well-run installations that are most
at risk for the kind of problem we're speculating about here.

> I don't have an opinion as to which extreme is better so in the absence I'm
> in favor of "put control in the hands of the administrator" - this just
> provides a slightly more usable environment for the administrator to
> operate within.

Yeah, "usable environment" is key here.  A point I meant to bring up
earlier today is that we do already have a solution for degraded
operation, ie, run a standalone backend.  But that's so awful, from
any standpoint including UI friendliness (no psql, let alone pgadmin or
other GUI tools), performance (no bgwriter, walwriter, stats collector,
or autovacuum), or reliability (no automatic checkpoints, never mind
replication), that nobody in their right mind would choose to use it
until their back was very hard against the wall.  So what we're
really discussing here is intermediate operating modes where you have
at least some of those facilities.  I doubt there are black-and-white
answers, but I still believe in the likely usefulness of a mode where
we start as much of that stuff as we can.
        regards, tom lane



Re: [HACKERS] error handling in RegisterBackgroundWorker

From
Robert Haas
Date:
On Tue, Apr 11, 2017 at 10:13 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Apr 11, 2017 at 11:33:34AM -0400, Tom Lane wrote:
>> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> > I think there is no clear agreement here, and no historically consistent
>> > behavior.  I'm prepared to let it go and cross it off the list of open
>> > items.  I believe we should keep thinking about it, but it's not
>> > something that has to hold up beta.
>>
>> Agreed, this doesn't seem like a must-fix-for-beta consideration.
>
> Definitely not a beta blocker, agreed.  Would it be okay to release v10 final
> with the logical replication launcher soft-failing this way?

I'm not really in favor of making a behavior change here, so it would
be fine with me.  Obviously people who think the behavior should be
changed are more likely to disagree with that idea.  Maybe in the long
run we should have a command-line option (not a GUC) that's like...

postgres --soldier-on-valiently

...and then when that's not supplied we can die but when it is
supplied we persist in spite of all failures that are in any way
recoverable.  However, I think that's really a new development effort,
not a cleanup item for v10.  I share Tom's concern to the effect that
single-user mode is a really awful way to try to recover from
failures, so we should avoid decisions that force people into that
recovery mode more often.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company