Re: dynamic background workers, round two - Mailing list pgsql-hackers

From Andres Freund
Subject Re: dynamic background workers, round two
Date
Msg-id 20130811053136.GC7843@alap2.anarazel.de
Whole thread Raw
In response to dynamic background workers, round two  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: dynamic background workers, round two  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
[sent again, previously sent as reply, instead of reply-all, thanks
Robert]

On 2013-08-09 09:09:08 -0400, Robert Haas wrote:
> On Fri, Jul 26, 2013 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> It doesn't need to be the postmaster, but I think we need to provide
> >> central infrastructure for that. I don't want this to end up being
> >> redone poorly in multiple places.
> >> I just wanted to mention it, it obviously doesn't need to be implemented
> >> at the same time.
> >
> > OK, makes sense.  I'm still working out a design for this part of it;
> > I think I want to get the dynamic shared memory stuff working first.

Makes sense. I just wanted to mention that we should keep on the back of
our minds while working on this.

> >>> >> +#define InvalidPid                           (-1)
> >>> >> +
> >>> >
> >>> > That value strikes me as a rather bad idea. As a pid that represents
> >>> > init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
> >>> > rather not spread that outside async.c.
> >>>
> >>> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> >>> And if the postgres process has permission to send signals to init,
> >>> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
> >>> root).  So I'm not very worried about it.
> >>
> >> I don't think somebody mistakenly kill()ing somebody with -1 is all too
> >> likely, but it's a valid value to pass. That's why I dislike using it as
> >> constant for an invalid value.
> >
> > Well, so is any negative number, but we should never be trying to kill
> > process groups.

Maybe it's just my dislike for constants that are easy to misuse - even
if it's not really relevant for the specific case.

> >>> I've got two functions that
> >>> need distinguishable return values for the not-yet-started case and
> >>> the already-dead case, neither of which can be real PIDs.  If we don't
> >>> use 0 and -1, the alternative is presumably to complicate the calling
> >>> signature with another out parameter.  That does not seem like a net
> >>> improvement.
> >>
> >> I actually think those cases would be better served by an error code
> >> mechanism which is extensible.
> >
> > Hmm.  What signature would you suggest?

so, the two functions are:
* int GetBackgroundWorkerPid(BackgroundWorkerHandle *handle);
* int WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle);

The return value means about the same with:
* If the return value is InvalidPid, it indicates that the worker has not yet been started by the postmaster.
* If the return value is 0, it indicates that the worker was started by the postmaster but is no longer running.
* Any other return value is the PID of the running worker.

So, I'd suggest something like:

typedef enum BgwHandleStatus {  BGWH_SUCCESS, /* sucessfully got status */  BGWH_NOT_YET, /* worker hasn't started yet
*/ BGWH_GONE, /* worker had been started, but shut down already */  BGWH_POSTMASTER_DIED /* well, there you go */
 
} BgwHandleStatus;


BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pid);
BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);

> >> I can see the point in the argument, but it still seems to be grotty
> >> architecturally.
> >
> > Suck it up.  :-)

That I am able to do ;) Holidays help :P

> > I considered the idea of using SIGUSR2, but it didn't seem worth it.
> >
> >>> I don't understand what you mean by this.  Why does it ignore postmaster death?
> >>
> >> The other callers that check for WL_POSTMASTER_DEATH actually perform
> >> drastic measures like exit()ing upon postmaster. Here you just
> >> return. With the unix latch implementation it seems to be guaranteed
> >> that the next WaitLatch() done somewhere else will also return
> >> WL_POSTMASTER_DEATH. I am not sure whether it might actually "consume"
> >> the death in the windows implementation.
> >
> > Well, IMHO, every backend should exit as quick as possible upon
> > discovering that the postmaster has died.  However, for complex
> > political reasons (ah-Tom-Lane-choo!), our policy is to have the
> > auxiliary processes exit and regular backends soldier on.  This seems
> > completely stupid to me, but it's not this patch's job to reverse that
> > policy decision.

While I'd phrase it more politely/complex, agreed. And agreed with not
changing policy here.

I am not trying to argue for a different behaviour, but basically
wondering whether we need something like "LatchRearmPostmasterDeath()"
for the benefit of the windows implementation.

Maybe I am just dumb or years of not really touching windows got to me,
but I find the documentation around windows specific stuff utterly
lacking. There doesn't seem to be any comment on how the windows signal
queue stuff works on a high level.

> > To put that another way, one could equally well ask why WaitLatch(),
> > or for that matter PostmasterIsAlive(), doesn't contain ereport(FATAL)
> > upon discovering that the postmaster has given up the ghost.  And the
> > answer is that it's the job of those functions to provide information,
> > not make policy decisions.

No, there's actually good reasons for that imo. We e.g. want to be able
to have different handling in the wal writer. Or possibly at some later
point in some special background worker that sends keepalives and wants
to send an explicit message about this.

> >> I don't have a problem with the restriction, but I'd like to see a check
> >> against it. Maybe check for MyBackendId != InvalidBackendId in
> >> RegisterDynamicBackgroundWorker()? That would also prevent starting
> >> further bgworkers before BackgroundWorkerInitializeConnection() is done
> >> in a connected bgworker which seems to be a good thing.
> >
> > Well, that's easy enough to fix.  Should we Assert() or elog() or
> > what?

Some things I suggest actually are easy. Nice ;). I'd vote for elog(),
there's not much cost associated with it and a better error message is
good.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: killing pg_dump leaves backend process
Next
From: Jov
Date:
Subject: Re: [GENERAL] Enable WAL Archive in Replication server