Thread: dynamic background workers, round two

dynamic background workers, round two

From
Robert Haas
Date:
The dynamic background workers patch that I submitted for CF1 was
generally well-received, but several people commented on a significant
limitation: there's currently no way for a backend that requests a new
background worker to know whether that background worker was
successfully started.  If you're using the background worker mechanism
to run daemon processes of some sort, this is probably not a huge
problem.  You most likely don't start and stop those processes very
frequently, and when you do manually start them, you can always
examine the server log to see whether everything worked as planned.
Maybe not ideal, but not unbearably awful, either.

However, the goal I'm working towards here is parallel query, and for
that application, and some others, things don't look so good.  In that
case, you are probably launching a background worker flagged as
BGW_NEVER_RESTART, and so the postmaster is going to try to start it
just once, and if it doesn't work, you really need some way of knowing
that.  Of course, if the worker is launched successfully, you can have
it notify the process that started it via any mechanism you choose:
creating a sentinel file, inserting data into a table, setting the
process latch.  Sky's the limit.  However, if the worker isn't
launched successfully (fork fails, or the process crashes before it
reaches your code) you have no way of knowing.  If you don't receive
the agreed-upon notification from the child, it means that EITHER the
process was never started in the first place OR the postmaster just
hasn't gotten around to starting it yet.  Of course, you could wait
for a long time (5s?) and then give up, but there's no good way to
set the wait time.  If you make it long, then it takes a long time for
you to report errors to the client even when those errors happen
quickly.  If you make it short, you may time out spuriously on a busy
system.

The attached patch attempts to remedy this problem.  When you register
a background worker, you can obtain a "handle" that can subsequently
be used to query for the worker's PID.  If you additionally initialize
bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1
when worker startup has been attempted (successfully or not).  You can
wait for this signal by passing your handle to
WaitForBackgroundWorkerStartup(), which will return only when either
(1) an attempt has been made to start the worker or (2) the postmaster
is determined to have died.  This interface seems adequate for
something like worker_spi, where it's useful to know whether the child
was started or not (and return the PID if so) but that's about all
that's really needed.
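
For concreteness, a caller might end up looking something like this
(untested sketch; the worker settings and the entry point name are
illustrative, not part of the patch):

    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;
    int         pid;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "example worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = example_worker_main;  /* illustrative entry point */
    worker.bgw_notify_pid = MyProcPid;      /* ask for SIGUSR1 on startup */

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(ERROR,
                (errmsg("could not register background worker")));

    pid = WaitForBackgroundWorkerStartup(handle);
    if (pid == InvalidPid)
        ereport(ERROR,
                (errmsg("postmaster died before worker could be started")));
    if (pid == 0)
        ereport(ERROR,
                (errmsg("background worker failed during startup")));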

More complex notification interfaces can also be coded using the
primitives introduced by this patch.  Setting bgw_notify_pid will
cause the postmaster to send SIGUSR1 every time it starts the worker
and every time the worker dies.  Every time you receive that signal,
you can call GetBackgroundWorkerPid() for each background worker to
find out which ones have started or terminated.  The only slight
difficulty is in waiting for receipt of that signal, which I
implemented by adding a new global called set_latch_on_sigusr1.
WaitForBackgroundWorkerStartup() uses that flag internally, but
there's nothing to prevent a caller from using it as part of their own
event loop.  I find the set_latch_on_sigusr1 flag to be slightly
ugly, but it seems better and far safer than having the postmaster try
to actually set the latch itself - for example, it'd obviously be
unacceptable to pass a Latch * to the postmaster and have the
postmaster assume that pointer to be valid.
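
As a sketch of what such a loop might look like (hypothetical
bookkeeping: handles[], nworkers, and work_remains() are assumed here,
not part of the patch):

    set_latch_on_sigusr1 = true;
    while (work_remains())
    {
        int     i;

        CHECK_FOR_INTERRUPTS();
        for (i = 0; i < nworkers; ++i)
        {
            int     pid = GetBackgroundWorkerPid(handles[i]);

            if (pid == 0)
                elog(LOG, "worker %d started and has already exited", i);
            else if (pid != InvalidPid)
                elog(DEBUG1, "worker %d is running as PID %d", i, pid);
        }
        WaitLatch(&MyProc->procLatch,
                  WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
        ResetLatch(&MyProc->procLatch);
    }
    set_latch_on_sigusr1 = false;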

I'm hopeful that this is about as much fiddling with the background
worker mechanism per se as will be needed for parallel query.  Once we
have this, I think the next hurdle will be designing suitable IPC
mechanisms, so that there's a relatively easy way for the "parent"
process to pass information to its background workers and vice versa.
 I expect the main tool for that to be a dynamic shared memory
facility; but I'm hoping that won't be closely tied to background
workers, though they may be heavy users of it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: dynamic background workers, round two

From
Andres Freund
Date:
Hi,

On 2013-07-24 12:46:21 -0400, Robert Haas wrote:
> The attached patch attempts to remedy this problem.  When you register
> a background worker, you can obtain a "handle" that can subsequently
> be used to query for the worker's PID.  If you additionally initialize
> bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1
> when worker startup has been attempted (successfully or not).  You can
> wait for this signal by passing your handle to
> WaitForBackgroundWorkerStartup(), which will return only when either
> (1) an attempt has been made to start the worker or (2) the postmaster
> is determined to have died.  This interface seems adequate for
> something like worker_spi, where it's useful to know whether the child
> was started or not (and return the PID if so) but that's about all
> that's really needed.

This seems like a sensible idea to me. But, in the context of parallel
query, don't we also need the reverse infrastructure of notifying a
bgworker that the client that requested it to be started has died?
Ending up with a dozen bgworkers running because the normal backend
FATALed doesn't seem nice.

> More complex notification interfaces can also be coded using the
> primitives introduced by this patch.  Setting bgw_notify_pid will
> cause the postmaster to send SIGUSR1 every time it starts the worker
> and every time the worker dies.  Every time you receive that signal,
> you can call GetBackgroundWorkerPid() for each background worker to
> find out which ones have started or terminated.  The only slight
> difficulty is in waiting for receipt of that signal, which I
> implemented by adding a new global called set_latch_on_sigusr1.
> WaitForBackgroundWorkerStartup() uses that flag internally, but
> there's nothing to prevent a caller from using it as part of their own
> event loop.  I find the set_latch_on_sigusr1 flag to be slightly
> ugly, but it seems better and far safer than having the postmaster try
> to actually set the latch itself - for example, it'd obviously be
> unacceptable to pass a Latch * to the postmaster and have the
> postmaster assume that pointer to be valid.

I am with you in finding it somewhat ugly.


> --- a/contrib/worker_spi/worker_spi.c
> +++ b/contrib/worker_spi/worker_spi.c

Btw, I've posted a minimal regression test for bworkers/worker_spi in
20130724175742.GD10713@alap2.anarazel.de - I'd like to see some coverage
of this...

> +    int         bgw_notify_pid;

It really should be pid_t even though we're not consistently doing that
in the code.



> diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
> index f7ebd1a..6414291 100644
> --- a/src/backend/commands/async.c
> +++ b/src/backend/commands/async.c
> @@ -207,8 +207,6 @@ typedef struct QueueBackendStatus
>      QueuePosition pos;            /* backend has read queue up to here */
>  } QueueBackendStatus;
>  

> -#define InvalidPid                (-1)
> -

> diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
> index edced29..0aa540a 100644
> --- a/src/include/miscadmin.h
> +++ b/src/include/miscadmin.h
> @@ -28,6 +28,8 @@
>  
>  #define PG_BACKEND_VERSIONSTR "postgres (PostgreSQL) " PG_VERSION "\n"
>  
> +#define InvalidPid                (-1)
> +

That value strikes me as a rather bad idea. As a pid that represents
init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
rather not spread that outside async.c.

> +/*
> + * Report the PID of a newly-launched background worker in shared memory.
> + *
> + * This function should only be called from the postmaster.
> + */
> +void
> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
> +{
> +    BackgroundWorkerSlot *slot;
> +
> +    Assert(rw->rw_shmem_slot < max_worker_processes);
> +    slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
> +    slot->pid = rw->rw_pid;
> +
> +    if (rw->rw_worker.bgw_notify_pid != 0)
> +        kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
> +}

Any reason not to use SendProcSignal() properly here? Just that you
don't know the BackendId? It seems unclean to reuse SIGUSR1 without using
the infrastructure built for that.

I first thought you didn't want to do it because it touches shmem,
but given that ReportBackgroundWorkerPID() does so itself...

> + * If handle != NULL, we'll set *handle to a pointer that can subsequently
> + * be used as an argument to GetBackgroundWorkerPid().  The caller can
> + * free this pointer using pfree(), if desired.
>   */
>  bool
> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
> +                                BackgroundWorkerHandle **handle)

I'd just let the caller provide the memory, but whatever ;)

> +/*
> + * Wait for a background worker to start up.
> + *
> + * The return value is the same as for GetBackgroundWorkerPID, except that
> + * we only return InvalidPid if the postmaster has died (and therefore we
> + * know that the worker will never be started).
> + */
> +int
> +WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle)
> +{
> +    int        pid;
> +    int        rc;
> +
> +    set_latch_on_sigusr1 = true;
> +
> +    PG_TRY();
> +    {

None of the containing code should be allowed to raise errors
afaics. Was there anything specific you were thinking of? Or just
defense against leaking set_latch_on_sigusr1 = true?

> +        for (;;)
> +        {
> +            pid = GetBackgroundWorkerPid(handle);
> +            if (pid != InvalidPid)
> +                break;
> +            rc = WaitLatch(&MyProc->procLatch,
> +                           WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> +            if (rc & WL_POSTMASTER_DEATH)
> +                return InvalidPid;

This basically ignores postmaster death. With unix_latch.c that
shouldn't be a problem because afaics its implementation should correctly
report an error in the next WaitLatch without waiting as well. I
am not sure about the win32 implementation, is that ok there as well?

Also, shouldn't we check for interrupts somewhere in the loop? This
looks like it would run forever if we start to wait while postmaster is
about to shut down, because afair postmaster won't start new children in
that case. But we won't notice that we've been asked to shut down.

The usual and documented pattern for latches is to do
 * for (;;)
 * {
 *         ResetLatch();
 *         if (work to do)
 *                 Do Stuff();
 *         WaitLatch();
 * }
any reason for differing here? I don't really see a problem with the
pattern used here, but ...

> +    set_latch_on_sigusr1 = false;

So, the set_latch_on_sigusr1 flag exists so MyProc->procLatch doesn't
get SetLatch'ed unless we're in WaitForBackgroundWorkerStartup()?

> +/*
> + * When a backend asks to be notified about worker state changes, we
> + * set a flag in its backend entry.  The background worker machinery needs
> + * to know when such backends exit.
> + */
> +bool
> +PostmasterMarkPIDForWorkerNotify(int pid)
> +{
> +    dlist_iter    iter;
> +    Backend    *bp;
> +
> +    dlist_foreach(iter, &BackendList)
> +    {
> +        bp = dlist_container(Backend, elem, iter.cur);
> +        if (bp->pid == pid)
> +        {
> +            bp->bgworker_notify = true;
> +            return true;
> +        }
> +    }
> +    return false;
> +}

Maybe we should have MyBackend?

Btw, you seem to want to support this in bgworkers started by a
bgworker. That's not going to work without some changes if the
"intermediate" bgworker is one without a backend since those don't use
procsignal_sigusr1_handler.

So, that's the first crop of things that came to mind. I'll sleep over
it, but so far I like it generally ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: dynamic background workers, round two

From
Michael Paquier
Date:



On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,

On 2013-07-24 12:46:21 -0400, Robert Haas wrote:
> The attached patch attempts to remedy this problem.  When you register
> a background worker, you can obtain a "handle" that can subsequently
> be used to query for the worker's PID.  If you additionally initialize
> bgw_notify_pid = MyProcPid, then the postmaster will send you SIGUSR1
> when worker startup has been attempted (successfully or not).  You can
> wait for this signal by passing your handle to
> WaitForBackgroundWorkerStartup(), which will return only when either
> (1) an attempt has been made to start the worker or (2) the postmaster
> is determined to have died.  This interface seems adequate for
> something like worker_spi, where it's useful to know whether the child
> was started or not (and return the PID if so) but that's about all
> that's really needed.

This seems like a sensible idea to me. But, in the context of parallel
query, don't we also need the reverse infrastructure of notifying a
bgworker that the client that requested it to be started has died?
Ending up with a dozen bgworkers running because the normal backend
FATALed doesn't seem nice.
Yes, this would be something necessary to have, but it should
definitely be a separate patch.  In order to satisfy that, you could
register for each worker the PID of the parent process that started it,
if it was started by another bgworker, and have the postmaster scan the
list of bgworkers that need to be stopped after the death of their
parent, if the parent was a bgworker.  By the way, the PID registered in
this case should not be bgw_notify_pid but something saved in
BackgroundWorkerSlot.  My 2c on that though, feel free to comment.
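
Roughly something like this, perhaps (a purely hypothetical sketch; the
parent_worker_pid field is invented here and not part of the patch):

    typedef struct BackgroundWorkerSlot
    {
        bool        in_use;
        pid_t       pid;
        pid_t       parent_worker_pid;  /* hypothetical: PID of the
                                         * bgworker that registered this
                                         * one, or 0 if none */
        BackgroundWorker worker;
    } BackgroundWorkerSlot;

The postmaster could then scan the slots when a bgworker dies and stop
any workers whose parent_worker_pid matches the dead worker.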

> More complex notification interfaces can also be coded using the
> primitives introduced by this patch.  Setting bgw_notify_pid will
> cause the postmaster to send SIGUSR1 every time it starts the worker
> and every time the worker dies.  Every time you receive that signal,
> you can call GetBackgroundWorkerPid() for each background worker to
> find out which ones have started or terminated.  The only slight
> difficulty is in waiting for receipt of that signal, which I
> implemented by adding a new global called set_latch_on_sigusr1.
> WaitForBackgroundWorkerStartup() uses that flag internally, but
> there's nothing to prevent a caller from using it as part of their own
event loop.  I find the set_latch_on_sigusr1 flag to be slightly
> ugly, but it seems better and far safer than having the postmaster try
> to actually set the latch itself - for example, it'd obviously be
> unacceptable to pass a Latch * to the postmaster and have the
postmaster assume that pointer to be valid.

I am with you in finding it somewhat ugly.
After a quick look at the code, yeah, using a boolean here looks like overkill.
--
Michael

Re: dynamic background workers, round two

From
Michael Paquier
Date:



On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> --- a/contrib/worker_spi/worker_spi.c
> +++ b/contrib/worker_spi/worker_spi.c

Btw, I've posted a minimal regression test for bworkers/worker_spi in
20130724175742.GD10713@alap2.anarazel.de - I'd like to see some coverage
of this...
I could not find this email ID in the archives. A link perhaps?
--
Michael

Re: dynamic background workers, round two

From
Andres Freund
Date:
On 2013-07-25 08:03:17 +0900, Michael Paquier wrote:
> On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > > --- a/contrib/worker_spi/worker_spi.c
> > > +++ b/contrib/worker_spi/worker_spi.c
> >
> > Btw, I've posted a minimal regression test for bworkers/worker_spi in
> > 20130724175742.GD10713@alap2.anarazel.de - I'd like to see some coverage
> > of this...
> >
> I could not find this email ID in the archives. A link perhaps?

Hm. Works for me:
http://www.postgresql.org/message-id/20130724175742.GD10713@alap2.anarazel.de

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: dynamic background workers, round two

From
Michael Paquier
Date:



On Thu, Jul 25, 2013 at 8:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2013-07-25 08:03:17 +0900, Michael Paquier wrote:
> On Thu, Jul 25, 2013 at 6:34 AM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > > --- a/contrib/worker_spi/worker_spi.c
> > > +++ b/contrib/worker_spi/worker_spi.c
> >
> > Btw, I've posted a minimal regression test for bworkers/worker_spi in
> > 20130724175742.GD10713@alap2.anarazel.de - I'd like to see some coverage
> > of this...
> >
> I could not find this email ID in the archives. A link perhaps?

Hm. Works for me:
http://www.postgresql.org/message-id/20130724175742.GD10713@alap2.anarazel.de
Oops. Thanks, the search field does not like it if spaces are added to the ID.
--
Michael

Re: dynamic background workers, round two

From
Robert Haas
Date:
On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> This seems like a sensible idea to me. But, in the context of parallel
> query, don't we also need the reverse infrastructure of notifying a
> bgworker that the client that requested it to be started has died?
> Ending up with a dozen bgworkers running because the normal backend
> FATALed doesn't seem nice.

I think that will be desirable for many, although not all, uses of
background workers.  For example, you might want to be able to do
something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
and then exit your session and have the background worker continue to
run.  Or maybe somebody writes a trigger that starts a background
worker (if there's not one already running) and queues up work for it,
and the background worker continues processing the queue until it is
empty, and then exits.

But for parallel query specifically, yeah, we'll probably want the
untimely demise of either the original backend or any of its workers
to result in the death of all the workers and an abort of the parent's
transaction.  However, there's no need for the postmaster to be
involved in any of that.  For example, once the worker process is up
and running and the parent has its PID, it can use an on_shmem_exit
hook or a PG_ENSURE_ERROR_CLEANUP() block to send SIGTERM to any
still-living workers.  In the other direction, the parent needs to
know not only whether or not the child has died after startup, but
most likely also needs to receive errors, notices, tuples, or other
data from the child.  If the parent receives an ERROR or FATAL
indication from a parallel-worker child, it should most likely kill
all other workers and then abort the transaction.  But in neither of
those cases does the postmaster need to be involved.

The reason why this particular mechanism is needed is because the
worker isn't able to report its own failure to start.  Only the
postmaster can do that.  Once both processes are up and running, it's
up to them to handle their own IPC needs.
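
For example, the parent-side cleanup might look something like this
(untested sketch; worker_handles[] and nworkers are assumed caller-side
bookkeeping, not part of the patch):

    /* SIGTERM any of my workers that are still alive. */
    static void
    terminate_my_workers(int code, Datum arg)
    {
        int     i;

        for (i = 0; i < nworkers; ++i)
        {
            int     pid = GetBackgroundWorkerPid(worker_handles[i]);

            /* InvalidPid = never started; 0 = already gone. */
            if (pid != InvalidPid && pid != 0)
                kill(pid, SIGTERM);
        }
    }

    static void
    setup_worker_cleanup(void)
    {
        on_shmem_exit(terminate_my_workers, (Datum) 0);
    }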

> It really should be pid_t even though we're not consistently doing that
> in the code.

Hmm, I thought we weren't using that type.  But I guess we are,
so I can go make this consistent.

>> +#define InvalidPid                           (-1)
>> +
>
> That value strikes me as a rather bad idea. As a pid that represents
> init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
> rather not spread that outside async.c.

Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
And if the postgres process has permission to send signals to init,
that's an OS bug (or a PostgreSQL bug, since we refuse to run as
root).  So I'm not very worried about it.  I've got two functions that
need distinguishable return values for the not-yet-started case and
the already-dead case, neither of which can be real PIDs.  If we don't
use 0 and -1, the alternative is presumably to complicate the calling
signature with another out parameter.  That does not seem like a net
improvement.

>> +/*
>> + * Report the PID of a newly-launched background worker in shared memory.
>> + *
>> + * This function should only be called from the postmaster.
>> + */
>> +void
>> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
>> +{
>> +     BackgroundWorkerSlot *slot;
>> +
>> +     Assert(rw->rw_shmem_slot < max_worker_processes);
>> +     slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
>> +     slot->pid = rw->rw_pid;
>> +
>> +     if (rw->rw_worker.bgw_notify_pid != 0)
>> +             kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
>> +}
>
> Any reason not to use SendProcSignal() properly here? Just that you
> don't know the BackendId? It seems unclean to reuse SIGUSR1 without using
> the infrastructure built for that.

We're in the postmaster here.  The postmaster currently uses kill
directly in all cases, rather than SendProcSignal(), and I'd rather
not change that.  Any code that the postmaster calls has got to be
safe against arbitrary shared memory corruption, and I'd rather keep
that footprint as small as possible.  For example, if we added
bgw_backend_id, the postmaster would have to bounds check it to make
sure it didn't seg fault.  The more of that kind of stuff we add, the
more chances there are to destabilize the postmaster.  I'd like to
keep it as minimal as possible.

>> + * If handle != NULL, we'll set *handle to a pointer that can subsequently
>> + * be used as an argument to GetBackgroundWorkerPid().  The caller can
>> + * free this pointer using pfree(), if desired.
>>   */
>>  bool
>> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
>> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
>> +                                BackgroundWorkerHandle **handle)
>
> I'd just let the caller provide the memory, but whatever ;)

It could, but then the structure couldn't be opaque.  That wouldn't be
a disaster but this provides better protection against future API
violations.
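
That is, roughly (the struct contents here are illustrative):

    /* In the header, callers see only an opaque declaration ... */
    typedef struct BackgroundWorkerHandle BackgroundWorkerHandle;

    /* ... while the definition lives in bgworker.c, so nothing outside
     * that file can poke at the slot number. */
    struct BackgroundWorkerHandle
    {
        int     slot;
    };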

> None of the containing code should be allowed to raise errors
> afaics. Was there anything specific you were thinking of? Or just
> defense against leaking set_latch_on_sigusr1 = true?

That's it exactly.

>> +             for (;;)
>> +             {
>> +                     pid = GetBackgroundWorkerPid(handle);
>> +                     if (pid != InvalidPid)
>> +                             break;
>> +                     rc = WaitLatch(&MyProc->procLatch,
>> +                                                WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
>> +                     if (rc & WL_POSTMASTER_DEATH)
>> +                             return InvalidPid;
>
> This basically ignores postmaster death. With unix_latch.c that
> shouldn't be a problem because afaics its implementation should correctly
> report an error in the next WaitLatch without waiting as well. I
> am not sure about the win32 implementation, is that ok there as well?

I don't understand what you mean by this.  Why does it ignore postmaster death?

> Also, shouldn't we check for interrupts somewhere in the loop? This
> looks like it would run forever if we start to wait while postmaster is
> about to shut down, because afair postmaster won't start new children in
> that case. But we won't notice that we've been asked to shut down.

Oh, right.  I meant to put a CHECK_FOR_INTERRUPTS() in that loop, but forgot.

> The usual and documented pattern for latches is to do
>  * for (;;)
>  * {
>  *         ResetLatch();
>  *         if (work to do)
>  *                 Do Stuff();
>  *         WaitLatch();
>  * }
> any reason for differing here? I don't really see a problem with the
> pattern used here, but ...

Yeah, that'd be fine, too.  But I actually think it's better to reset
the latch at the bottom of the loop, because that way if the condition
is already satisfied on first entering the loop, we can exit the loop
without resetting the latch.  That saves the next person who sets the
latch the trouble of sending a signal or self-pipe byte, because the
latch is already set.
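
In outline (with condition_is_satisfied() standing in for whatever the
caller is actually waiting for):

    for (;;)
    {
        if (condition_is_satisfied())
            break;          /* may leave the latch set; that's cheap */
        WaitLatch(&MyProc->procLatch,
                  WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
        ResetLatch(&MyProc->procLatch);
    }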

>> +     set_latch_on_sigusr1 = false;
>
> So, the set_latch_on_sigusr1 flag exists so MyProc->procLatch doesn't
> get SetLatch'ed unless we're in WaitForBackgroundWorkerStartup()?

Right.  Noah suggested to me that we might do this unconditionally,
which would certainly be simpler and cleaner.  I dunno if it'd risk
too many spurious wake-ups in any scenario, though.  I kind of doubt
it, but I don't know.

>> +/*
>> + * When a backend asks to be notified about worker state changes, we
>> + * set a flag in its backend entry.  The background worker machinery needs
>> + * to know when such backends exit.
>> + */
>> +bool
>> +PostmasterMarkPIDForWorkerNotify(int pid)
>> +{
>> +     dlist_iter      iter;
>> +     Backend    *bp;
>> +
>> +     dlist_foreach(iter, &BackendList)
>> +     {
>> +             bp = dlist_container(Backend, elem, iter.cur);
>> +             if (bp->pid == pid)
>> +             {
>> +                     bp->bgworker_notify = true;
>> +                     return true;
>> +             }
>> +     }
>> +     return false;
>> +}
>
> Maybe we should have MyBackend?

How would that help?  This code is run in the postmaster, so that we
only iterate through the workers and clear bgw_notify_pid when the
backend that has just exited had at some point appeared in a
bgw_notify_pid.

> Btw, you seem to want to support this in bgworkers started by a
> bgworker. That's not going to work without some changes if the
> "intermediate" bgworker is one without a backend since those don't use
> procsignal_sigusr1_handler.

Right.  I think it's OK for now to limit it to cases where the
intermediate bgworker has a backend.  If someone else finds that
restriction unacceptable, they can fix it.

> So, that's the first crop of things that came to mind. I'll sleep over
> it, but so far I like it generally ;)

Thanks for the quick review!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: dynamic background workers, round two

From
Andres Freund
Date:
On 2013-07-25 12:35:30 -0400, Robert Haas wrote:
> On Wed, Jul 24, 2013 at 5:34 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > This seems like a sensible idea to me. But, in the context of parallel
> > query, don't we also need the reverse infrastructure of notifying a
> > bgworker that the client that requested it to be started has died?
> > Ending up with a dozen bgworkers running because the normal backend
> > FATALed doesn't seem nice.
> 
> I think that will be desirable for many, although not all, uses of
> background workers.  For example, you might want to be able to do
> something like SELECT pg_background('CREATE INDEX CONCURRENTLY ...')
> and then exit your session and have the background worker continue to
> run.  Or maybe somebody writes a trigger that starts a background
> worker (if there's not one already working) and queues up work for the
> trigger, and the background worker continues processing the queue
> until it is empty, and then exits.

> But for parallel query specifically, yeah, we'll probably want the
> untimely demise of either the original backend or any of its workers
> to result in the death of all the workers and an abort of the parent's
> transaction.  However, there's no need for the postmaster to be
> involved in any of that.  For example, once the worker process is up
> and running and the parent has its PID, it can use an on_shmem_exit
> hook or a PG_ENSURE_ERROR_CLEANUP() block to send SIGTERM to any
> still-living workers.

It doesn't need to be the postmaster, but I think we need to provide
central infrastructure for that. I don't want this to end up being
redone poorly in multiple places.
I just wanted to mention it, it obviously doesn't need to be implemented
at the same time.

> >> +#define InvalidPid                           (-1)
> >> +
> >
> > That value strikes me as a rather bad idea. As a pid that represents
> > init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
> > rather not spread that outside async.c.
> 
> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> And if the postgres process has permission to send signals to init,
> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
> root).  So I'm not very worried about it.

I don't think somebody mistakenly kill()ing somebody with -1 is all too
likely, but it's a valid value to pass. That's why I dislike using it as
a constant for an invalid value.

> I've got two functions that
> need distinguishable return values for the not-yet-started case and
> the already-dead case, neither of which can be real PIDs.  If we don't
> use 0 and -1, the alternative is presumably to complicate the calling
> signature with another out parameter.  That does not seem like a net
> improvement.

I actually think those cases would be better served by an error code
mechanism which is extensible.

> >> +/*
> >> + * Report the PID of a newly-launched background worker in shared memory.
> >> + *
> >> + * This function should only be called from the postmaster.
> >> + */
> >> +void
> >> +ReportBackgroundWorkerPID(RegisteredBgWorker *rw)
> >> +{
> >> +     BackgroundWorkerSlot *slot;
> >> +
> >> +     Assert(rw->rw_shmem_slot < max_worker_processes);
> >> +     slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
> >> +     slot->pid = rw->rw_pid;
> >> +
> >> +     if (rw->rw_worker.bgw_notify_pid != 0)
> >> +             kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);
> >> +}
> >
> > Any reason not to use SendProcSignal() properly here? Just that you
> > don't know the BackendId? It seems unclean to reuse SIGUSR1 without using
> > the infrastructure built for that.
> 
> We're in the postmaster here.  The postmaster currently uses kill
> directly in all cases, rather than SendProcSignal(), and I'd rather
> not change that.  Any code that the postmaster calls has got to be
> safe against arbitrary shared memory corruption, and I'd rather keep
> that footprint as small as possible.  For example, if we added
> bgw_backend_id, the postmaster would have to bounds check it to make
> sure it didn't seg fault.  The more of that kind of stuff we add, the
> more chances there are to destabilize the postmaster.  I'd like to
> keep it as minimal as possible.

I can see the point in the argument, but it still seems to be grotty
architecturally.

> >> +             for (;;)
> >> +             {
> >> +                     pid = GetBackgroundWorkerPid(handle);
> >> +                     if (pid != InvalidPid)
> >> +                             break;
> >> +                     rc = WaitLatch(&MyProc->procLatch,
> >> +                                                WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> >> +                     if (rc & WL_POSTMASTER_DEATH)
> >> +                             return InvalidPid;
> >
> > This basically ignores postmaster death. With unix_latch.c that
> > shouldn't be a problem because afaics its implementation should correctly
> > report an error in the next WaitLatch without waiting as well. I
> > am not sure about the win32 implementation, is that ok there as well?
> 
> I don't understand what you mean by this.  Why does it ignore postmaster death?

The other callers that check for WL_POSTMASTER_DEATH actually perform
drastic measures like exit()ing upon postmaster death. Here you just
return. With the unix latch implementation it seems to be guaranteed
that the next WaitLatch() done somewhere else will also return
WL_POSTMASTER_DEATH. I am not sure whether it might actually "consume"
the death in the windows implementation.

> >> +     set_latch_on_sigusr1 = false;
> >
> > So, the set_latch_on_sigusr1 flag exists so MyProc->procLatch doesn't
> > get SetLatch'ed unless we're in WaitForBackgroundWorkerStartup()?
> 
> Right.  Noah suggested to me that we might do this unconditionally,
> which would certainly be simpler and cleaner.  I dunno if it'd risk
> too many spurious wake-ups in any scenario, though.  I kind of doubt
> it, but I don't know.

If we were to using the procsignal mechanism, I wouldn't worry about
doing it unconditionally, but signals for the other reasons SIGUSR1 is
sent might actually come in relatively frequently.

> >> +/*
> >> + * When a backend asks to be notified about worker state changes, we
> >> + * set a flag in its backend entry.  The background worker machinery needs
> >> + * to know when such backends exit.
> >> + */
> >> +bool
> >> +PostmasterMarkPIDForWorkerNotify(int pid)
> >> +{
> >> +     dlist_iter      iter;
> >> +     Backend    *bp;
> >> +
> >> +     dlist_foreach(iter, &BackendList)
> >> +     {
> >> +             bp = dlist_container(Backend, elem, iter.cur);
> >> +             if (bp->pid == pid)
> >> +             {
> >> +                     bp->bgworker_notify = true;
> >> +                     return true;
> >> +             }
> >> +     }
> >> +     return false;
> >> +}
> >
> > Maybe we should have MyBackend?
> 
> How would that help?  This code is run in the postmaster, so that we
> only iterate through the workers and clear bgw_notify_pid when the
> backend that has just exited had at some point appeared in a
> bgw_notify_pid.

Thinko.

> > Btw, you seem to want to support this in bgworkers started by a
> > bgworker. That's not going to work without some changes if the
> > "intermediate" bgworker is one without a backend since those don't use
> > procsignal_sigusr1_handler.

> Right.  I think it's OK for now to limit it to cases where the
> intermediate bgworker has a backend.  If someone else finds that
> restriction unacceptable, they can fix it.

I don't have a problem with the restriction, but I'd like to see a check
against it. Maybe check for MyBackendId != InvalidBackendId in
RegisterDynamicBackgroundWorker()? That would also prevent starting
further bgworkers before BackgroundWorkerInitializeConnection() is done
in a connected bgworker which seems to be a good thing.
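
Concretely, something like this at the top of
RegisterDynamicBackgroundWorker() (sketch; the message wording is just a
suggestion):

    if (MyBackendId == InvalidBackendId)
        elog(ERROR, "dynamic background workers can only be registered from a backend");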

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: dynamic background workers, round two

From
Robert Haas
Date:
On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It doesn't need to be the postmaster, but I think we need to provide
> central infrastructure for that. I don't want this to end up being
> redone poorly in multiple places.
> I just wanted to mention it, it obviously doesn't need to be implemented
> at the same time.

OK, makes sense.  I'm still working out a design for this part of it;
I think I want to get the dynamic shared memory stuff working first.

>> >> +#define InvalidPid                           (-1)
>> >> +
>> >
>> > That value strikes me as a rather bad idea. As a pid that represents
>> > init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
>> > rather not spread that outside async.c.
>>
>> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
>> And if the postgres process has permission to send signals to init,
>> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
>> root).  So I'm not very worried about it.
>
> I don't think somebody mistakenly kill()ing somebody with -1 is all too
> likely, but it's a valid value to pass. That's why I dislike using it as
> a constant for an invalid value.

Well, so is any negative number, but we should never be trying to kill
process groups.

>> I've got two functions that
>> need distinguishable return values for the not-yet-started case and
>> the already-dead case, neither of which can be real PIDs.  If we don't
>> use 0 and -1, the alternative is presumably to complicate the calling
>> signature with another out parameter.  That does not seem like a net
>> improvement.
>
> I actually think those cases would be better served by an error code
> mechanism which is extensible.

Hmm.  What signature would you suggest?

> I can see the point in the argument, but it still seems to be grotty
> architecturally.

Suck it up.  :-)

I considered the idea of using SIGUSR2, but it didn't seem worth it.

>> I don't understand what you mean by this.  Why does it ignore postmaster death?
>
> The other callers that check for WL_POSTMASTER_DEATH actually perform
> drastic measures like exit()ing upon postmaster death. Here you just
> return. With the unix latch implementation it seems to be guaranteed
> that the next WaitLatch() done somewhere else will also return
> WL_POSTMASTER_DEATH. I am not sure whether it might actually "consume"
> the death in the windows implementation.

Well, IMHO, every backend should exit as quickly as possible upon
discovering that the postmaster has died.  However, for complex
political reasons (ah-Tom-Lane-choo!), our policy is to have the
auxiliary processes exit and regular backends soldier on.  This seems
completely stupid to me, but it's not this patch's job to reverse that
policy decision.

To put that another way, one could equally well ask why WaitLatch(),
or for that matter PostmasterIsAlive(), doesn't contain ereport(FATAL)
upon discovering that the postmaster has given up the ghost.  And the
answer is that it's the job of those functions to provide information,
not make policy decisions.

> I don't have a problem with the restriction, but I'd like to see a check
> against it. Maybe check for MyBackendId != InvalidBackendId in
> RegisterDynamicBackgroundWorker()? That would also prevent starting
> further bgworkers before BackgroundWorkerInitializeConnection() is done
> in a connected bgworker which seems to be a good thing.

Well, that's easy enough to fix.  Should we Assert() or elog() or what?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: dynamic background workers, round two

From
Andres Freund
Date:
[sent again, previously sent as reply, instead of reply-all, thanks
Robert]

On 2013-08-09 09:09:08 -0400, Robert Haas wrote:
> On Fri, Jul 26, 2013 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jul 26, 2013 at 5:34 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> It doesn't need to be the postmaster, but I think we need to provide
> >> central infrastructure for that. I don't want this to end up being
> >> redone poorly in multiple places.
> >> I just wanted to mention it, it obviously doesn't need to be implemented
> >> at the same time.
> >
> > OK, makes sense.  I'm still working out a design for this part of it;
> > I think I want to get the dynamic shared memory stuff working first.

Makes sense. I just wanted to mention it as something we should keep in
the back of our minds while working on this.

> >>> >> +#define InvalidPid                           (-1)
> >>> >> +
> >>> >
> >>> > That value strikes me as a rather bad idea. As a pid that represents
> >>> > init's processgroup. I.e. kill(InvalidPid) will try to kill init... I'd
> >>> > rather not spread that outside async.c.
> >>>
> >>> Well, if we ever try to kill(InvalidPid), that's a PostgreSQL bug.
> >>> And if the postgres process has permission to send signals to init,
> >>> that's an OS bug (or a PostgreSQL bug, since we refuse to run as
> >>> root).  So I'm not very worried about it.
> >>
> >> I don't think somebody mistakenly kill()ing somebody with -1 is all too
> >> likely, but it's a valid value to pass. That's why I dislike using it as
> >> a constant for an invalid value.
> >
> > Well, so is any negative number, but we should never be trying to kill
> > process groups.

Maybe it's just my dislike for constants that are easy to misuse - even
if it's not really relevant for the specific case.

> >>> I've got two functions that
> >>> need distinguishable return values for the not-yet-started case and
> >>> the already-dead case, neither of which can be real PIDs.  If we don't
> >>> use 0 and -1, the alternative is presumably to complicate the calling
> >>> signature with another out parameter.  That does not seem like a net
> >>> improvement.
> >>
> >> I actually think those cases would be better served by an error code
> >> mechanism which is extensible.
> >
> > Hmm.  What signature would you suggest?

so, the two functions are:
* int GetBackgroundWorkerPid(BackgroundWorkerHandle *handle);
* int WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle);

The return value means about the same for both:
* If the return value is InvalidPid, it indicates that the worker has not yet been started by the postmaster.
* If the return value is 0, it indicates that the worker was started by the postmaster but is no longer running.
* Any other return value is the PID of the running worker.

So, I'd suggest something like:

typedef enum BgwHandleStatus {
   BGWH_SUCCESS, /* successfully got status */
   BGWH_NOT_YET, /* worker hasn't started yet */
   BGWH_GONE, /* worker had been started, but shut down already */
   BGWH_POSTMASTER_DIED /* well, there you go */
} BgwHandleStatus;


BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pid);
BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);
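
A caller would then do something like (sketch, using the names above):

    static pid_t
    launch_and_wait(BackgroundWorkerHandle *handle)
    {
        BgwHandleStatus status;
        pid_t       pid;

        status = WaitForBackgroundWorkerStartup(handle, &pid);
        if (status == BGWH_POSTMASTER_DIED)
            ereport(ERROR,
                    (errmsg("postmaster died before worker could be started")));
        if (status == BGWH_GONE)
            ereport(ERROR,
                    (errmsg("worker started but already exited")));
        Assert(status == BGWH_SUCCESS);
        return pid;
    }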

> >> I can see the point in the argument, but it still seems to be grotty
> >> architecturally.
> >
> > Suck it up.  :-)

That I am able to do ;) Holidays help :P

> > I considered the idea of using SIGUSR2, but it didn't seem worth it.
> >
> >>> I don't understand what you mean by this.  Why does it ignore postmaster death?
> >>
> >> The other callers that check for WL_POSTMASTER_DEATH actually perform
> >> drastic measures like exit()ing upon postmaster death. Here you just
> >> return. With the unix latch implementation it seems to be guaranteed
> >> that the next WaitLatch() done somewhere else will also return
> >> WL_POSTMASTER_DEATH. I am not sure whether it might actually "consume"
> >> the death in the windows implementation.
> >
> > Well, IMHO, every backend should exit as quickly as possible upon
> > discovering that the postmaster has died.  However, for complex
> > political reasons (ah-Tom-Lane-choo!), our policy is to have the
> > auxiliary processes exit and regular backends soldier on.  This seems
> > completely stupid to me, but it's not this patch's job to reverse that
> > policy decision.

While I'd phrase it more politely/complex, agreed. And agreed with not
changing policy here.

I am not trying to argue for a different behaviour, but basically
wondering whether we need something like "LatchRearmPostmasterDeath()"
for the benefit of the windows implementation.

Maybe I am just dumb or years of not really touching windows got to me,
but I find the documentation around windows specific stuff utterly
lacking. There doesn't seem to be any comment on how the windows signal
queue stuff works on a high level.

> > To put that another way, one could equally well ask why WaitLatch(),
> > or for that matter PostmasterIsAlive(), doesn't contain ereport(FATAL)
> > upon discovering that the postmaster has given up the ghost.  And the
> > answer is that it's the job of those functions to provide information,
> > not make policy decisions.

No, there's actually good reasons for that imo. We e.g. want to be able
to have different handling in the wal writer. Or possibly at some later
point in some special background worker that sends keepalives and wants
to send an explicit message about this.

> >> I don't have a problem with the restriction, but I'd like to see a check
> >> against it. Maybe check for MyBackendId != InvalidBackendId in
> >> RegisterDynamicBackgroundWorker()? That would also prevent starting
> >> further bgworkers before BackgroundWorkerInitializeConnection() is done
> >> in a connected bgworker which seems to be a good thing.
> >
> > Well, that's easy enough to fix.  Should we Assert() or elog() or
> > what?

Some things I suggest actually are easy. Nice ;). I'd vote for elog(),
there's not much cost associated with it and a better error message is
good.

Greetings,

Andres Freund



Re: dynamic background workers, round two

From
Robert Haas
Date:
On Sun, Aug 11, 2013 at 1:31 AM, Andres Freund <andres@anarazel.de> wrote:
> So, I'd suggest something like:
>
> typedef enum BgwHandleStatus {
>    BGWH_SUCCESS, /* successfully got status */
>    BGWH_NOT_YET, /* worker hasn't started yet */
>    BGWH_GONE, /* worker had been started, but shut down already */
>    BGWH_POSTMASTER_DIED /* well, there you go */
> } BgwHandleStatus;
>
>
> BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pid);
> BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);

OK, here's a patch with that API.  I renamed the constants a bit, because a
process that has stopped is not necessarily gone; it could be
configured for restart.  But we can say that it is stopped, at the
moment.
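
So the enum now reads, roughly:

    typedef enum BgwHandleStatus
    {
        BGWH_STARTED,           /* worker is running */
        BGWH_NOT_YET_STARTED,   /* worker hasn't been started yet */
        BGWH_STOPPED,           /* worker has exited */
        BGWH_POSTMASTER_DIED    /* postmaster died; worker status unclear */
    } BgwHandleStatus;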

I'm not sure that this API is an improvement.  But I think it's OK, if
you prefer it.

...Robert

Attachment

Re: dynamic background workers, round two

From
Andres Freund
Date:
Hi Robert,

On 2013-08-17 12:08:17 -0400, Robert Haas wrote:
> On Sun, Aug 11, 2013 at 1:31 AM, Andres Freund <andres@anarazel.de> wrote:
> > So, I'd suggest something like:
> >
> > typedef enum BgwHandleStatus {
> >    BGWH_SUCCESS, /* successfully got status */
> >    BGWH_NOT_YET, /* worker hasn't started yet */
> >    BGWH_GONE, /* worker had been started, but shut down already */
> >    BGWH_POSTMASTER_DIED /* well, there you go */
> > } BgwHandleStatus;
> >
> >
> > BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pid);
> > BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);
> 
> OK, here's a patch with that API.  I renamed the constants a bit, because a
> process that has stopped is not necessarily gone; it could be
> configured for restart.  But we can say that it is stopped, at the
> moment.
> 
> I'm not sure that this API is an improvement.  But I think it's OK, if
> you prefer it.

Thanks, looks more readable to me. I like code that tells me what it
does without having to look in other places ;)

> +  <function>RegisterDynamicBackgroundWorker(<type>BackgroundWorker
> +  *worker, BackgroundWorkerHandle **handle</type>)</function>.  Unlike
> +  <function>RegisterBackgroundWorker</>, which can only be called from within
> +  the postmaster, <function>RegisterDynamicBackgroundWorker</function> must be
> +  called from a regular backend.
>   </para>

s/which can only be called from within the postmaster/during initial
startup, via shared_preload_libraries/?

>    <para>
> +   When a background worker is registered using the
> +   <function>RegisterDynamicBackgroundWorker</function> function, it is
> +   possible for the backend performing the registration to obtain information
> +   the status of the worker.  Backends wishing to do this should pass the
> +   address of a <type>BackgroundWorkerHandle *</type> as the second argument
> +   to <function>RegisterDynamicBackgroundWorker</function>.  If the worker
> +   is successfully registered, this pointer will be initialized with an
> +   opaque handle that can subsequently be passed to
> +   <function>GetBackgroundWorkerPid(<parameter>BackgroundWorkerHandle *</parameter>, <parameter>pid_t
*</parameter>)</function>.
> +   This function can be used to poll the status of the worker: a return
> +   value of <literal>BGWH_NOT_YET_STARTED</> indicates that the worker has not
> +   yet been started by the postmaster; <literal>BGWH_STOPPED</literal>
> +   indicates that it has been started but is no longer running; and
> +   <literal>BGWH_STATED</literal> indicates that it is currently running.
> +   In this last case, the PID will also be returned via the second argument.
> +  </para>

s/STATED/STARTED/

> diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
> index f7ebd1a..6414291 100644
> --- a/src/backend/commands/async.c
> +++ b/src/backend/commands/async.c

> -#define InvalidPid                (-1)
> -

Hrmpf ;)

>  bool
> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
> +                                BackgroundWorkerHandle **handle)
>  {

Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
BgWorkerStart_PostmasterStart here, which doesn't make sense...

> +/*
> + * Get the PID of a dynamically-registered background worker.
> + *
> + * If the return value is InvalidPid, it indicates that the worker has not
> + * yet been started by the postmaster.
> + *
> + * If the return value is 0, it indicates that the worker was started by
> + * the postmaster but is no longer running.
> + *
> + * Any other return value is the PID of the running worker.
> + */
> +BgwHandleStatus
> +GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
> +{
> +    BackgroundWorkerSlot *slot;
> +    pid_t    pid;
> +
> +    slot = &BackgroundWorkerData->slot[handle->slot];

Maybe add something like Assert(handle->slot < max_worker_processes);?

> +/*
> + * Wait for a background worker to start up.
> + *
> + * The return value is the same as for GetBackgroundWorkerPID, except that
> + * we only return InvalidPid if the postmaster has died (and therefore we
> + * know that the worker will never be started).
> + */
> +BgwHandleStatus
> +WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pidp)
> +{
> +    BgwHandleStatus    status;
> +    pid_t    pid;
> +    int        rc;
> +
> +    set_latch_on_sigusr1 = true;
> +
> +    PG_TRY();
> +    {
> +        for (;;)
> +        {
> +            status = GetBackgroundWorkerPid(handle, &pid);
> +            if (status != BGWH_NOT_YET_STARTED)
> +                break;
> +            rc = WaitLatch(&MyProc->procLatch,
> +                           WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
> +            if (rc & WL_POSTMASTER_DEATH)
> +            {
> +                status = BGWH_POSTMASTER_DIED;
> +                break;
> +            }
> +            ResetLatch(&MyProc->procLatch);
> +        }
> +    }
> +    PG_CATCH();
> +    {
> +        set_latch_on_sigusr1 = false;
> +        PG_RE_THROW();
> +    }
> +    PG_END_TRY();
> +
> +    set_latch_on_sigusr1 = false;
> +    *pidp = pid;
> +    return status;
> +}

Shouldn't that loop have a CHECK_FOR_INTERRUPTS()?

Theoretically this would unset set_latch_on_sigusr1 if it was previously
already set to 'true'. If you feel defensive you could save the previous
value...

So, besides those I don't see much left to be done. I haven't tested
EXEC_BACKEND, but I don't see anything specific to that here.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: dynamic background workers, round two

From
Andres Freund
Date:
On 2013-07-26 08:50:51 -0400, Robert Haas wrote:
> > > > Btw, you seem to want to support this in bgworkers started by a
> > > > bgworker. That's not going to work without some changes if the
> > > > "intermediate" bgworker is one without a backend since those don't use
> > > > procsignal_sigusr1_handler.

> > > Right.  I think it's OK for now to limit it to cases where the
> > > intermediate bgworker has a backend.  If someone else finds that
> > > restriction unacceptable, they can fix it.

> > I don't have a problem with the restriction, but I'd like to see a check
> > against it. Maybe check for MyBackendId != InvalidBackendId in
> > RegisterDynamicBackgroundWorker()? That would also prevent starting
> > further bgworkers before BackgroundWorkerInitializeConnection() is done
> > in a connected bgworker which seems to be a good thing.
> 
> Well, that's easy enough to fix.  Should we Assert() or elog() or
> what?

I think that's not in the patch yet either.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: dynamic background workers, round two

From
Robert Haas
Date:
On Tue, Aug 27, 2013 at 9:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > BgwHandleStatus GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pid);
>> > BgwHandleStatus WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pid);
>>
>> OK, here's a patch with that API.  I renamed the constants a bit, because a
>> process that has stopped is not necessarily gone; it could be
>> configured for restart.  But we can say that it is stopped, at the
>> moment.
>>
>> I'm not sure that this API is an improvement.  But I think it's OK, if
>> you prefer it.
>
> Thanks, looks more readable to me. I like code that tells me what it
> does without having to look in other places ;)

Well, fair enough.  But you might have to look around a bit to figure
out that you now have two functions each of which can return 3 out of
a possible 4 values.  But whatever.

>> +  <function>RegisterDynamicBackgroundWorker(<type>BackgroundWorker
>> +  *worker, BackgroundWorkerHandle **handle</type>)</function>.  Unlike
>> +  <function>RegisterBackgroundWorker</>, which can only be called from within
>> +  the postmaster, <function>RegisterDynamicBackgroundWorker</function> must be
>> +  called from a regular backend.
>>   </para>
>
> s/which can only be called from within the postmaster/during initial
> startup, via shared_preload_libraries/?

This seems like a possible doc clarification not intimately related to
the patch at hand.  The only reason that paragraph is getting reflowed
is because of the additional argument to
RegisterDynamicBackgroundWorker().  On the substance, I could go
either way on whether your proposed text is better than what's there
now.  It seems like it might need a bit of wordsmithing, anyway.

> s/STATED/STARTED/

Good catch.

>>  bool
>> -RegisterDynamicBackgroundWorker(BackgroundWorker *worker)
>> +RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
>> +                                                             BackgroundWorkerHandle **handle)
>>  {
>
> Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
> BgWorkerStart_PostmasterStart here, which doesn't make sense...

I can add a check for that.  I agree that it's a separate patch.

> Maybe add something like Assert(handle->slot < max_worker_processes);?

Sure.

> Shouldn't that loop have a CHECK_FOR_INTERRUPTS()?

Yep.

>
> Theoretically this would unset set_latch_on_sigusr1 if it was previously
> already set to 'true'. If you feel defensive you could save the previous
> value...

Probably a good plan.

> So, besides those I don't see much left to be done. I haven't tested
> EXEC_BACKEND, but I don't see anything specific to that here.

I certainly can't promise that the code is bug-free.  But I think it's
probably better to get this into the tree and let people start playing
around with it than to continue to maintain it in my private sandbox.
At this point it's just infrastructure anyway, though there seems to
be a good deal of interest in it from more than just myself...  and I
do think it's a useful step forward from where we are today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: dynamic background workers, round two

From
Robert Haas
Date:
On Wed, Aug 28, 2013 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I certainly can't promise that the code is bug-free.  But I think it's
> probably better to get this into the tree and let people start playing
> around with it than to continue to maintain it in my private sandbox.
> At this point it's just infrastructure anyway, though there seems to
> be a good deal of interest in it from more than just myself...  and I
> do think it's a useful step forward from where we are today.

Committed with fixes for the issues you mentioned, plus some updates
to out-of-date comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: dynamic background workers, round two

From
Robert Haas
Date:
On Wed, Aug 28, 2013 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Hm. Not this patch's fault, but we seem to allow bgw_start_time ==
>> BgWorkerStart_PostmasterStart here, which doesn't make sense...
>
> I can add a check for that.  I agree that it's a separate patch.

On third thought, is there really any point to such a check?  I mean,
no background worker is going to start before it's registered, and by
the time any background worker is registered, we'll be past the time
indicated by all of those constants: BgWorkerStart_PostmasterStart,
BgWorkerStart_ConsistentState, BgWorkerStart_RecoveryFinished.

I think we should view that field as fixing the earliest time at which
the worker should be started, rather than the exact time at which it
must be started.  Otherwise no value is sensible.  And if we take that
approach, then for a dynamic background worker, any value is OK.

...Robert



Re: dynamic background workers, round two

From
Andres Freund
Date:
On 2013-08-28 14:04:59 -0400, Robert Haas wrote:
> >> +  <function>RegisterDynamicBackgroundWorker(<type>BackgroundWorker
> >> +  *worker, BackgroundWorkerHandle **handle</type>)</function>.  Unlike
> >> +  <function>RegisterBackgroundWorker</>, which can only be called from within
> >> +  the postmaster, <function>RegisterDynamicBackgroundWorker</function> must be
> >> +  called from a regular backend.
> >>   </para>
> >
> > s/which can only be called from within the postmaster/during initial
> > startup, via shared_preload_libraries/?
> 
> This seems like a possible doc clarification not intimately related to
> the patch at hand.  The only reason that paragraph is getting reflowed
> is because of the additional argument to
> RegisterDynamicBackgroundWorker().  On the substance, I could go
> either way on whether your proposed text is better than what's there
> now.

I agree it's unrelated. I just noticed because it was in the diff ;)

> It seems like it might need a bit of wordsmithing, anyway.

Yes, looks like it could use some. It's not going to be me doing it
tho...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services