Thread: ParallelContexts can get confused about which worker is which

ParallelContexts can get confused about which worker is which

From
Robert Haas
Date:
While testing last night, I discovered a serious case of brain fade in
parallel.c; the same conceptual mistake has also spread to
nodeGather.c.  parallel.c creates an array of ParallelWorkerInfo
structures, which are defined like this:

typedef struct ParallelWorkerInfo
{       BackgroundWorkerHandle *bgwhandle;       shm_mq_handle *error_mqh;       int32           pid;
} ParallelWorkerInfo;

These structures are typically accessed as pcxt->worker[i].  The
trouble is that pcxt->worker[i].bgwhandle is the bgwhandler for the
i'th worker to be registered, while pcxt->worker[i].error_mqh is the
error_mqh for the i'th worker to attach to the dynamic shared memory
segment.  But those might be totally different processes.

This happens to mostly work, because the postmaster will probably
start the processes in the same order that they are registered, and
the operating system will probably schedule them in the same order the
postmaster starts them.  And if you only have one worker, then you're
fine!  It also seems to work out that if all workers exit cleanly, any
shuffling of the background worker handles relative to the error queue
handles is harmless.  But it's still pretty broken.

There seem to be two ways to fix this.  One is to keep the bgwhandle
objects in a separate array from the error_mqh objects and reconstruct
after the fact what the mapping between those two sets of indexes is.
This solution looks complicated to code and generally pretty annoying
to get right.  The other way to fix this is to pass down the index
that the leader assigns to any given worker, and have the worker use
that index instead of allocating its own separate index after
connecting to the DSM segment.  Unfortunately, there's not really a
good way to pass that additional information down to the worker right
now, but we could fix that pretty easily by adding an additional field
to the BackgroundWorker structure, which the worker would then be able
to access via MyBgworkerEntry.  I suggest something like this:

--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -74,6 +74,7 @@ typedef enum#define BGW_DEFAULT_RESTART_INTERVAL   60#define BGW_NEVER_RESTART
     -1#define BGW_MAXLEN                                             64
 
+#define BGW_EXTRALEN                                   128
typedef struct BackgroundWorker{
@@ -85,6 +86,7 @@ typedef struct BackgroundWorker       char            bgw_library_name[BGW_MAXLEN];   /* only if
bgw_main is NULL */       char            bgw_function_name[BGW_MAXLEN];  /* only if
bgw_main is NULL */       Datum           bgw_main_arg;
+       char            bgw_extra[BGW_EXTRALEN];       pid_t           bgw_notify_pid; /* SIGUSR1 this backend on
start/stop*/} BackgroundWorker;
 

The ability to pass down a little more information from the
registering process to the background worker seems like it would be
useful for more than just parallelism, so I imagine this change might
be generally welcomed.  Parallel workers would use the first four
bytes of bgw_extra to store the worker index, and other users of the
background worker facility could use it for whatever they like.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ParallelContexts can get confused about which worker is which

From
Robert Haas
Date:
On Fri, Oct 30, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The other way to fix this is to pass down the index
> that the leader assigns to any given worker, and have the worker use
> that index instead of allocating its own separate index after
> connecting to the DSM segment.  Unfortunately, there's not really a
> good way to pass that additional information down to the worker right
> now, but we could fix that pretty easily by adding an additional field
> to the BackgroundWorker structure, which the worker would then be able
> to access via MyBgworkerEntry.

Here's a patch implementing that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ParallelContexts can get confused about which worker is which

From
Amit Kapila
Date:
On Sat, Oct 31, 2015 at 11:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Oct 30, 2015 at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > The other way to fix this is to pass down the index
> > that the leader assigns to any given worker, and have the worker use
> > that index instead of allocating its own separate index after
> > connecting to the DSM segment.  Unfortunately, there's not really a
> > good way to pass that additional information down to the worker right
> > now, but we could fix that pretty easily by adding an additional field
> > to the BackgroundWorker structure, which the worker would then be able
> > to access via MyBgworkerEntry.
>
> Here's a patch implementing that.
>

If we are going to add a new parameter to BackgroundWorker structure,
then the same needs to be updated in docs [1] as well.  I think adding
a new parameter to this structure might require some updations in
client applications. It seems worth to add a note for the same in commit
message, so that same could be reflected in Release Notes.

Also, I don't know why BGW_EXTRALEN needs to be 128 and not 64?
I guess you kept it so that in future if we need to pass more information,
then the same could be used which seems reasonable considering that
this is an exposed structure and we don't want to change it again.

/* Mutex protects remaining fields. */
slock_t mutex;


/* Maximum XactLastRecEnd of any worker. */
XLogRecPtr last_xlog_end;

} FixedParallelState;

Now as above mutex will be used to just protect last_xlog_end, do you
think it is better to modify the comment above mutex declaration?

Re: ParallelContexts can get confused about which worker is which

From
Robert Haas
Date:
On Sun, Nov 1, 2015 at 1:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> If we are going to add a new parameter to BackgroundWorker structure,
> then the same needs to be updated in docs [1] as well.

Right, good point.

> I think adding
> a new parameter to this structure might require some updations in
> client applications. It seems worth to add a note for the same in commit
> message, so that same could be reflected in Release Notes.

Client background worker modules need to be recompiled so that they
know about the new structure size, but they shouldn't need any code
changes.

> Also, I don't know why BGW_EXTRALEN needs to be 128 and not 64?
> I guess you kept it so that in future if we need to pass more information,
> then the same could be used which seems reasonable considering that
> this is an exposed structure and we don't want to change it again.

Well, I only need 4 bytes for this purpose, but I figured other users
of the background worker facility might like to have a little more.  I
don't think 64 vs. 128 matters very much; I'm happy to hear other
opinions on this topic.  What I think mostly matters is that we don't
make it so much that the overhead of copying it would add
substantially to background worker startup times.  I assume that
either 64 or 128 is safe in that regard.

> /* Mutex protects remaining fields. */
> slock_t mutex;
>
>
> /* Maximum XactLastRecEnd of any worker. */
> XLogRecPtr last_xlog_end;
>
> } FixedParallelState;
>
> Now as above mutex will be used to just protect last_xlog_end, do you
> think it is better to modify the comment above mutex declaration?

Meh.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ParallelContexts can get confused about which worker is which

From
Robert Haas
Date:
On Mon, Nov 2, 2015 at 4:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 1, 2015 at 1:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If we are going to add a new parameter to BackgroundWorker structure,
>> then the same needs to be updated in docs [1] as well.
>
> Right, good point.

Here's an updated patch with documentation added and two bugs fixed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: ParallelContexts can get confused about which worker is which

From
Amit Kapila
Date:
On Tue, Nov 3, 2015 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 2, 2015 at 4:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sun, Nov 1, 2015 at 1:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> If we are going to add a new parameter to BackgroundWorker structure,
> >> then the same needs to be updated in docs [1] as well.
> >
> > Right, good point.
>
> Here's an updated patch with documentation added and two bugs fixed.
>

Patch looks good to me.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: ParallelContexts can get confused about which worker is which

From
Robert Haas
Date:
On Tue, Nov 3, 2015 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 3, 2015 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Nov 2, 2015 at 4:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > On Sun, Nov 1, 2015 at 1:11 AM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >> If we are going to add a new parameter to BackgroundWorker structure,
>> >> then the same needs to be updated in docs [1] as well.
>> >
>> > Right, good point.
>>
>> Here's an updated patch with documentation added and two bugs fixed.
>>
>
> Patch looks good to me.

OK, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company