Re: Parallel Full Hash Join - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Parallel Full Hash Join
Msg-id CAAKRu_YX8PdxHbqV4wEk6v-ivMTwN+tcYG8AtTe3HT3gz=ewmA@mail.gmail.com
In response to Re: Parallel Full Hash Join  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Parallel Full Hash Join
List pgsql-hackers
On Fri, Nov 26, 2021 at 3:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Nov 21, 2021 at 4:48 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Wed, Nov 17, 2021 at 01:45:06PM -0500, Melanie Plageman wrote:
> > > Yes, this looks like that issue.
> > >
> > > I've attached a v8 set with the fix I suggested in [1] included.
> > > (I added it to 0001).
> >
> > This is still crashing :(
> > https://cirrus-ci.com/task/6738329224871936
> > https://cirrus-ci.com/task/4895130286030848
>
> I added a core file backtrace to cfbot's CI recipe a few days ago, so
> now we have:
>
> https://cirrus-ci.com/task/5676480098205696
>
> #3 0x00000000009cf57e in ExceptionalCondition (conditionName=0x29cae8
> "BarrierParticipants(&accessor->shared->batch_barrier) == 1",
> errorType=<optimized out>, fileName=0x2ae561 "nodeHash.c",
> lineNumber=lineNumber@entry=2224) at assert.c:69
> No locals.
> #4 0x000000000071575e in ExecParallelScanHashTableForUnmatched
> (hjstate=hjstate@entry=0x80a60a3c8,
> econtext=econtext@entry=0x80a60ae98) at nodeHash.c:2224

I believe this assert can be safely removed.

It is possible for a worker to attach to the batch barrier after the
"last" worker was elected to scan and emit unmatched inner tuples. This
is safe because the batch barrier is already in phase PHJ_BATCH_SCAN
and this newly attached worker will simply detach from the batch
barrier and look for a new batch to work on.

The order of events would be as follows:

W1: advances batch to PHJ_BATCH_SCAN
W2: attaches to batch barrier in ExecParallelHashJoinNewBatch()
W1: calls ExecParallelScanHashTableForUnmatched() (2 workers attached to
barrier at this point)
W2: detaches from the batch barrier
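
To make that ordering concrete, here is a toy, single-threaded replay of
the four steps. It is only a sketch: the ToyBarrier struct and every
toy_* / TOY_* name are made up for illustration and are not PostgreSQL's
Barrier API or the code in nodeHash.c. It assumes a barrier can be
modelled as just a phase plus a participant count, which is enough to
show why "participants == 1" need not hold when the unmatched scan
starts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phases; the names echo the email, the values are arbitrary. */
typedef enum { TOY_BATCH_PROBE, TOY_BATCH_SCAN } ToyPhase;

/* Hypothetical stand-in for a batch barrier: a phase and a head count. */
typedef struct
{
    ToyPhase phase;
    int      participants;
} ToyBarrier;

/* Attach to the barrier and report the phase observed at attach time. */
static ToyPhase
toy_attach(ToyBarrier *b)
{
    b->participants++;
    return b->phase;
}

/* Detach from the barrier. */
static void
toy_detach(ToyBarrier *b)
{
    assert(b->participants > 0);
    b->participants--;
}

/*
 * The last attached worker advances the batch to the scan phase and is
 * elected to emit unmatched tuples; anyone else just detaches.
 */
static bool
toy_arrive_and_elect_scanner(ToyBarrier *b)
{
    if (b->participants == 1)
    {
        b->phase = TOY_BATCH_SCAN;
        return true;
    }
    b->participants--;
    return false;
}

/*
 * Replay the order of events listed above. Returns the participant count
 * W1 would see on entering the unmatched scan; *after receives the count
 * once W2 has detached again.
 */
static int
toy_replay_late_attach(int *after)
{
    ToyBarrier b = {TOY_BATCH_PROBE, 0};
    bool       w1_scans;
    ToyPhase   w2_phase;
    int        seen_by_w1;

    toy_attach(&b);                              /* W1 is working on the batch */
    w1_scans = toy_arrive_and_elect_scanner(&b); /* W1: advances to scan phase */
    (void) w1_scans;                             /* true in this replay */

    w2_phase = toy_attach(&b);                   /* W2: attaches late */
    seen_by_w1 = b.participants;                 /* W1: starts unmatched scan */

    if (w2_phase == TOY_BATCH_SCAN)
        toy_detach(&b);                          /* W2: detaches, moves on */

    *after = b.participants;
    return seen_by_w1;
}
```

Calling toy_replay_late_attach() from a small main() yields a count of 2
at the moment W1 starts scanning and 1 after W2 has detached, so an
assertion that exactly one participant is attached at scan start would
fire in this model even though only W1 ever scans.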

The attached v10 patch removes this assert and updates the comment in
ExecParallelScanHashTableForUnmatched().

I'm not sure if I should add more detail about this scenario in
ExecParallelHashJoinNewBatch() under PHJ_BATCH_SCAN or if the detail in
ExecParallelScanHashTableForUnmatched() is sufficient.

- Melanie
