Re: Parallel Full Hash Join - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Parallel Full Hash Join
Msg-id CA+hUKGLjySv=-E-4RkoFnFR-LyeKEXrwmCRc=GpLxv+FasS7zg@mail.gmail.com
In response to Re: Parallel Full Hash Join  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
On Tue, Sep 22, 2020 at 8:49 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> On Wed, Sep 11, 2019 at 11:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> 1.  You could probably make it so that the PHJ_BATCH_SCAN_INNER phase
>> in this patch (the scan for unmatched tuples) is executed by only one
>> process, using the "detach-and-see-if-you-were-last" trick.  Melanie
>> proposed that for an equivalent problem in the looping hash join.  I
>> think it probably works, but it gives up a lot of parallelism and thus
>> won't scale as nicely as the attached patch.
>
> I have attached a patch which implements this
> (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-buckets.patch).

Hi Melanie,

Thanks for working on this!  I have a feeling this is going to be much
easier to land than the mighty hash loop patch.  And it's good to get
one of our blocking design questions nailed down for both patches.

I took it for a very quick spin and saw simple cases working nicely,
but TPC-DS queries 51 and 97 (which contain full joins) couldn't be
convinced to use it.  Hmm.

> For starters, in order to support parallel FOJ and ROJ, I re-enabled
> setting the match bit for the tuples in the hashtable, which
> 3e4818e9dd5be294d97c disabled. I did so using the code suggested in [1],
> reading the match bit to see if it is already set before setting it.

Cool.  I'm quite keen to add a "fill_inner" parameter to
ExecHashJoinImpl() and have an N-dimensional lookup table of
ExecHashJoin variants, so that this and much other related branching
can be constant-folded out of existence by the compiler in the common
cases.  That's why I think this is all fine, but it's a project for
another day...
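
To make that concrete, here's a rough sketch of the shape I have in
mind, reusing the pg_attribute_always_inline trick that
ExecHashJoinImpl() already uses for its "parallel" argument.  The
fill_inner parameter and the ExecParallelHashJoinFillInner wrapper are
made up for illustration here, not code from either patch:

/* Sketch only: a constant "fill_inner" argument, folded away per variant. */
static pg_attribute_always_inline TupleTableSlot *
ExecHashJoinImpl(PlanState *pstate, bool parallel, bool fill_inner)
{
    /*
     * ... existing hash join state machine ...
     *
     * Every caller passes compile-time constants, so the compiler
     * specializes the inlined body and deletes branches like this one
     * from the variants that don't need them:
     */
    if (fill_inner)             /* hypothetical extra argument */
    {
        /* HJ_FILL_INNER_TUPLES handling exists only in fill_inner variants */
    }

    return NULL;                /* placeholder for the real state machine */
}

/* One thin, non-inlined wrapper per combination we care about. */
static TupleTableSlot *
ExecHashJoin(PlanState *pstate)
{
    return ExecHashJoinImpl(pstate, false, false);
}

static TupleTableSlot *
ExecParallelHashJoinFillInner(PlanState *pstate)
{
    return ExecHashJoinImpl(pstate, true, true);
}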

> Then, workers except for the last worker detach after exhausting the
> outer side of a batch, leaving one worker to proceed to HJ_FILL_INNER
> and do the scan of the hash table and emit unmatched inner tuples.

+1

Doing better is pretty complicated within our current execution model,
and I think this is a good compromise for now.
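
In case it helps anyone reading along, the control flow being proposed
is roughly this (a sketch using the patch's BarrierDetachOrElect(), not
the exact code):

/* Each worker arrives here once it has exhausted the outer side of a batch. */
last = BarrierDetachOrElect(&batch->batch_barrier);
if (!last)
{
    /* We have detached; forget this batch and look for other work. */
}
else
{
    /*
     * We are the only worker still attached, and the barrier has advanced
     * to the unmatched-scan phase: emit the unmatched inner tuples alone,
     * so nobody ever waits on a participant that is busy emitting tuples.
     */
    hjstate->hj_JoinState = HJ_FILL_INNER_TUPLES;
}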

Costing for uneven distribution is tricky; depending on your plan
shape, specifically whether there is something else to do afterwards
to pick up the slack, it might or might not affect the total run time
of the query.  It seems like there's not much we can do about that.

> I have also attached a variant on this patch which I am proposing to
> replace it (v1-0001-Parallel-FOJ-ROJ-single-worker-scan-chunks.patch)
> which has a new ExecParallelScanHashTableForUnmatched() in which the
> single worker doing the unmatched scan scans one HashMemoryChunk at a
> time and then frees them as it goes. I thought this might perform better
> than the version which uses the buckets because 1) it should do a bit
> less pointer chasing and 2) it frees each chunk of the hash table as it
> scans it, which (maybe) would save a bit of time during
> ExecHashTableDetachBatch() when it goes through and frees the hash
> table, but my preliminary tests showed a negligible difference between
> this and the version using buckets. I will do a bit more testing,
> though.

+1

I agree that it's the better of those two options.
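
For the archives, the shape of that chunk-at-a-time scan is roughly the
following.  This is a simplified sketch with serial-style pointers, not
the patch's actual ExecParallelScanHashTableForUnmatched(); the
shared-memory version would chase dsa_pointers through the shared chunk
list and dsa_free() the chunks instead:

HashMemoryChunk chunk = hashtable->chunks;

while (chunk != NULL)
{
    HashMemoryChunk next = chunk->next.unshared;
    size_t      idx = 0;

    while (idx < chunk->used)
    {
        HashJoinTuple hashTuple = (HashJoinTuple) (HASH_CHUNK_DATA(chunk) + idx);
        MinimalTuple  tuple = HJTUPLE_MINTUPLE(hashTuple);

        if (!HeapTupleHeaderHasMatch(tuple))
        {
            /* null-extend and emit this unmatched inner tuple */
        }

        idx += MAXALIGN(HJTUPLE_OVERHEAD + tuple->t_len);
    }

    /* Done with this chunk, so hand the memory back immediately. */
    pfree(chunk);
    chunk = next;
}
hashtable->chunks = NULL;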

>> [stuff about deadlocks]
>
> The leader exclusion tactics and the spooling idea don't solve the
> execution order deadlock possibility, so this "all except last detach
> and last does unmatched inner scan" seems like the best way to solve
> both types of deadlock.

Agreed (at least as long as our threads of query execution are made
out of C call stacks and OS processes that block).

>> Some other notes on the patch:
>>
>> Aside from the deadlock problem, there are some minor details to tidy
>> up (handling of late starters probably not quite right, rescans not
>> yet considered).
>
> These would not be an issue with only one worker doing the scan but
> would have to be handled in a potential new parallel-enabled solution
> like I suggested above.

Makes sense.  Not sure why I thought anything special was needed for rescans.

>> There is a fun hard-coded parameter that controls
>> the parallel step size in terms of cache lines for the unmatched scan;
>> I found that 8 was a lot faster than 4, but no slower than 128 on my
>> laptop, so I set it to 8.
>
> I didn't add this cache line optimization to my chunk scanning method. I
> could do so. Do you think it is more relevant, less relevant, or the
> same if only one worker is doing the unmatched inner scan?

Yeah, it's irrelevant for a single process, and even more irrelevant if
we go with your chunk-based version.

>> More thoughts along those micro-optimistic
>> lines: instead of a match bit in the header, you could tag the pointer
>> and sometimes avoid having to follow it, and you could prefetch the next
>> non-matching tuple's cacheline by looking ahead a bit.
>
> I would be happy to try doing this once we get the rest of the patch
> ironed out so that seeing how much of a performance difference it makes
> is more straightforward.

Ignore that; I have no idea whether the maintenance overhead for such an
every-tuple-in-this-chain-is-matched tag bit would be worth it.  It was
just an idle thought.  I think your chunk-scan plan seems sensible for
now.

From a quick peek:

+/*
+ * Upon arriving at the barrier, if this worker is not the last worker
+ * attached, detach from the barrier and return false. If this worker is
+ * the last worker, remain attached and advance the phase of the barrier,
+ * return true to indicate you are the last or "elected" worker who is
+ * still attached to the barrier.
+ * Another name I considered was BarrierUniqueify or BarrierSoloAssign
+ */
+bool
+BarrierDetachOrElect(Barrier *barrier)

I tried to find some existing naming in the literature on
barriers/phasers, but nothing is jumping out at me.  I think a lot of
this stuff comes from supercomputing, where I guess "make all of the
threads give up except one" isn't a primitive they'd be too excited
about :-)

BarrierArriveAndElectOrDetach()... gah, no.

+    last = BarrierDetachOrElect(&batch->batch_barrier);

It'd be nice to add some assertions after that, in the 'last' path,
checking that there's only one participant and that the phase is as
expected, just to make it even clearer to the reader, and a comment in
the other path noting that we are no longer attached.
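
Something like this, assuming the patch's PHJ_BATCH_SCAN_INNER phase
name (just a sketch of the kind of assertions I mean):

last = BarrierDetachOrElect(&batch->batch_barrier);
if (last)
{
    /* Elected: we should now be alone, in the unmatched-scan phase. */
    Assert(BarrierParticipants(&batch->batch_barrier) == 1);
    Assert(BarrierPhase(&batch->batch_barrier) == PHJ_BATCH_SCAN_INNER);
}
else
{
    /* We are no longer attached to batch_barrier; don't touch it again. */
}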

+    hjstate->hj_AllocatedBucketRange = 0;
...
+    pg_atomic_uint32 bucket;    /* bucket allocator for unmatched inner scan */
...
+                //volatile int mybp = 0; while (mybp == 0)

Some leftover fragments of the bucket-scan version and debugging stuff.


