
From: Thomas Munro
Subject: Re: WIP: [[Parallel] Shared] Hash
Date:
Msg-id: CAEepm=2ed4WWs=DKR5iuBW=9xi2rnw-rr0++omQZqkYVaz4pow@mail.gmail.com
In response to: Re: WIP: [[Parallel] Shared] Hash (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: WIP: [[Parallel] Shared] Hash
List: pgsql-hackers
On Tue, Nov 1, 2016 at 5:33 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Please find a WIP patch attached.  Everything related to batch reading
> is not currently in a working state, which breaks multi-batch joins,
> but many single-batch cases work correctly.  In an earlier version I
> had multi-batch joins working, but that was before I started tackling
> problems 2 and 3 listed in my earlier message.

Here is a better version with code to handle multi-batch joins.  The
BufFile sharing approach to reading other participants' batch files is
a straw man (perhaps what we really want would look more like a shared
tuplestore?), but it solves the immediate problem I described earlier,
so I can focus on other aspects of the problem.  There may be some
issues with cleanup, though; more on that soon.

Here's a summary of how this patch chops the hash join up into phases.
The 'phase' is an integer that encodes the step we're up to in the
algorithm, including the current batch number, and I represent that
with macros like PHJ_PHASE_HASHING and PHJ_PHASE_PROBING_BATCH(42).
Each phase is either serial, meaning that one participant does
something special, or parallel, meaning that all participants do the
same thing.
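To make that concrete, here's roughly the kind of encoding I mean (a
simplified sketch, not the actual patch code; the real layout may
differ):

/* One possible phase encoding: a monotonically increasing counter
 * from which the batch number and the step within that batch's cycle
 * can be recovered by division.  Illustration only. */
#include <stdio.h>

typedef unsigned int PhjPhase;

enum
{
    PHJ_STEP_PROMOTING,         /* serial */
    PHJ_STEP_LOADING,           /* parallel */
    PHJ_STEP_PREPARING,         /* serial */
    PHJ_STEP_PROBING,           /* parallel */
    PHJ_STEP_UNMATCHED,         /* parallel */
    PHJ_STEPS_PER_BATCH
};

/* The one-off initial phases (INIT, CREATING, HASHING, RESIZING,
 * REBUCKETING) would occupy the values below PHJ_FIRST_BATCH_PHASE;
 * omitted for brevity.  Batch 0 skips its first three steps. */
#define PHJ_FIRST_BATCH_PHASE 5

#define PHJ_PHASE_BATCH(n, step) \
    (PHJ_FIRST_BATCH_PHASE + (n) * PHJ_STEPS_PER_BATCH + (step))
#define PHJ_PHASE_PROMOTING_BATCH(n) PHJ_PHASE_BATCH((n), PHJ_STEP_PROMOTING)
#define PHJ_PHASE_LOADING_BATCH(n)   PHJ_PHASE_BATCH((n), PHJ_STEP_LOADING)
#define PHJ_PHASE_PREPARING_BATCH(n) PHJ_PHASE_BATCH((n), PHJ_STEP_PREPARING)
#define PHJ_PHASE_PROBING_BATCH(n)   PHJ_PHASE_BATCH((n), PHJ_STEP_PROBING)
#define PHJ_PHASE_UNMATCHED_BATCH(n) PHJ_PHASE_BATCH((n), PHJ_STEP_UNMATCHED)

#define PHJ_BATCH(p) (((p) - PHJ_FIRST_BATCH_PHASE) / PHJ_STEPS_PER_BATCH)
#define PHJ_STEP(p)  (((p) - PHJ_FIRST_BATCH_PHASE) % PHJ_STEPS_PER_BATCH)

int main(void)
{
    PhjPhase p = PHJ_PHASE_PROBING_BATCH(42);

    printf("phase %u: batch %u, step %u\n", p, PHJ_BATCH(p), PHJ_STEP(p));
    return 0;
}

A single monotonically increasing counter gives a total order over all
the phases, which makes "wait until everyone has reached phase P" easy
to express.  The sequence of phases goes like this: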

* PHJ_PHASE_INIT
The initial phase, established by the leader before launching workers.

* PHJ_PHASE_CREATING
Serial:  One participant creates the hash table.
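Each serial step needs exactly one participant to win the right to do
the work.  One way to picture that is an atomic test-and-set, sketched
here with C11 atomics standing in for the real shared memory
primitives (illustration only, not the patch's mechanism):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct PhjShared
{
    atomic_flag creating_claimed;   /* lives in shared memory for real */
} PhjShared;

/* Returns true for exactly one caller, which then performs the serial
 * work; everyone else waits for the phase to advance. */
static bool
claim_serial_step(PhjShared *shared)
{
    return !atomic_flag_test_and_set(&shared->creating_claimed);
}

int main(void)
{
    PhjShared shared = { ATOMIC_FLAG_INIT };

    if (claim_serial_step(&shared))
        printf("elected: create the shared hash table\n");
    if (!claim_serial_step(&shared))
        printf("later claims fail, so the step runs exactly once\n");
    return 0;
}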

* PHJ_PHASE_HASHING
Serial or parallel:  Depending on the plan, one or all participants
execute the inner plan to completion, building the hash table for
batch 0 and possibly writing tuples out to batch files on disk for
future batches.
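The per-tuple decision follows the usual hybrid hash join arrangement,
in the spirit of ExecHashGetBucketAndBatch: the low bits of the hash
pick the bucket and the next bits pick the batch.  A simplified sketch
(the helper functions are hypothetical stand-ins):

#include <stdint.h>
#include <stdio.h>

static void
insert_into_hash_table(uint32_t hashvalue, int bucketno)
{
    printf("hash %08x -> batch 0, bucket %d\n", (unsigned) hashvalue,
           bucketno);
}

static void
write_to_inner_batch_file(uint32_t hashvalue, int batchno)
{
    printf("hash %08x -> spill to inner batch file %d\n",
           (unsigned) hashvalue, batchno);
}

/* nbuckets and nbatch are both powers of two. */
static void
hash_inner_tuple(uint32_t hashvalue, int nbuckets_log2, int nbatch)
{
    int bucketno = hashvalue & ((1U << nbuckets_log2) - 1);
    int batchno = (hashvalue >> nbuckets_log2) & (nbatch - 1);

    if (batchno == 0)
        insert_into_hash_table(hashvalue, bucketno);    /* in memory */
    else
        write_to_inner_batch_file(hashvalue, batchno);  /* on disk */
}

int main(void)
{
    /* 1024 buckets, 4 batches. */
    hash_inner_tuple(0x00000123, 10, 4);   /* batch 0: stays in memory */
    hash_inner_tuple(0x00000523, 10, 4);   /* batch 1: spilled */
    return 0;
}

Taking the batch number from the bits just above the bucket bits is
what makes later nbatch increases tolerable: when nbatch doubles, each
tuple either stays in its batch or moves to exactly one new batch.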

* PHJ_PHASE_RESIZING
Serial:  One participant resizes the hash table if necessary.

* PHJ_PHASE_REBUCKETING
Parallel:  If the hash table was resized, all participants rehash all
the tuples in it and insert them into the buckets of the new, larger
hash table.
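One way the rehash can be divided up: participants claim tuples from a
shared cursor and push them onto the new bucket chains with
compare-and-swap, since several participants can hit the same bucket
at once.  A sketch, again with C11 atomics standing in for the real
primitives and with data structures that are only illustrative:

#include <stdatomic.h>
#include <stdint.h>

typedef struct HashJoinTuple
{
    uint32_t hashvalue;
    struct HashJoinTuple *next;
} HashJoinTuple;

typedef struct RebucketState
{
    atomic_int next_tuple;                 /* shared work cursor */
    _Atomic(HashJoinTuple *) *buckets;     /* new, larger bucket array */
    int nbuckets_log2;
} RebucketState;

/* Each participant runs this until the shared cursor runs off the
 * end; tuples[] stands in for however the tuples are reachable. */
static void
rebucket_worker(RebucketState *rs, HashJoinTuple **tuples, int ntuples)
{
    for (;;)
    {
        int i = atomic_fetch_add(&rs->next_tuple, 1);

        if (i >= ntuples)
            break;

        HashJoinTuple *tup = tuples[i];
        int bucketno = tup->hashvalue & ((1U << rs->nbuckets_log2) - 1);
        HashJoinTuple *head = atomic_load(&rs->buckets[bucketno]);

        /* Push onto the bucket chain with CAS: other participants may
         * be inserting into the same bucket concurrently. */
        do
        {
            tup->next = head;
        } while (!atomic_compare_exchange_weak(&rs->buckets[bucketno],
                                               &head, tup));
    }
}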

* PHJ_PHASE_PROBING_BATCH(0)
Parallel:  All participants execute the outer plan to completion.  For
each tuple they either probe the hash table if it's for the current
batch, or write it out to a batch file if it's for a future batch.
For each tuple matched in the hash table, they set the matched bit.
When they are finished probing batch 0, they also preload tuples from
inner batch 1 into a secondary hash table until work_mem is exhausted
(note that at this time work_mem is occupied by the primary hash
table: this is just a way to use any remaining work_mem and extract a
little more parallelism, since otherwise every participant would be
waiting for all participants to finish probing; instead we wait for
all participants to finish probing AND for spare work_mem to run out).
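In outline, each participant's loop for this phase looks something
like the following (every helper is a hypothetical stand-in for
executor machinery; the point is the shape: drain the outer side, then
soak up leftover work_mem with inner batch 1):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HashJoinTuple
{
    uint32_t hashvalue;
    bool matched;                   /* set when a probe finds this tuple */
    struct HashJoinTuple *next;
} HashJoinTuple;

typedef struct OuterTuple { uint32_t hashvalue; } OuterTuple;

extern OuterTuple *next_outer_tuple(void);
extern int batch_for(uint32_t hashvalue);
extern HashJoinTuple *bucket_for(uint32_t hashvalue);
extern bool keys_equal(HashJoinTuple *inner, OuterTuple *outer);
extern void emit_joined_row(HashJoinTuple *inner, OuterTuple *outer);
extern void write_to_outer_batch_file(OuterTuple *outer, int batchno);
extern bool spare_work_mem_remains(void);
extern HashJoinTuple *next_inner_batch1_tuple(void);
extern void insert_into_secondary_table(HashJoinTuple *tup);

static void
probe_batch0(void)
{
    OuterTuple *outer;
    HashJoinTuple *inner;

    /* Part 1: drain the outer plan. */
    while ((outer = next_outer_tuple()) != NULL)
    {
        int batchno = batch_for(outer->hashvalue);

        if (batchno == 0)
        {
            for (HashJoinTuple *t = bucket_for(outer->hashvalue);
                 t != NULL; t = t->next)
            {
                if (t->hashvalue == outer->hashvalue &&
                    keys_equal(t, outer))
                {
                    t->matched = true;      /* for right/full joins */
                    emit_joined_row(t, outer);
                }
            }
        }
        else
            write_to_outer_batch_file(outer, batchno);
    }

    /* Part 2: rather than idling at the barrier, preload inner batch 1
     * into the secondary hash table until work_mem is used up. */
    while (spare_work_mem_remains() &&
           (inner = next_inner_batch1_tuple()) != NULL)
        insert_into_secondary_table(inner);
}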

* PHJ_PHASE_UNMATCHED_BATCH(0)
Parallel:  For right/full joins, all participants then scan the hash
table looking for unmatched tuples.
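The unmatched scan itself is simple; to parallelise it, each
participant can take a range of buckets rather than scanning the whole
table.  A sketch (the emit helper is a hypothetical stand-in):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HashJoinTuple
{
    uint32_t hashvalue;
    bool matched;
    struct HashJoinTuple *next;
} HashJoinTuple;

extern void emit_null_extended_row(HashJoinTuple *tup);

/* Emit every tuple in buckets [first_bucket, last_bucket) whose
 * matched flag was never set during probing, null-extended on the
 * outer side.  Dividing the bucket array among participants
 * parallelises the scan. */
static void
scan_for_unmatched(HashJoinTuple **buckets, int first_bucket,
                   int last_bucket)
{
    for (int i = first_bucket; i < last_bucket; i++)
        for (HashJoinTuple *t = buckets[i]; t != NULL; t = t->next)
            if (!t->matched)
                emit_null_extended_row(t);
}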

... now we are ready for batch 1 ...

* PHJ_PHASE_PROMOTING_BATCH(1)
Serial:  One participant promotes the secondary hash table to become
the new primary hash table.

* PHJ_PHASE_LOADING_BATCH(1)
Parallel:  All participants finish loading inner batch 1 into the hash
table (work that was started in the previous probing phase).

* PHJ_PHASE_PREPARING_BATCH(1)
Serial:  One participant resets the batch reading heads, so that we
are ready to read from outer batch 1 and inner batch 2.

* PHJ_PHASE_PROBING_BATCH(1)
Parallel:  All participants read from outer batch 1 to probe the hash
table, then read from inner batch 2 to preload tuples into the
secondary hash table.

* PHJ_PHASE_UNMATCHED_BATCH(1)
Parallel:  For right/full joins, all participants then scan the hash
table looking for unmatched tuples.

... now we are ready for batch 2 ...
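Putting the per-batch steps together, each participant effectively
runs a little state machine over the phase counter, with the serial
steps guarded by the kind of election shown earlier.  Roughly (this
assumes the PHJ_PHASE_* macros from the encoding sketch above; all the
helpers are hypothetical, and wait_for_all() blocks until every
participant has finished the given phase):

#include <stdbool.h>

extern bool claim_serial_step(unsigned phase);
extern void wait_for_all(unsigned phase);
extern void promote_secondary_hash_table(void);
extern void finish_loading_inner_batch(unsigned batchno);
extern void reset_batch_read_heads(unsigned batchno);
extern void probe_outer_batch_and_preload_next(unsigned batchno);
extern bool join_needs_unmatched_scan(void);
extern void scan_my_share_for_unmatched(void);

static void
run_batch(unsigned batchno)
{
    /* PHJ_PHASE_PROMOTING_BATCH(n): serial. */
    if (claim_serial_step(PHJ_PHASE_PROMOTING_BATCH(batchno)))
        promote_secondary_hash_table();     /* swap the two tables */
    wait_for_all(PHJ_PHASE_PROMOTING_BATCH(batchno));

    /* PHJ_PHASE_LOADING_BATCH(n): parallel. */
    finish_loading_inner_batch(batchno);
    wait_for_all(PHJ_PHASE_LOADING_BATCH(batchno));

    /* PHJ_PHASE_PREPARING_BATCH(n): serial. */
    if (claim_serial_step(PHJ_PHASE_PREPARING_BATCH(batchno)))
        reset_batch_read_heads(batchno);
    wait_for_all(PHJ_PHASE_PREPARING_BATCH(batchno));

    /* PHJ_PHASE_PROBING_BATCH(n): parallel, including the preload of
     * inner batch n+1 into the secondary hash table. */
    probe_outer_batch_and_preload_next(batchno);
    wait_for_all(PHJ_PHASE_PROBING_BATCH(batchno));

    /* PHJ_PHASE_UNMATCHED_BATCH(n): parallel, right/full joins only. */
    if (join_needs_unmatched_scan())
        scan_my_share_for_unmatched();
    wait_for_all(PHJ_PHASE_UNMATCHED_BATCH(batchno));
}

Batch 0 is the same cycle minus the first three steps, since its hash
table is built directly during PHJ_PHASE_HASHING.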

Then all participants synchronise a final time to enter phase
PHJ_PHASE_PROMOTING_BATCH(nbatch), which is one past the end and is
the point at which it is safe to clean up.  (There may be an
optimisation where I can clean up after the last participant detaches
instead; more on that soon.)
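If it works out, that optimisation could amount to little more than a
reference count on the shared state: the participant that brings the
attachment count to zero knows it is alone and can safely delete the
batch files.  A sketch with C11 atomics standing in for the real
shared memory machinery:

#include <stdatomic.h>
#include <stdio.h>

typedef struct PhjShared
{
    atomic_int attached;    /* participants still attached */
} PhjShared;

static void
detach(PhjShared *shared)
{
    /* fetch_sub returns the value before the decrement, so exactly
     * one participant sees 1 and does the cleanup. */
    if (atomic_fetch_sub(&shared->attached, 1) == 1)
        printf("last one out: delete batch files, free shared state\n");
}

int main(void)
{
    PhjShared shared = { .attached = 3 };

    detach(&shared);
    detach(&shared);
    detach(&shared);    /* this one cleans up */
    return 0;
}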

Obviously I'm actively working on developing and stabilising all this.
Some of the things I'm working on are work_mem accounting, batch count
increases, rescans, and figuring out whether the resource management
for those BufFiles is going to work.  There are quite a lot of edge
cases, some of which I'm still figuring out, but I feel like this
approach is workable.  At this stage I want to share what I'm doing to
see if others have feedback, ideas, blood-curdling screams of horror,
etc.  I will have better patches and a set of test queries soon.
Thanks for reading.

--
Thomas Munro
http://www.enterprisedb.com

