Re: WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers

From: Thomas Munro
Subject: Re: WIP: [[Parallel] Shared] Hash
Date:
Msg-id: CAEepm=2ed4WWs=DKR5iuBW=9xi2rnw-rr0++omQZqkYVaz4pow@mail.gmail.com
In response to: Re: WIP: [[Parallel] Shared] Hash (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: WIP: [[Parallel] Shared] Hash
List: pgsql-hackers
On Tue, Nov 1, 2016 at 5:33 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Please find a WIP patch attached. Everything related to batch reading
> is not currently in a working state, which breaks multi-batch joins,
> but many single-batch cases work correctly. In an earlier version I
> had multi-batch joins working, but that was before I started tackling
> problems 2 and 3 listed in my earlier message.

Here is a better version with code to handle multi-batch joins. The
BufFile sharing approach to reading other participants' batch files is
a straw man (perhaps what we really want would look more like a shared
tuplestore?), but it solves the immediate problem I described earlier
so I can focus on other aspects of the problem. There may be some
issues with cleanup though, more on that soon.

Here's a summary of how this patch chops the hash join up into phases.
The 'phase' is an integer that encodes the step we're up to in the
algorithm, including the current batch number, and I represent that
with macros like PHJ_PHASE_HASHING and PHJ_PHASE_PROBING_BATCH(42).
Each phase is either serial, meaning that one participant does
something special, or parallel, meaning that all participants do the
same thing. A sketch of one possible encoding appears below.
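To make that encoding concrete, here is a minimal sketch of how such
macros could be defined. Only the PHJ_PHASE_* names come from the
description here; the numeric layout, the PHJ_BATCH_STEP_* constants,
PHJ_PHASE_BATCH_BASE and PHJ_PHASE_TO_BATCHNO are illustrative
assumptions, not the patch's actual definitions.

/*
 * Hypothetical sketch only -- not the patch's actual definitions.  The
 * idea is to pack the step and the batch number into a single integer,
 * so the current position in the algorithm can be stored, advanced and
 * compared as one value.
 */

/* Build-side phases that happen once, before any probing. */
#define PHJ_PHASE_INIT              0
#define PHJ_PHASE_CREATING          1
#define PHJ_PHASE_HASHING           2
#define PHJ_PHASE_RESIZING          3
#define PHJ_PHASE_REBUCKETING       4

/* Steps repeated for every batch (batch 0 skips the first three). */
#define PHJ_BATCH_STEP_PROMOTING    0
#define PHJ_BATCH_STEP_LOADING      1
#define PHJ_BATCH_STEP_PREPARING    2
#define PHJ_BATCH_STEP_PROBING      3
#define PHJ_BATCH_STEP_UNMATCHED    4
#define PHJ_BATCH_NUM_STEPS         5

/* Phases for batch n start right after the build-side phases. */
#define PHJ_PHASE_BATCH_BASE(n) \
    (PHJ_PHASE_REBUCKETING + 1 + (n) * PHJ_BATCH_NUM_STEPS)

#define PHJ_PHASE_PROMOTING_BATCH(n) \
    (PHJ_PHASE_BATCH_BASE(n) + PHJ_BATCH_STEP_PROMOTING)
#define PHJ_PHASE_LOADING_BATCH(n) \
    (PHJ_PHASE_BATCH_BASE(n) + PHJ_BATCH_STEP_LOADING)
#define PHJ_PHASE_PREPARING_BATCH(n) \
    (PHJ_PHASE_BATCH_BASE(n) + PHJ_BATCH_STEP_PREPARING)
#define PHJ_PHASE_PROBING_BATCH(n) \
    (PHJ_PHASE_BATCH_BASE(n) + PHJ_BATCH_STEP_PROBING)
#define PHJ_PHASE_UNMATCHED_BATCH(n) \
    (PHJ_PHASE_BATCH_BASE(n) + PHJ_BATCH_STEP_UNMATCHED)

/* Recover the batch number from a per-batch phase. */
#define PHJ_PHASE_TO_BATCHNO(p) \
    (((p) - PHJ_PHASE_BATCH_BASE(0)) / PHJ_BATCH_NUM_STEPS)

Under a layout like this the phase values are strictly increasing, so
a participant can tell how far the join has progressed, and which
batch it is in, by comparing integers.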
It goes like this:

* PHJ_PHASE_INIT
  The initial phase established by the leader before launching workers.

* PHJ_PHASE_CREATING
  Serial: One participant creates the hash table.

* PHJ_PHASE_HASHING
  Serial or parallel: Depending on the plan, one or all participants
  execute the inner plan to completion, building the hash table for
  batch 0 and possibly writing tuples to batch files on disk for future
  batches.

* PHJ_PHASE_RESIZING
  Serial: One participant resizes the hash table if necessary.

* PHJ_PHASE_REBUCKETING
  Parallel: If the hash table was resized, all participants rehash all
  the tuples in it and insert them into the buckets of the new, larger
  hash table.

* PHJ_PHASE_PROBING_BATCH(0)
  Parallel: All participants execute the outer plan to completion. For
  each tuple they either probe the hash table, if it's for the current
  batch, or write it out to a batch file, if it's for a future batch.
  For each tuple matched in the hash table, they set the matched bit.
  When they are finished probing batch 0, they also preload tuples from
  inner batch 1 into a secondary hash table until work_mem is exhausted
  (note that at this time work_mem is occupied by the primary hash
  table: this is just a way to use any remaining work_mem and extract a
  little more parallelism, since otherwise every participant would be
  waiting for all participants to finish probing; instead we wait for
  all participants to finish probing AND for spare work_mem to run
  out).

* PHJ_PHASE_UNMATCHED_BATCH(0)
  Parallel: For right/full joins, all participants then scan the hash
  table looking for unmatched tuples.

... now we are ready for batch 1 ...

* PHJ_PHASE_PROMOTING_BATCH(1)
  Serial: One participant promotes the secondary hash table to become
  the new primary hash table.

* PHJ_PHASE_LOADING_BATCH(1)
  Parallel: All participants finish loading inner batch 1 into the hash
  table (work that was started in the previous probing phase).

* PHJ_PHASE_PREPARING_BATCH(1)
  Serial: One participant resets the batch reading heads, so that we
  are ready to read from outer batch 1 and inner batch 2.

* PHJ_PHASE_PROBING_BATCH(1)
  Parallel: All participants read from outer batch 1 to probe the hash
  table, then read from inner batch 2 to preload tuples into the
  secondary hash table.

* PHJ_PHASE_UNMATCHED_BATCH(1)
  Parallel: For right/full joins, all participants then scan the hash
  table looking for unmatched tuples.

... now we are ready for batch 2 ...

Then all participants synchronise a final time to enter the phase
PHJ_PHASE_PROMOTING_BATCH(nbatch), which is one past the end and is
the point at which it is safe to clean up. (There may be an
optimisation where I can clean up after the last participant detaches
instead, more on that soon.) The sketch below illustrates this
serial/parallel coordination pattern.
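To show how the serial and parallel phases alternate, here is a
self-contained sketch of the coordination pattern using POSIX threads.
It is purely illustrative: real participants are PostgreSQL backends
synchronising through shared memory rather than pthreads, only a few
of the phases are shown, and every function in it (create_hash_table
and friends) is a stand-in, not the patch's code. The convenient
property is that pthread_barrier_wait() returns
PTHREAD_BARRIER_SERIAL_THREAD in exactly one waiter, which yields the
"Serial: one participant ..." steps directly.

#include <pthread.h>
#include <stdio.h>

#define NUM_PARTICIPANTS 4
#define NBATCH 2    /* fixed here for simplicity; really discovered while hashing */

static pthread_barrier_t phase_barrier;

/* Stand-ins for the real executor work. */
static void create_hash_table(void)        { puts("create hash table"); }
static void build_batch_zero(void)         { puts("hash inner plan, batch 0"); }
static void promote_secondary_table(void)  { puts("promote secondary table"); }
static void probe_batch(int n)             { printf("probe batch %d\n", n); }
static void clean_up(void)                 { puts("clean up"); }

static void *
participant(void *arg)
{
    (void) arg;

    /* PHJ_PHASE_CREATING: serial -- exactly one waiter gets the token. */
    if (pthread_barrier_wait(&phase_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
        create_hash_table();
    pthread_barrier_wait(&phase_barrier);   /* others wait for creation */

    /* PHJ_PHASE_HASHING: parallel -- everyone helps build batch 0. */
    build_batch_zero();
    pthread_barrier_wait(&phase_barrier);

    for (int batchno = 0; batchno < NBATCH; batchno++)
    {
        if (batchno > 0)
        {
            /* PHJ_PHASE_PROMOTING_BATCH(n): serial. */
            if (pthread_barrier_wait(&phase_barrier) ==
                PTHREAD_BARRIER_SERIAL_THREAD)
                promote_secondary_table();
            pthread_barrier_wait(&phase_barrier);
        }

        /* PHJ_PHASE_PROBING_BATCH(n): parallel. */
        probe_batch(batchno);
        pthread_barrier_wait(&phase_barrier);
    }

    /*
     * One-past-the-end phase: every participant has arrived, so it is
     * now safe for one of them to clean up.
     */
    if (pthread_barrier_wait(&phase_barrier) == PTHREAD_BARRIER_SERIAL_THREAD)
        clean_up();
    return NULL;
}

int
main(void)
{
    pthread_t threads[NUM_PARTICIPANTS];

    pthread_barrier_init(&phase_barrier, NULL, NUM_PARTICIPANTS);
    for (int i = 0; i < NUM_PARTICIPANTS; i++)
        pthread_create(&threads[i], NULL, participant, NULL);
    for (int i = 0; i < NUM_PARTICIPANTS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&phase_barrier);
    return 0;
}

Compile with cc -pthread. The parallel steps print once per
participant, the serial steps exactly once, and the final barrier is
the "one past the end" point at which cleanup is safe.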
Obviously I'm actively working on developing and stabilising all this.
Some of the things I'm working on are: work_mem accounting, batch
increases, rescans, and figuring out whether the resource management
for those BufFiles is going to work. There are quite a lot of edge
cases, some of which I'm still figuring out, but I feel like this
approach is workable. At this stage I want to share what I'm doing to
see if others have feedback, ideas, blood curdling screams of horror,
etc. I will have better patches and a set of test queries soon.

Thanks for reading.

--
Thomas Munro
http://www.enterprisedb.com

Attachment