Re: [HACKERS] WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: [HACKERS] WIP: [[Parallel] Shared] Hash |
Date | |
Msg-id | CAEepm=3g=dMG+84083fkFzLvgMJ7HdhbGB=AeZABNukbZm3hpA@mail.gmail.com |
In response to | Re: [HACKERS] WIP: [[Parallel] Shared] Hash (Thomas Munro <thomas.munro@enterprisedb.com>) |
Responses | Re: [HACKERS] WIP: [[Parallel] Shared] Hash, Re: [HACKERS] WIP: [[Parallel] Shared] Hash |
List | pgsql-hackers |
On Wed, Mar 1, 2017 at 10:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I'm testing a new version which incorporates feedback from Andres and
> Ashutosh, and is refactored to use a new SharedBufFileSet component to
> handle batch files, replacing the straw-man implementation from the v5
> patch series. I've set this to waiting-on-author and will post v6
> tomorrow.

I created a system for reference-counted, partitioned temporary files called SharedBufFileSet: see 0007-hj-shared-buf-file.patch. Then I ripped out the code for sharing batch files that I previously had cluttering up nodeHashjoin.c, and refactored it into a new component called a SharedTuplestore, which wraps a SharedBufFileSet and gives it a tuple-based interface: see 0008-hj-shared-tuplestore.patch. The name implies aspirations of becoming a more generally useful shared analogue of tuplestore, but for now it supports only the exact access pattern needed for hash join batches ($10 wrench). It creates temporary files like this:

  base/pgsql_tmp/pgsql_tmp[pid].[set].[partition].[participant].[segment]

I'm not sure why nodeHashjoin.c is doing raw batch file read/write operations anyway; why not use tuplestore.c for that (as tuplestore.c's comments incorrectly say is the case)? Maybe because tuplestore's interface doesn't support storing the extra hash value. In SharedTuplestore I solved that problem by introducing an optional fixed-size piece of per-tuple meta-data. Another thing that is different about SharedTuplestore is that it supports partitions, which is convenient for this project and probably for other parallel projects too.

In order for workers to be able to participate in reference counting schemes based on DSM segment lifetime, I had to give the Exec*InitializeWorker() functions access to the dsm_segment object, whereas previously they received only the shm_toc in order to access its contents. I invented ParallelWorkerContext, which has just two members 'seg' and 'toc': see 0005-hj-let-node-have-seg-in-worker.patch. I didn't touch the FDW API or custom scan API where they currently take toc, though I can see that there is an argument that they should; changing those APIs seems like a bigger deal. Another approach would be to use ParallelContext, as passed into ExecXXXInitializeDSM, with the members that are not applicable to workers zeroed out. Thoughts?

I got rid of the ExecDetachXXX stuff I had invented in the last version, because acf555bc fixed the problem a better way.

I found that I needed to use more than one toc entry for a single executor node, in order to reserve space for the inner and outer SharedTuplestore objects. So I invented a way to make extra keys with PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, N).

Some rough sketches of the pieces described above are appended after my signature.

--
Thomas Munro
http://www.enterprisedb.com
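To make the naming scheme concrete, here is a trivial sketch of composing a path of that shape. The helper name and argument list are made up for illustration; the real code in 0007-hj-shared-buf-file.patch may assemble the path differently.

  #include <stdio.h>

  /*
   * Illustrative only: build a temporary file name of the form
   * base/pgsql_tmp/pgsql_tmp[pid].[set].[partition].[participant].[segment].
   */
  static void
  shared_buf_file_name(char *buf, size_t len,
                       int creator_pid, int set, int partition,
                       int participant, int segment)
  {
      snprintf(buf, len,
               "base/pgsql_tmp/pgsql_tmp%d.%d.%d.%d.%d",
               creator_pid, set, partition, participant, segment);
  }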
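To illustrate the per-tuple meta-data idea, here is a rough sketch of how a hash join batch might be written and then re-read through a tuple-based interface that carries a fixed-size blob (here, the hash value) alongside each tuple. The type and function names (SharedTuplestoreAccessor, sts_puttuple, sts_scan_next) and their signatures are placeholders, not necessarily what 0008-hj-shared-tuplestore.patch actually provides.

  #include "postgres.h"
  #include "access/htup_details.h"    /* MinimalTuple */

  typedef struct SharedTuplestoreAccessor SharedTuplestoreAccessor;

  /* Hypothetical interface: meta_data points to a fixed-size blob that is
   * stored with each tuple and handed back when the tuple is read. */
  extern void sts_puttuple(SharedTuplestoreAccessor *accessor,
                           void *meta_data, MinimalTuple tuple);
  extern MinimalTuple sts_scan_next(SharedTuplestoreAccessor *accessor,
                                    void *meta_data);

  /* Write one tuple into the current batch, tagged with its hash value. */
  static void
  store_batch_tuple(SharedTuplestoreAccessor *accessor,
                    uint32 hashvalue, MinimalTuple tuple)
  {
      sts_puttuple(accessor, &hashvalue, tuple);
  }

  /* Read the batch back later; the stored hash value comes back with each
   * tuple, so it never needs to be recomputed. */
  static void
  load_batch(SharedTuplestoreAccessor *accessor)
  {
      uint32       hashvalue;
      MinimalTuple tuple;

      while ((tuple = sts_scan_next(accessor, &hashvalue)) != NULL)
      {
          /* ... insert tuple into the hash table using hashvalue ... */
      }
  }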
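The ParallelWorkerContext structure itself is tiny; roughly this, with just the two members mentioned above:

  #include "postgres.h"
  #include "storage/dsm.h"        /* dsm_segment */
  #include "storage/shm_toc.h"    /* shm_toc */

  /*
   * Passed to Exec*InitializeWorker() so that a node can reach both the
   * DSM segment (e.g. to hook into segment-lifetime reference counting
   * via on_dsm_detach) and the toc (to look up its shared state).
   */
  typedef struct ParallelWorkerContext
  {
      dsm_segment *seg;
      shm_toc     *toc;
  } ParallelWorkerContext;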
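Finally, a sketch of one possible way to generate extra toc keys for a single executor node. The base constant and bit layout here are invented for illustration; the patch's actual PARALLEL_KEY_EXECUTOR_NTH may encode things differently.

  #include "postgres.h"
  #include "storage/shm_toc.h"

  /*
   * Illustrative only: pack plan_node_id and a small per-node index N into
   * a reserved region of the 64-bit toc key space, so that one executor
   * node can reserve several toc entries (say, inner and outer
   * SharedTuplestore state).
   */
  #define PARALLEL_KEY_EXECUTOR_NTH_BASE  UINT64CONST(0xE100000000000000)
  #define PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, N) \
      (PARALLEL_KEY_EXECUTOR_NTH_BASE | \
       ((uint64) (plan_node_id) << 8) | (uint64) (N))

  /*
   * Usage would look something like this in a node's ExecXXXInitializeDSM():
   *
   *   shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, 0), inner_shared);
   *   shm_toc_insert(pcxt->toc, PARALLEL_KEY_EXECUTOR_NTH(plan_node_id, 1), outer_shared);
   */

Any encoding would do, as long as it cannot collide with the other keys already registered in the same toc.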