Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: Parallel tuplesort (for parallel B-Tree index creation) |
| Date | |
| Msg-id | 3720c2e1-a41f-1df1-4c63-c875cc97afe7@iki.fi |
| In response to | Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>) |
| Responses | Re: Parallel tuplesort (for parallel B-Tree index creation); Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| List | pgsql-hackers |
On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Tape unification
> ----------------
>
> Sort operations have a unique identifier, generated before any workers
> are launched, using a scheme based on the leader's PID, and a unique
> temp file number. This makes all on-disk state (temp files managed by
> logtape.c) discoverable by the leader process. State in shared memory
> is sized in proportion to the number of workers, so the only thing
> about the data being sorted that gets passed around in shared memory
> is a little logtape.c metadata for tapes, describing for example how
> large each constituent BufFile is (a BufFile associated with one
> particular worker's tapeset).
>
> (See below also for notes on buffile.c's role in all of this, fd.c and
> resource management, etc.)
>
> ...
>
> buffile.c, and "unification"
> ============================
>
> There has been significant new infrastructure added to make logtape.c
> aware of workers. buffile.c has in turn been taught about unification
> as a first class part of the abstraction, with low-level management of
> certain details occurring within fd.c. So, "tape unification" within
> processes to open other backends' logical tapes to generate a unified
> logical tapeset for the leader to merge is added. This is probably the
> single biggest source of complexity for the patch, since I must
> consider:
>
> * Creating a general, reusable abstraction for other possible BufFile
> users (logtape.c only has to serve tuplesort.c, though).
>
> * Logical tape free space management.
>
> * Resource management, file lifetime, etc. fd.c resource management
> can now close a file at xact end for temp files, while not deleting it
> in the leader backend (only the "owning" worker backend deletes the
> temp file it owns).
>
> * Crash safety (e.g., when to truncate existing temp files, and when not to).

I find this unification business really complicated. I think it'd be simpler to keep the BufFiles and LogicalTapeSets separate, and instead teach tuplesort.c how to merge tapes that live on different LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single LogicalTapeSet can contain tapes from different underlying BufFiles.

What I have in mind is something like the attached patch. It refactors the LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a LogicalTape as argument, instead of a LogicalTapeSet and a tape number. LogicalTapeSet no longer has the concept of a tape number: it can contain any number of tapes, and you can create more on the fly.

With that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that came from different tape sets, backed by different BufFiles. I think that'd avoid much of the unification code.
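To sketch the shape of that interface (illustrative declarations only; apart from LogicalTapeRead() and LogicalTapeWrite(), the names and exact signatures below are made up rather than taken from the attached patch):

/*
 * Sketch only, not the attached patch: the tape number disappears from
 * the logtape.c API, and callers hold LogicalTape pointers instead.
 */
#include <stddef.h>

typedef struct LogicalTapeSet LogicalTapeSet;   /* opaque */
typedef struct LogicalTape LogicalTape;         /* opaque */

/* A tape set holds any number of tapes; more can be created on the fly. */
extern LogicalTapeSet *LogicalTapeSetCreate(void);
extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);

/* The I/O routines take the tape itself, not (tape set, tape number). */
extern void   LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);
extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);
extern void   LogicalTapeRewindForRead(LogicalTape *lt, size_t buffer_size);

The point is just that a tape handle stands on its own, so a merge can consume tapes regardless of which tape set, and hence which BufFile, each one came from.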
That leaves one problem, though: reusing space in the final merge phase. If the tapes being merged belong to different LogicalTapeSets, and we create one new tape to hold the result, the new tape cannot easily reuse the space of the input tapes because they are on different tape sets. But looking at your patch, ISTM you actually dodged that problem as well:

> + * As a consequence of only being permitted to write to the leader
> + * controlled range, parallel sorts that require a final materialized tape
> + * will use approximately twice the disk space for temp files compared to
> + * a more or less equivalent serial sort. This is deemed acceptable,
> + * since it is far rarer in practice for parallel sort operations to
> + * require a final materialized output tape. Note that this does not
> + * apply to any merge process required by workers, which may reuse space
> + * eagerly, just like conventional serial external sorts, and so
> + * typically, parallel sorts consume approximately the same amount of disk
> + * blocks as a more or less equivalent serial sort, even when workers must
> + * perform some merging to produce input to the leader.

I'm slightly worried about that. Maybe it's OK for a first version, but it'd be annoying in a query where a sort is below a merge join, for example, so that you can't do the final merge on the fly because mark/restore support is needed.

One way to fix that would be to have all the parallel workers share the work files to begin with, and keep the "nFileBlocks" value in shared memory so that the workers won't overlap each other. Then all the blocks from different workers would be mixed together, though, which would hurt the sequential access pattern of the tapes, so each worker would need to allocate larger chunks to avoid that.
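Very roughly, that could look something like the following (everything here is hypothetical, just to illustrate the shared counter; none of these names exist in either patch):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Illustrative sketch only: workers allocate tape blocks from a single
 * shared work file by atomically bumping a shared block counter.
 * Claiming a multi-block chunk per allocation keeps each worker's writes
 * mostly sequential even though the workers' chunks interleave.
 */
#define SHARED_TAPE_CHUNK_BLOCKS 32     /* blocks claimed per allocation */

/* Lives in shared memory, one per shared work file. */
typedef struct SharedTapeFileState
{
    pg_atomic_uint64 nFileBlocks;       /* next unallocated block in file */
} SharedTapeFileState;

/* Backend-local allocator state, one per worker. */
typedef struct WorkerBlockAllocator
{
    SharedTapeFileState *shared;
    uint64      nextBlock;              /* next block in current chunk */
    uint64      chunkEnd;               /* first block past current chunk */
} WorkerBlockAllocator;

/* Return the next block number this worker may write to. */
static uint64
worker_alloc_block(WorkerBlockAllocator *a)
{
    if (a->nextBlock >= a->chunkEnd)
    {
        /* Current chunk exhausted; atomically claim a fresh one. */
        a->nextBlock = pg_atomic_fetch_add_u64(&a->shared->nFileBlocks,
                                               SHARED_TAPE_CHUNK_BLOCKS);
        a->chunkEnd = a->nextBlock + SHARED_TAPE_CHUNK_BLOCKS;
    }
    return a->nextBlock++;
}

The chunk size would then be the knob that trades how badly the workers' blocks interleave against how sequential each worker's I/O stays.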
- Heikki

Attachment