Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id 3720c2e1-a41f-1df1-4c63-c875cc97afe7@iki.fi
In response to Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Tape unification
> ----------------
>
> Sort operations have a unique identifier, generated before any workers
> are launched, using a scheme based on the leader's PID, and a unique
> temp file number. This makes all on-disk state (temp files managed by
> logtape.c) discoverable by the leader process. State in shared memory
> is sized in proportion to the number of workers, so the only thing
> about the data being sorted that gets passed around in shared memory
> is a little logtape.c metadata for tapes, describing for example how
> large each constituent BufFile is (a BufFile associated with one
> particular worker's tapeset).
>
> (See below also for notes on buffile.c's role in all of this, fd.c and
> resource management, etc.)
>
> ...
>
> buffile.c, and "unification"
> ============================
>
> There has been significant new infrastructure added to make logtape.c
> aware of workers. buffile.c has in turn been taught about unification
> as a first class part of the abstraction, with low-level management of
> certain details occurring within fd.c. So, "tape unification" is added:
> processes open other backends' logical tapes to generate a unified
> logical tapeset for the leader to merge. This is probably the
> single biggest source of complexity for the patch, since I must
> consider:
>
> * Creating a general, reusable abstraction for other possible BufFile
> users (logtape.c only has to serve tuplesort.c, though).
>
> * Logical tape free space management.
>
> * Resource management, file lifetime, etc. fd.c resource management
> can now close a file at xact end for temp files, while not deleting it
> in the leader backend (only the "owning" worker backend deletes the
> temp file it owns).
>
> * Crash safety (e.g., when to truncate existing temp files, and when not to).

I find this unification business really complicated. I think it'd be
simpler to keep the BufFiles and LogicalTapeSets separate, and instead
teach tuplesort.c how to merge tapes that live on different
LogicalTapeSets/BufFiles. Or refactor LogicalTapeSet so that a single
LogicalTapeSet can contain tapes from different underlying BufFiles.

What I have in mind is something like the attached patch. It refactors
the LogicalTapeRead(), LogicalTapeWrite() etc. functions to take a
LogicalTape as the argument, instead of a LogicalTapeSet and tape number.
LogicalTapeSet no longer has the concept of a tape number; it can
contain any number of tapes, and you can create more on the fly. With
that, it'd be fairly easy to make tuplesort.c merge LogicalTapes that
came from different tape sets, backed by different BufFiles. I think
that'd avoid much of the unification code.
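
To make that concrete, here's a rough sketch of the kind of interface I
have in mind. These are declarations only; apart from the existing
LogicalTapeRead()/LogicalTapeWrite() names, the functions and signatures
below are made up for illustration, and the attached patch may differ in
detail:

    /* A tape is a first-class object, not an index into a tape set. */
    typedef struct LogicalTape LogicalTape;
    typedef struct LogicalTapeSet LogicalTapeSet;

    /* Create an empty tape set; tapes can be added to it on the fly. */
    extern LogicalTapeSet *LogicalTapeSetCreate(void);
    extern LogicalTape *LogicalTapeCreate(LogicalTapeSet *lts);

    /* Read/write take the tape itself; no tape number anywhere. */
    extern void LogicalTapeWrite(LogicalTape *lt, void *ptr, size_t size);
    extern size_t LogicalTapeRead(LogicalTape *lt, void *ptr, size_t size);

With tapes as first-class objects, tuplesort.c could merge an array of
LogicalTape pointers without caring which tape set, and hence which
underlying BufFile, each tape belongs to.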

That leaves one problem, though: reusing space in the final merge phase.
If the tapes being merged belong to different LogicalTapeSets, and we
create one new tape to hold the result, the new tape cannot easily reuse
the space of the input tapes because they are on different tape sets.
But looking at your patch, ISTM you actually dodged that problem as well:

> +     * As a consequence of only being permitted to write to the leader
> +     * controlled range, parallel sorts that require a final materialized tape
> +     * will use approximately twice the disk space for temp files compared to
> +     * a more or less equivalent serial sort.  This is deemed acceptable,
> +     * since it is far rarer in practice for parallel sort operations to
> +     * require a final materialized output tape.  Note that this does not
> +     * apply to any merge process required by workers, which may reuse space
> +     * eagerly, just like conventional serial external sorts, and so
> +     * typically, parallel sorts consume approximately the same amount of disk
> +     * blocks as a more or less equivalent serial sort, even when workers must
> +     * perform some merging to produce input to the leader.

I'm slightly worried about that. Maybe it's OK for a first version, but
it'd be annoying in a query where a sort is below a merge join, for
example, where you can't do the final merge on the fly because
mark/restore support is needed.

One way to fix that would be to have all the parallel workers share the
work files to begin with, and keep the "nFileBlocks" value in shared
memory so that the workers won't overlap each other. Then all the blocks
from different workers would be mixed together, though, which would hurt
the sequential access pattern of the tapes, so each worker would need to
allocate larger chunks of blocks to avoid that.

- Heikki


