Re: Parallel tuplesort, partitioning, merging, and the future - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel tuplesort, partitioning, merging, and the future
Date
Msg-id CA+TgmoagR0v5TF7a33HHAqxaFPLDkgL55eeHwUONAVjg4Jr8rw@mail.gmail.com
Whole thread Raw
In response to Parallel tuplesort, partitioning, merging, and the future  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Parallel tuplesort, partitioning, merging, and the future
Re: Parallel tuplesort, partitioning, merging, and the future
List pgsql-hackers
On Mon, Aug 8, 2016 at 3:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I don't think partitioning is urgent for CREATE INDEX, and may be
> inappropriate for CREATE INDEX under any circumstances, because:
>
> * Possible problems with parallel infrastructure and writes.
> * Unbalanced B-Trees (or the risk thereof).
> * What I've come up with is minimally divergent from the existing
> approach to tuplesorting.

My view on this - currently anyway - is that we shouldn't conflate the
tuplesort with the subsequent index generation, but that we should try
to use parallelism within the tuplesort itself to the greatest extent
possible.  If there is a single output stream that the leader uses to
generate the final index, then none of the above problems arise.  They
only arise if you've got multiple processes actually writing to the
index.

> Suggested partitioning algorithm
> ================================
>
> I think a hybrid partitioning + merging approach would work well for
> us. The paper "Parallel Sorting on a Shared-Nothing Architecture using
> Probabilistic Splitting" [3] has influenced my thinking here (this was
> written by prominent researchers from the influential UW-Madison
> Wisconsin database group). Currently, I have in mind something that is
> closer to what they call exact splitting to what they call
> probabilistic splitting, because I don't think it's going to be
> generally possible to have good statistics on partition boundaries
> immediately available (e.g., through something like their
> probabilistic splitting sampling the relation ahead of time).
>
> The basic idea I have in mind is that we create runs in workers in the
> same way that the parallel CREATE INDEX patch does (one output run per
> worker). However, rather than merging in the leader, we use a
> splitting algorithm to determine partition boundaries on-the-fly. The
> logical tape stuff then does a series of binary searches to find those
> exact split points within each worker's "final" tape. Each worker
> reports the boundary points of its original materialized output run in
> shared memory. Then, the leader instructs workers to "redistribute"
> slices of their final runs among each other, by changing the tapeset
> metadata to reflect that each worker has nworker input tapes with
> redrawn offsets into a unified BufFile. Workers immediately begin
> their own private on-the-fly merges.

Yeah, this is pretty cool.  You end up with the final merge segmented
into N submerges producing nonoverlapping ranges.  So you could have
the leader perform submerge 0 itself, and while it's doing that the
other workers can perform submerges 1..N.  By the time  the leader
finishes submerge 0, the remaining submerges will likely be complete
and after that the leader can just read the outputs of those submerges
one after another and it has basically no additional work to do.

It might be a good idea to divide the work into a number of submerges
substantially greater than the number of workers.  For example,
suppose we expect between 1 and 4 workers, but we partition the work
into 64 submerges.  The leader claims submerge 0, which is only 1/64
of the total.  By the time it finishes consuming those tuples,
submerge 1 will likely be done.  Hopefully, even if there are only 1
or 2 workers, they can keep ahead of the leader so that very little of
the merging happens in the leader.  Also, if some submerges go faster
than others, the distribution of work among workers remains even,
because the ones that go quicker will handle more of the submerges and
the ones that go slower will handle fewer.

I think that last part is a very important property; my intuition is
that dividing up the work between cooperating processes in a way that
should come out equal will often fail to do so, either due to the
operating system scheduler or due to some data being cached and other
data not being cached or due to the comparator running faster on some
data than other data or due to NUMA effects that make some processes
run faster than others or due to any number of other causes.  So I
think that algorithms that allocate the work dynamically are going to
greatly outperform those that use a division of labor which is fixed
at the beginning of a computation phase.

> Clearly it's really hard to be sure that this is the right thing at
> this point, but my intuition is that this is the way to go (while
> avoiding anything like this for CREATE INDEX). I'd like to know how
> others feel about it.

The number of others weighing in on these topics is surely less than
either of us would like, but hopefully we can find a way to make
progress anyhow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: Proposal for CSN based snapshots
Next
From: Claudio Freire
Date:
Subject: Re: Parallel tuplesort, partitioning, merging, and the future