Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAM3SWZQKGDkH-qFj1zEuTcb=AHXGh=59vCe88PVYGiY3zGy7uw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
On Mon, Oct 17, 2016 at 5:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but what is the proof or why do you think second is going to
> better than first?

I don't have proof. It's my opinion that it probably would be, based
on partial information, and my intuition. It's hard to prove something
like that, because it's not really clear what that alternative would
look like. Also, finding all of that out would take a long time --
it's hard to prototype. Do tuple table slots need to care about
IndexTuples now? What does that even look like? What existing executor
code needs to be taught about this new requirement?

> One thing which strikes as a major difference
> between your approach and Gather Merge is that in your approach leader
> has to wait till all the workers have done with their work on sorting
> whereas with Gather Merge as soon as first one is done, leader starts
> with merging.  I could be wrong here, but if I understood it
> correctly, then there is a argument that Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others and I am not sure if we can dismiss such cases.

How can it? You need to have at least one tuple from every worker
(before the worker has exhausted its supply of output tuples) in order
to merge to return the next tuple to the top level consumer (the thing
above the Gather Merge). If you're talking about "eager vs. lazy
merging", please see my previous remarks on that, on this thread. (In
any case, whether we merge more eagerly seems like an orthogonal
question to the one you ask.)

The first thing to note about my approach is that I openly acknowledge
that this parallel CREATE INDEX patch is not much use for parallel
query. I have only generalized tuplesort.c to support parallelizing a
sort operation. I think that parallel query needs partitioning to push
down parts of a sort to workers, with little or no need for them to be
funneled together at the end, since most tuples are eliminated before
being passed to the Gather/Gather Merge node. The partitioning part is
really hard.

I guess that Gather Merge nodes have value because they allow us to
preserve the sorted-ness of a parallel path, which might be most
useful because it enables things elsewhere. But, that doesn't really
recommend making Gather Merge nodes good at batch processing a large
number of tuples, I suspect. (One problem with the tuple queue
mechanism is that it can be a big bottleneck -- that's why we want to
eliminate most tuples before they're passed up to the leader, in the
case of parallel sequential scan in 9.6.)

I read the following paragraph from the Volcano paper just now:

"""
During implementation and benchmarking of parallel sorting, we added
two more features to exchange. First, we wanted to implement a merge
network in which some processors produce sorted streams merge
concurrently by other processors. Volcano’s sort iterator can be used
to generate a sorted stream. A merge iterator was easily derived from
the sort module. It uses a single level merge, instead of the cascaded
merge of runs used in sort. The input of a merge iterator is an
exchange. Differently from other operators, the merge iterator
requires to distinguish the input records by their producer. As an
example, for a join operation it does not matter where the input
records were created, and all inputs can be accumulated in a single
input stream. For a merge operation, it is crucial to distinguish the
input records by their producer in order to merge multiple sorted
streams correctly.
"""

I don't really understand this paragraph, but thought I'd ask: why the
need to "distinguish the input records by their producer in order to
merge multiple sorted streams correctly"? Isn't that talking about
partitioning, where each workers *ownership* of a range matters? My
patch doesn't care which values belong to which workers. And, it
focuses quite a lot on dealing well with the memory bandwidth bound,
I/O bound part of the sort where we write out the index itself, just
by piggy-backing on tuplesort.c. I don't think that that's useful for
a general-purpose executor node -- tuple-at-a-time processing when
fetching from workers would kill performance.

> By looking at 'workersFinished' usage, it looks like you have devised
> a new way for leader to know when workers have finished which might be
> required for this patch.  However, have you tried to use or
> investigate if existing infrastructure which serves same purpose could
> be used for it?

Yes, I have. I think that Robert's "condition variables" patch would
offer a general solution to what I've devised.

What I have there is, as you say, fairly ad-hoc, even though my
requirements are actually fairly general. I was actually annoyed that
there wasn't an easier way to do that myself. Robert has said that he
won't commit his "condition variables" work until it's clear that
there will be some use for the facility. Well, I'd use it for this
patch, if I could. Robert?

--
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Yury Zhuravlev
Date:
Subject: Re: Add PGDLLEXPORT to PG_FUNCTION_INFO_V1
Next
From: Stephen Frost
Date:
Subject: Re: [ADMIN] 9.5 new setting "cluster name" and logging