Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CA+TgmoaMUY7p54UkkS+gq_W4woN7J9LHLUydKmCTJEwCE8ZD+A@mail.gmail.com
Whole thread Raw
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Mon, Oct 17, 2016 at 8:36 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> This project of mine is about parallelizing tuplesort.c, which isn't
>> really what you want for parallel query -- you shouldn't try to scope
>> the problem as "make the sort more scalable using parallelism" there.
>> Rather, you want to scope it at "make the execution of the entire
>> query more scalable using parallelism", which is really quite a
>> different thing, which necessarily involves the executor having direct
>> knowledge of partition boundaries.
>
> Okay, but what is the proof or why do you think second is going to
> better than first?  One thing which strikes as a major difference
> between your approach and Gather Merge is that in your approach leader
> has to wait till all the workers have done with their work on sorting
> whereas with Gather Merge as soon as first one is done, leader starts
> with merging.  I could be wrong here, but if I understood it
> correctly, then there is a argument that Gather Merge kind of approach
> can win in cases where some of the workers can produce sorted outputs
> ahead of others and I am not sure if we can dismiss such cases.

Gather Merge can't emit a tuple unless it has buffered at least one
tuple from every producer; otherwise, the next tuple it receives from
one of those producers might proceed whichever tuple it chooses to
emit.  However, it doesn't need to wait until all of the workers are
completely done.  The leader only needs to be at least slightly ahead
of the slowest worker.  I'm not sure how that compares to Peter's
approach.

What I'm worried about is that we're implementing two separate systems
to do the same thing, and that the parallel sort approach is actually
a lot less general.  I think it's possible to imagine a Parallel Sort
implementation which does things Gather Merge can't.  If all of the
workers collaborate to sort all of the data rather than each worker
sorting its own data, then you've got something which Gather Merge
can't match.  But this is not that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Indirect indexes
Next
From: Bruce Momjian
Date:
Subject: Re: emergency outage requiring database restart