Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAA4eK1J0Lkbe+qUpD2ZOpAaSVacLKH8nsTDuELZ5FU+f5Xd12A@mail.gmail.com
Whole thread Raw
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Thu, Oct 20, 2016 at 12:03 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Gather Merge can't emit a tuple unless it has buffered at least one
>> tuple from every producer; otherwise, the next tuple it receives from
>> one of those producers might proceed whichever tuple it chooses to
>> emit.

Right. Now, after again looking at Gather Merge patch, I think I can
better understand how it performs merging.

>>  However, it doesn't need to wait until all of the workers are
>> completely done.  The leader only needs to be at least slightly ahead
>> of the slowest worker.  I'm not sure how that compares to Peter's
>> approach.
>
> I don't think that eager merging will prove all that effective,
> however it's implemented. I see a very I/O bound system when parallel
> CREATE INDEX merges serially. There is no obvious reason why you'd
> have a straggler worker process with CREATE INDEX, really.
>
>> What I'm worried about is that we're implementing two separate systems
>> to do the same thing, and that the parallel sort approach is actually
>> a lot less general.  I think it's possible to imagine a Parallel Sort
>> implementation which does things Gather Merge can't.  If all of the
>> workers collaborate to sort all of the data rather than each worker
>> sorting its own data, then you've got something which Gather Merge
>> can't match.  But this is not that.
>
> It's not that yet, certainly. I think I've sketched a path forward for
> making partitioning a part of logtape.c that is promising. The sharing
> of ranges within tapes and so on will probably have a significant
> amount in common with what I've come up with.
>
> I don't think that any executor infrastructure is a particularly good
> model when *batch output* is needed -- the tuple queue mechanism will
> be a significant bottleneck, particularly because it does not
> integrate read-ahead, etc.
>

Tuple queue mechanism might not be super-efficient for *batch output*
(cases where many tuples needs to be read and written), but I see no
reason why it will be slower than disk I/O which I think you are using
in the patch.  IIUC, in the patch each worker including leader does
the tape sort for it's share of tuples and then finally leader merges
and populates the index.  I am not sure if the mechanism used in patch
can be useful as compare to using tuple queue, if the workers can
finish their part of sorting in-memory.

>
> The bottom line is that it's inherently difficult for me to refute the
> idea that Gather Merge could do just as well as what I have here,
> because proving that involves adding a significant amount of new
> infrastructure (e.g., to teach the executor about IndexTuples).
>

I think, there could be a simpler way, like we can force the gather
merge node when all the tuples needs to be sorted and compute the time
till it merges all tuples.  Similarly, with your patch, we can wait
till final merge is completed.  However, after doing initial study of
both the patches, I feel one can construct cases where Gather Merge
can win and also there will be cases where your patch can win.  In
particular, the Gather Merge can win where workers needs to perform
sort mostly in-memory.  I am not sure if it's easy to get best of both
the worlds.

Your patch needs rebase and I noticed one warning.
sort\logtape.c(1422): warning C4700: uninitialized local variable 'lt' used

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "Constantin S. Pan"
Date:
Subject: Re: Fun fact about autovacuum and orphan temp tables
Next
From: Amit Kapila
Date:
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)