Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id CAM3SWZRhBhiknTF_=NjDSnNZ11hx=U_SEYwbc5vd=x7M4mMiCw@mail.gmail.com
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant
during tuplesort merging. I already mentioned this over on the
"Parallel tuplesort, partitioning, merging, and the future" thread. As
noted already on that thread, this patch makes merging clustered
numeric input about 2.1x faster overall in one case, which is
particularly useful in the context of a serial final/leader merge
during a parallel CREATE INDEX. Even *random* non-C-collated text
input is made significantly faster. This work is totally orthogonal to
parallelism, though; it's just very timely, given our discussion of
the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index with
presorted input, I see a 71% reduction in *comparisons* with 8
workers/tapes, and an 80% reduction with 16 workers/tapes, in one
instance (the numeric case I just mentioned). With random input we
still come out significantly ahead, though not to the same degree: in
one leader merge, the comparison count fell from 1,468,522,397 to
999,755,569, which is obviously still quite significant (worker
merges, if any, benefit too). I think I need to redo my parallel
CREATE INDEX benchmark, so that you can take this into account. I
also think that this patch will make very large external sorts that
naturally have tens of runs to merge significantly faster, but I
didn't bother to benchmark that.

The patch is intended to be applied on top of the parallel B-Tree
patches 0001-* and 0002-* [1]. I happened to test it with
parallelism, but these patches are all independently useful, and will
be entered as a separate CF entry (perhaps better to commit the
earlier two patches first, to avoid merge conflicts). I'm optimistic
that we can get those 3 patches in the series out of the way early,
without blocking on discussing parallel sort.

The patch makes tuplesort merging shift down and displace the root
tuple with the tape's next preread tuple, rather than compacting the
heap and then inserting into it anew. This approach to maintaining
the heap as tuples are returned to the caller will always perform
fewer comparisons overall. The new approach is also simpler. We were
already shifting down to compact the heap within the misleadingly
named [2] function tuplesort_heap_siftup() -- so why not just use the
caller tuple (the tuple that we currently go on to insert) when
initially shifting down, rather than the heap's preexisting last
tuple, which is guaranteed to go straight back to the leaf level
anyway? That way, we don't need to enlarge the heap at all through
insertion, shifting up, etc. We're done, and are *guaranteed* to have
performed less work (fewer comparisons and swaps) than with the
existing approach (this is the reason for my optimism about getting
this stuff out of the way early).
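
To make that concrete, here is a rough sketch of the new-style step.
This is a toy, not the actual tuplesort code: a made-up struct and
comparator stand in for SortTuple and the real comparators, and
"heap" just holds the current head tuple of each active tape, with
the overall minimum at heap[0].

/*
 * Rough sketch only: toy struct and comparator stand in for SortTuple
 * and the real comparators.
 */
#include <stddef.h>

typedef struct MergeItem
{
    int         key;            /* stand-in for the real tuple */
    int         tapeno;         /* tape this head element came from */
} MergeItem;

static int
cmp(const MergeItem *a, const MergeItem *b)
{
    return (a->key > b->key) - (a->key < b->key);
}

/*
 * New-style step: overwrite the root with the tape's next preread tuple
 * and shift it down to its final position.  One top-to-bottom pass; the
 * heap never shrinks and then grows again.
 */
static void
heap_replace_root(MergeItem *heap, size_t n, MergeItem next)
{
    size_t      i = 0;

    for (;;)
    {
        size_t      child = 2 * i + 1;

        if (child >= n)
            break;
        /* pick the smaller of the two children (one comparison) */
        if (child + 1 < n && cmp(&heap[child + 1], &heap[child]) < 0)
            child++;
        /* stop once the incoming tuple wins (one more comparison) */
        if (cmp(&next, &heap[child]) <= 0)
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = next;
}

Each returned tuple costs at most two comparisons per heap level this
way (pick the smaller child, then test the incoming tuple against
it), which is where the "minimum of 2 comparisons per item" figure
mentioned below comes from; the old compact-then-insert approach pays
for the extra insertion and shifting up on top of that.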

This new approach is more or less the *conventional* way to maintain
the heap invariant when returning elements from a heap during k-way
merging. Our existing approach is convoluted; merging was presumably
only coded that way because the generic functions
tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be
available. Perhaps the problem was masked by unrelated bottlenecks
that existed at the time, too.
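
For completeness, here is how the outer loop of a k-way merge drives
that step -- again just a toy, with three tiny in-memory arrays
standing in for logical tapes (nothing like the real preread
machinery), so that the two snippets together compile and run as-is:

#include <stdio.h>
#include <stdbool.h>

#define NTAPES 3

/* toy "tapes"; -1 marks exhaustion */
static const int tape_data[NTAPES][4] = {
    {1, 4, 7, -1},
    {2, 5, 8, -1},
    {3, 6, 9, -1}
};
static int tape_pos[NTAPES];

static bool
next_from_tape(int tapeno, MergeItem *out)
{
    int         v = tape_data[tapeno][tape_pos[tapeno]];

    if (v < 0)
        return false;           /* tape exhausted */
    tape_pos[tapeno]++;
    out->key = v;
    out->tapeno = tapeno;
    return true;
}

int
main(void)
{
    MergeItem   heap[NTAPES];
    size_t      n = 0;

    /* prime the heap with the first element of every tape */
    for (int t = 0; t < NTAPES; t++)
        if (next_from_tape(t, &heap[n]))
            n++;
    /* with at most three entries, a single sift-down at the root heapifies */
    if (n > 0)
        heap_replace_root(heap, n, heap[0]);

    while (n > 0)
    {
        MergeItem   winner = heap[0];
        MergeItem   next;

        printf("%d ", winner.key);      /* "return" the tuple to the caller */

        if (next_from_tape(winner.tapeno, &next))
            heap_replace_root(heap, n, next);   /* displace root in place */
        else
        {
            /* tape exhausted: shrink the heap; its old last element sifts down */
            n--;
            if (n > 0)
                heap_replace_root(heap, n, heap[n]);
        }
    }
    putchar('\n');
    return 0;
}

The only point of the exercise is the shape of the loop: every tuple
handed back to the caller is paid for by exactly one sift-down, and
nothing else.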

I think that I could push this further (a minimum of 2 comparisons per
item returned when 3 or more tapes are active still seems like 1
comparison too many), but what I have here gets us most of the
benefit. And, it does so while not actually adding code that could be
called "overly clever", IMV. I'll probably leave clever, aggressive
optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@mail.gmail.com
--
Peter Geoghegan
