Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Re: Parallel tuplesort (for parallel B-Tree index creation)
Date |
Msg-id | CAM3SWZRhBhiknTF_=NjDSnNZ11hx=U_SEYwbc5vd=x7M4mMiCw@mail.gmail.com
In response to | Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses | Re: Parallel tuplesort (for parallel B-Tree index creation)
List | pgsql-hackers
On Wed, Aug 3, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since merging is a big bottleneck with this, we should probably also
> work to address that indirectly.

I attach a patch that changes how we maintain the heap invariant during tuplesort merging. I already mentioned this over on the "Parallel tuplesort, partitioning, merging, and the future" thread. As noted already on that thread, this patch makes merging clustered numeric input about 2.1x faster overall in one case, which is particularly useful in the context of a serial final/leader merge during a parallel CREATE INDEX. Even *random* non-C-collated text input is made significantly faster. This work is totally orthogonal to parallelism, though; it's just very timely, given our discussion of the merge bottleneck on this thread.

If I benchmark a parallel build of a 100 million row index, with presorted input, I can see a 71% reduction in *comparisons* with 8 tapes/workers, and an 80% reduction in comparisons with 16 workers/tapes in one instance (the numeric case I just mentioned). With random input, we can still come out significantly ahead, but not to the same degree. I was able to see a reduction in comparisons during a leader merge, from 1,468,522,397 comparisons to 999,755,569 comparisons, which is obviously still quite significant (worker merges, if any, benefit too). I think I need to redo my parallel CREATE INDEX benchmark, so that you can take this into account. Also, I think that this patch will make very large external sorts that naturally have tens of runs to merge significantly faster, but I didn't bother to benchmark that.

The patch is intended to be applied on top of parallel B-Tree patches 0001-* and 0002-* [1]. I happened to test it with parallelism, but these are all independently useful, and will be entered as a separate CF entry (perhaps better to commit the earlier two patches first, to avoid merge conflicts). I'm optimistic that we can get those 3 patches in the series out of the way early, without blocking on discussing parallel sort.

The patch makes tuplesort merging shift down and displace the root tuple with the tape's next preread tuple, rather than compacting and then inserting into the heap anew. This approach to maintaining the heap as tuples are returned to the caller will always produce fewer comparisons overall. The new approach is also simpler. We were already shifting down to compact the heap within the misleadingly named [2] function tuplesort_heap_siftup() -- why not instead just use the caller tuple (the tuple that we currently go on to insert) when initially shifting down (not the heap's preexisting last tuple, which is guaranteed to go straight to the leaf level again)? That way, we don't need to enlarge the heap at all through insertion, shifting up, etc. We're done, and are *guaranteed* to have performed less work (fewer comparisons and swaps) than with the existing approach (this is the reason for my optimism about getting this stuff out of the way early).

This new approach is more or less the *conventional* way to maintain the heap invariant when returning elements from a heap during k-way merging. Our existing approach is convoluted; merging was presumably only coded that way because the generic functions tuplesort_heap_siftup() and tuplesort_heap_insert() happened to be available. Perhaps the problem was masked by unrelated bottlenecks that existed at the time, too.
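To make the shape of the change concrete, here is a minimal, self-contained C sketch of the "replace the root and sift down once" pattern. It is *not* the tuplesort.c code: MergeHeap, merge_heap_sift_down() and merge_heap_replace_root() are made-up names for illustration, and plain ints stand in for SortTuples tagged with their source tape.

/*
 * Simplified sketch of heap maintenance during a k-way merge: the smallest
 * element is returned by overwriting the root with the winning run's next
 * value and performing a single sift-down, instead of compacting the heap
 * and then re-inserting the new value with a separate sift-up pass.
 */
#include <stdio.h>

typedef struct
{
    int    *vals;   /* binary min-heap of keys */
    int     count;  /* number of live entries */
} MergeHeap;

/* Sift "newval" down from position 0 until the heap invariant holds again. */
static void
merge_heap_sift_down(MergeHeap *heap, int newval)
{
    int     i = 0;

    for (;;)
    {
        int     left = 2 * i + 1;
        int     right = left + 1;
        int     smallest = left;

        if (left >= heap->count)
            break;
        if (right < heap->count && heap->vals[right] < heap->vals[left])
            smallest = right;
        if (newval <= heap->vals[smallest])
            break;
        heap->vals[i] = heap->vals[smallest];   /* pull smaller child up */
        i = smallest;
    }
    heap->vals[i] = newval;
}

/*
 * Return the smallest value, putting "nextval" (the next value preread from
 * whichever input run the root came from) in its place.  One sift-down; the
 * heap never shrinks and grows back per returned element.
 */
static int
merge_heap_replace_root(MergeHeap *heap, int nextval)
{
    int     result = heap->vals[0];

    merge_heap_sift_down(heap, nextval);
    return result;
}

int
main(void)
{
    int       storage[] = {3, 7, 5, 9, 8};    /* already a valid min-heap */
    MergeHeap heap = {storage, 5};

    /* Consume the root and push 10 (the winning run's next key) in its place. */
    printf("returned %d, new root %d\n",
           merge_heap_replace_root(&heap, 10), heap.vals[0]);
    return 0;
}

The point of the sketch is only the cost structure described above: the compact-then-insert scheme pays for a sift-down of the heap's last element *and* a sift-up insertion of the preread tuple for every tuple returned, while the replacement approach pays for a single sift-down, which is why it can never do more comparisons.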
I think that I could push this further (a minimum of 2 comparisons per item returned when 3 or more tapes are active still seems like 1 comparison too many), but what I have here gets us most of the benefit. And, it does so while not actually adding code that could be called "overly clever", IMV. I'll probably leave clever, aggressive optimization of merging for a later release.

[1] https://www.postgresql.org/message-id/CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAM3SWZQ+2gJMNV7ChxwEXqXopLfb_FEW2RfEXHJ+GsYF39f6MQ@mail.gmail.com

--
Peter Geoghegan