Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CAM3SWZQX-ZThSg_avkMe_vsqYD_F-damRd=ks4vUD5pU_Zc7PQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)
Re: Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not
> actually related to the subject of this thread; it's an independent
> change that you think is a good idea.  I reviewed the previous
> discussion on this topic upthread, between you and Heikki, which seems
> to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable
with the quasi-arbitrary choice of constant, rather than disagreeing
with the general idea of a cap. Or, maybe he thought I didn't go far
enough, by completely removing polyphase merge. I think that removing
polyphase merge would be an orthogonal change to this, though.

> Now, on the other hand, as far as I can see, the actual amount of
> evidence that 0001 is a good idea which has been presented in this
> forum is pretty near zero.  You've argued for it on theoretical
> grounds several times, but theoretical arguments are not a substitute
> for test results.

See the illustration in TAOCP, vol III, page 273 in the second edition
-- "Fig. 70. Efficiency of Polyphase merge using Algorithm D". I think
that it's actually a real-world benchmark.

I guess I felt that no one ever argued that as many tapes as possible
was sound on any grounds, even theoretical, and so didn't feel
obligated to test it until asked to do so. I think that the reason
that a cap like this didn't go in around the time that the growth
logic went in (2006) was because nobody followed up on it. If you look
at the archives, there is plenty of discussion of a cap like this at
the time.

> That looks very good.  On a test that runs for almost 3 hours, we
> saved more than 20 minutes.  The overall runtime improvement is 23% in
> a case where we would not expect this patch to do particularly well;
> after all, without limiting the number of runs, we are able to
> complete the sort with a single merge pass, whereas when we reduce the
> number of runs, we now require a polyphase merge.  Nevertheless, we
> come out way ahead, because the final merge pass gets way faster,
> presumably because there are fewer tapes involved.  The first test
> does a 616-way final merge and takes 6184.34 seconds to do it.  The
> second test does a 501-way final merge and takes 4519.31 seconds to
> do.  This increased final merge speed accounts for practically all of
> the speedup, and the reason it's faster pretty much has to be that
> it's merging fewer tapes.

It's CPU cache efficiency -- has to be.

> That, in turn, happens for two reasons.  First, because limiting the
> number of tapes increases slightly the memory available for storing
> the tuples belonging to each run, we end up with fewer runs in the
> first place.  The number of runs drops from from 616 to 577, about a
> 7% reduction.  Second, because we have more runs than tapes in the
> second case, it does a 77-way merge prior to the final merge.  Because
> of that 77-way merge, the time at which the second run starts
> producing tuples is slightly later.  Instead of producing the first
> tuple at 70:47.71, we have to wait until 75:72.22.  That's a small
> disadvantage in this case, because it's hypothetically possible that a
> query like this could have a LIMIT and we'd end up worse off overall.
> However, that's pretty unlikely, for three reasons.  Number one, LIMIT
> isn't likely to be used on queries of this type in the first place.
> Number two, if it were used, we'd probably end up with a bounded sort
> plan which would be way faster anyway.  Number three, if somehow we
> still sorted the data set we'd still win in this case if the limit
> were more than about 20% of the total number of tuples.  The much
> faster run time to produce the whole data set is a small price to pay
> for possibly needing to wait a little longer for the first tuple.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one
> thing, which is that I think the comments are way off-base. Peter
> writes: $When allowedMem is significantly lower than what is required
> for an internal sort, it is unlikely that there are benefits to
> increasing the number of tapes beyond Knuth's "sweet spot" of 7.$
> I'm pretty sure that's totally wrong, first of all because commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
> precisely the thing which this comment says we shouldn't

It's more complicated than that. As I said, I think that Knuth
basically had it right with his sweet spot of 7. I think that commit
df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
because a one-pass merge avoided certain overheads not inherent to
polyphase merge, like all that memory accounting stuff, extra palloc()
traffic, etc. The expanded use of per tape buffering we have even in
multi-pass cases likely makes that much less true for us these days.

The reason I haven't actually gone right back down to 7 with this cap
is that it's possible that the added I/O costs outweigh the CPU costs
in extreme cases, even though I think that polyphase merge doesn't
have all that much to do with I/O costs, even with its 1970s
perspective. Knuth doesn't say much about I/O costs -- it's more about
using an extremely small amount of memory effectively (minimizing CPU
costs with very little available main memory).

Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
(benefitting from a few dozen or hundred instead) seems more possible
with the improved merge heap maintenance logic added recently, where
there could be perhaps hundreds of runs merged with very low CPU cost
in the event of presorted input (or, input that is inversely
logically/physically correlated). That would be true because we'd only
examine the top of the heap through, and so I/O costs may matter much
more.

Depending on the exact details, I bet you could see a benefit with
only 7 tapes due to CPU cache efficiency in a case like the one you
describe. Perhaps when sorting integers, but not when sorting collated
text. There are many competing considerations, which I've tried my
best to balance here with a merge order of 500.

> Sound OK?

I'm fine with not mentioning Knuth's sweet spot once more. I guess
it's not of much practical value that he was on to something with
that. I realize, on reflection, that my understanding of what's going
on is very nuanced.

Thanks
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Jim Nasby
Date:
Subject: Re: Add support for SRF and returning composites to pl/tcl
Next
From: Peter Geoghegan
Date:
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)