Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id CA+Tgmobi5zSYAVO9usxMWzfJxnkoh0p2mgP-ZeO=+o1iDOv-EQ@mail.gmail.com
In response to Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)
List pgsql-hackers
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not
actually related to the subject of this thread; it's an independent
change that you think is a good idea.  I reviewed the previous
discussion on this topic upthread, between you and Heikki, which seems
to me to contain more heat than light.  At least in my opinion, the
question is not whether a limit on the number of tapes is the best
possible system, but rather whether it's better than the status quo.
It's silly to refuse to make a simple change on the grounds that some
much more complex change might be better, because if somebody writes
that patch and it is better we can always revert 0001 then.  If 0001
involved hundreds of lines of invasive code changes, that argument
wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of
evidence that 0001 is a good idea which has been presented in this
forum is pretty near zero.  You've argued for it on theoretical
grounds several times, but theoretical arguments are not a substitute
for test results.  Therefore, I decided that the best thing to do was
test it myself.  I wrote a little patch to add a GUC for
max_sort_tapes, which actually turns out not to work as I thought:
setting max_sort_tapes = 501 seems to limit the highest tape number to
501 rather than the number of tapes to 501, so there's a sort of
off-by-one error.  But that doesn't really matter.  The patch is
attached here for the convenience of anyone else who may want to
fiddle with this.
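
For anyone who doesn't want to read the patch, the shape of the thing
is roughly as follows.  This is a standalone sketch, not the real
tuplesort.c code -- the names and the 936 figure are only for
illustration -- but it shows one plausible way to end up with exactly
the off-by-one behavior I saw: the cap lands on the merge order, and
one more tape is then added for the output run, so max_sort_tapes =
501 gets you 502 tapes (numbered 0..501).

#include <stdio.h>

/*
 * Hypothetical illustration only -- not the actual tuplesort.c logic.
 * The merge order would really be derived from workMem; here it is
 * simply a parameter (936 is roughly what workMem = 256MB produced in
 * the uncapped run shown below).
 */
static int
compute_max_tapes(int merge_order, int max_sort_tapes)
{
    if (max_sort_tapes > 0 && merge_order > max_sort_tapes)
        merge_order = max_sort_tapes;   /* the cap */

    return merge_order + 1;             /* plus one tape for the output run */
}

int
main(void)
{
    printf("uncapped: %d tapes\n", compute_max_tapes(936, 0));      /* 937 */
    printf("capped:   %d tapes\n", compute_max_tapes(936, 501));    /* 502 */
    return 0;
}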

Next, I tried to set things up so that I'd get a large enough number
of tapes for the cap to matter.  To do that, I initialized with
"pgbench -i --unlogged-tables -s 20000" so that I had 2 billion
tuples.  Then I used this SQL query: "select sum(w+abalance) from
(select (aid::numeric * 7123000217)%1000000000 w, * from
pgbench_accounts order by 1) x".  The point of the math is to perturb
the ordering of the tuples so that they actually need to be sorted
instead of just passed through unchanged. The use of abalance in the
outer sum prevents an index-only-scan from being used, which makes the
sort wider; perhaps I should have tried to make it wider still, but
this is what I did.  I wanted to have more than 501 tapes because,
obviously, a concern with a change like this is that things might get
slower in the case where it forces a polyphase merge rather than a
single merge pass. And, of course, I set trace_sort = on.
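
To spell out what the math in that query is doing, here is the same
arithmetic in C, with GCC-style 128-bit integers standing in for
numeric (the product overflows int64 for the larger aid values, which
is why the query needs the cast in the first place):

#include <stdio.h>
#include <stdint.h>

/* Mirrors (aid::numeric * 7123000217) % 1000000000 from the query above. */
static uint64_t
perturb(uint64_t aid)
{
    /* aid goes up to 2 billion, so the product is ~1.4e19 and needs 128 bits */
    return (uint64_t) (((unsigned __int128) aid * 7123000217ULL) % 1000000000ULL);
}

int
main(void)
{
    /*
     * Each step adds 123000217 modulo a billion, so the key climbs for
     * about eight consecutive aids and then wraps; over two billion rows
     * the table is nowhere near sorted on this key.
     */
    for (uint64_t aid = 1; aid <= 10; aid++)
        printf("aid %2llu -> w %9llu\n",
               (unsigned long long) aid,
               (unsigned long long) perturb(aid));
    return 0;
}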

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG:  switching to external sort with
937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG:  finished writing run 616 to tape
615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG:  using 246719 KB of memory for
read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG:  performsort done (except 616-way
final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41
s
2016-11-09 18:33:30 UTC [44026] LOG:  external sort ended, 6255145
disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed:
10538.06 s

And according to psql: Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again.  This time:

2016-11-09 19:05:22 UTC [44026] LOG:  begin tuple sort: nkeys = 1,
workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG:  switching to external sort with
502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG:  finished writing run 577 to tape
75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG:  using 249594 KB of memory for
read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG:  finished 77-way merge step: CPU:
user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG:  performsort done (except 501-way
final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22
s
2016-11-09 21:38:12 UTC [44026] LOG:  external sort ended, 6255484
disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed:
9170.62 s

And this one, according to psql: Time: 9170629.597 ms (02:32:50.630)

That looks very good.  On a test that runs for almost 3 hours, we
saved more than 20 minutes.  The overall runtime improvement is about 13% in
a case where we would not expect this patch to do particularly well;
after all, without limiting the number of runs, we are able to
complete the sort with a single merge pass, whereas when we reduce the
number of runs, we now require a polyphase merge.  Nevertheless, we
come out way ahead, because the final merge pass gets way faster,
presumably because there are fewer tapes involved.  The first test
does a 616-way final merge and takes 6184.34 seconds to do it.  The
second test does a 501-way final merge and takes 4519.31 seconds to
do it.  This increased final merge speed accounts for practically all of
the speedup, and the reason it's faster pretty much has to be that
it's merging fewer tapes.
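
One more way to look at it, using the read-buffer lines from the logs
above: the same quarter-gigabyte of read-buffer memory is split across
fewer input tapes in the second run, so each tape gets a noticeably
bigger slice during the final merge.  (The division is straight from
the logged numbers; the inference that bigger per-tape buffers mean
more sequential I/O is my guess, not something the log proves.)

#include <stdio.h>

int
main(void)
{
    /* read-buffer figures from the two trace_sort runs above */
    double buf_kb_first  = 246719.0;    /* 937-tape run, 616 input tapes */
    double buf_kb_capped = 249594.0;    /* 501-tape run, 501 input tapes */

    printf("first run:  %.0f KB per input tape\n", buf_kb_first / 616);    /* ~401 KB */
    printf("capped run: %.0f KB per input tape\n", buf_kb_capped / 501);   /* ~498 KB */
    return 0;
}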

That, in turn, happens for two reasons.  First, because limiting the
number of tapes slightly increases the memory available for storing
the tuples belonging to each run, we end up with fewer runs in the
first place.  The number of runs drops from 616 to 577, about a
7% reduction.  Second, because we have more runs than tapes in the
second case, it does a 77-way merge prior to the final merge.  Because
of that 77-way merge, the time at which the second run starts
producing tuples is slightly later.  Instead of producing the first
tuple at 70:47.71, we have to wait until 75:57.22.  That's a small
disadvantage in this case, because it's hypothetically possible that a
query like this could have a LIMIT and we'd end up worse off overall.
However, that's pretty unlikely, for three reasons.  Number one, LIMIT
isn't likely to be used on queries of this type in the first place.
Number two, if it were used, we'd probably end up with a bounded sort
plan which would be way faster anyway.  Number three, if somehow we
still sorted the data set we'd still win in this case if the limit
were more than about 20% of the total number of tuples.  Possibly
needing to wait a little longer for the first tuple is a small price
to pay for a much faster run time to produce the whole data set.
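
For what it's worth, here is where that "about 20%" figure comes from
-- a back-of-the-envelope calculation assuming tuples come out of the
final merge at a roughly uniform rate, which is of course a
simplification:

#include <stdio.h>

int
main(void)
{
    /* first-tuple and total elapsed times from the two runs, in seconds */
    double first1 = 4247.71, total1 = 10538.07;    /* 937-tape run */
    double first2 = 4557.22, total2 = 9170.63;     /* 501-tape run */

    /*
     * Assume emitting a fraction f of the result takes
     * first + f * (total - first) and solve for the f at which the two
     * runs tie; above that fraction the capped run wins even with a LIMIT.
     */
    double f = (first2 - first1) / ((total1 - first1) - (total2 - first2));

    printf("break-even at about %.0f%% of the output\n", f * 100.0);    /* ~18% */
    return 0;
}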

Admittedly, this is only one test, and some other test might show a
different result.  However, I believe that there aren't likely to be
many losing cases.  If the reduced number of tapes doesn't force a
polyphase merge, we're almost certain to win, because in that case the
only thing that changes is that we have more memory with which to
produce each run.  On small sorts, this may not help much, but it
won't hurt.  Even if the reduced number of tapes *does* force a
polyphase merge, the reduction in the number of initial runs and/or
the reduction in the number of runs in any single merge may add up to
a win, as in this example.  In fact, it may well be the case that the
optimal number of tapes is significantly less than 501.  It's hard to
tell for sure, but it sure looks like that 77-way non-final merge is
significantly more efficient than the final merge.

So, I'm now feeling pretty bullish about this patch, except for one
thing, which is that I think the comments are way off-base. Peter
writes: $$When allowedMem is significantly lower than what is required
for an internal sort, it is unlikely that there are benefits to
increasing the number of tapes beyond Knuth's "sweet spot" of 7.$$
I'm pretty sure that's totally wrong, first of all because commit
df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
precisely the thing which this comment says we shouldn't, secondly
because 501 is most definitely significantly higher than 7 so the code
and the comment don't even match, and thirdly because, as the comment
added in the commit says, each extra tape doesn't really cost that
much.  In this example, going from 501 tapes up to 937 tapes only
reduces memory available for tuples by about 7%, even though the
number of tapes has almost doubled.  If we had a sort with, say, 30
runs, do we really want to do a polyphase merge just to get a sub-1%
increase in the amount of memory per run?  I doubt it.
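
To put a rough number on that last point: whatever the exact per-tape
overhead is (the real constant is a #define in tuplesort.c; the
figures below are just assumed orders of magnitude), giving up two
dozen tapes in a 30-run sort claws back a trivial amount of memory:

#include <stdio.h>

int
main(void)
{
    double work_mem_kb = 262144.0;    /* 256MB, as in the test above */
    int    tapes_single_pass = 31;    /* 30 runs plus an output tape */
    int    tapes_knuth = 7;           /* Knuth's "sweet spot" */

    /* assumed per-tape overheads in KB; the real value lives in tuplesort.c */
    double overhead_kb[] = {8.0, 24.0, 32.0};

    for (int i = 0; i < 3; i++)
    {
        double freed_kb = (tapes_single_pass - tapes_knuth) * overhead_kb[i];

        printf("at %2.0f KB/tape: dropping to 7 tapes frees %3.0f KB, "
               "about %.2f%% more memory for the runs\n",
               overhead_kb[i], freed_kb, 100.0 * freed_kb / work_mem_kb);
    }
    return 0;
}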

Given all that, what I'm inclined to do is rewrite the comment to say,
basically, that even though we can afford lots of tapes, it's better
not to allow too ridiculously many because (1) that eats away at the
amount of memory available for tuples in each initial run and (2) very
high-order final merges are not very efficient.  And then commit that.
If somebody wants to fine-tune the tape limit later after more
extensive testing or replacing it by some other system that is better,
great.

Sound OK?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

