Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CA+Tgmobi5zSYAVO9usxMWzfJxnkoh0p2mgP-ZeO=+o1iDOv-EQ@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Mon, Nov 7, 2016 at 11:28 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I attach V5.

I gather that 0001, which puts a cap on the number of tapes, is not actually related to the subject of this thread; it's an independent change that you think is a good idea. I reviewed the previous discussion on this topic upthread, between you and Heikki, which seems to me to contain more heat than light. At least in my opinion, the question is not whether a limit on the number of tapes is the best possible system, but rather whether it's better than the status quo. It's silly to refuse to make a simple change on the grounds that some much more complex change might be better, because if somebody writes that patch and it is better we can always revert 0001 then. If 0001 involved hundreds of lines of invasive code changes, that argument wouldn't apply, but it doesn't; it's almost a one-liner.

Now, on the other hand, as far as I can see, the actual amount of evidence that 0001 is a good idea which has been presented in this forum is pretty near zero. You've argued for it on theoretical grounds several times, but theoretical arguments are not a substitute for test results. Therefore, I decided that the best thing to do was test it myself.

I wrote a little patch to add a GUC for max_sort_tapes, which actually turns out not to work as I thought: setting max_sort_tapes = 501 seems to limit the highest tape number to 501 rather than the number of tapes to 501, so there's a sort of off-by-one error. But that doesn't really matter. The patch is attached here for the convenience of anyone else who may want to fiddle with this.
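To make the shape of the change concrete, here is a toy, self-contained sketch of the kind of clamp involved. This is purely illustrative: the names are made up, and the real logic lives in tuplesort.c and in the attached test patch, neither of which is reproduced here.

/*
 * Toy illustration only -- this is neither the attached GUC patch nor the
 * real tuplesort.c change.  It just shows the kind of one-line clamp under
 * discussion: however many tapes workMem could otherwise afford, never use
 * more than a fixed cap.
 */
#include <stdio.h>

#define TAPE_CAP 501            /* stand-in for the max_sort_tapes setting */

static int
clamp_tapes(int tapes_afforded_by_workmem)
{
    /* Use what memory allows, but never more than the cap. */
    return (tapes_afforded_by_workmem > TAPE_CAP)
        ? TAPE_CAP
        : tapes_afforded_by_workmem;
}

int
main(void)
{
    /* 937 is the tape count reported by the uncapped 256MB-workMem run below. */
    printf("large sort: %d tapes -> %d tapes\n", 937, clamp_tapes(937));
    /* A small sort that never reaches the cap is unaffected. */
    printf("small sort: %d tapes -> %d tapes\n", 40, clamp_tapes(40));
    return 0;
}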
Next, I tried to set things up so that I'd get a large enough number of tapes for the cap to matter. To do that, I initialized with "pgbench -i --unlogged-tables -s 20000" so that I had 2 billion tuples. Then I used this SQL query: "select sum(w+abalance) from (select (aid::numeric * 7123000217)%1000000000 w, * from pgbench_accounts order by 1) x". The point of the math is to perturb the ordering of the tuples so that they actually need to be sorted instead of just passed through unchanged. The use of abalance in the outer sum prevents an index-only scan from being used, which makes the sort wider; perhaps I should have tried to make it wider still, but this is what I did. I wanted to have more than 501 tapes because, obviously, a concern with a change like this is that things might get slower in the case where it forces a polyphase merge rather than a single merge pass. And, of course, I set trace_sort = on.

Here's what my initial run looked like, in brief:

2016-11-09 15:37:52 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 15:37:59 UTC [44026] LOG: switching to external sort with 937 tapes: CPU: user: 5.51 s, system: 0.27 s, elapsed: 6.56 s
2016-11-09 16:48:31 UTC [44026] LOG: finished writing run 616 to tape 615: CPU: user: 4029.17 s, system: 152.72 s, elapsed: 4238.54 s
2016-11-09 16:48:31 UTC [44026] LOG: using 246719 KB of memory for read buffers among 616 input tapes
2016-11-09 16:48:39 UTC [44026] LOG: performsort done (except 616-way final merge): CPU: user: 4030.30 s, system: 152.98 s, elapsed: 4247.41 s
2016-11-09 18:33:30 UTC [44026] LOG: external sort ended, 6255145 disk blocks used: CPU: user: 10214.64 s, system: 175.24 s, elapsed: 10538.06 s

And according to psql:

Time: 10538068.225 ms (02:55:38.068)

Then I set max_sort_tapes = 501 and ran it again. This time:

2016-11-09 19:05:22 UTC [44026] LOG: begin tuple sort: nkeys = 1, workMem = 262144, randomAccess = f
2016-11-09 19:05:28 UTC [44026] LOG: switching to external sort with 502 tapes: CPU: user: 5.69 s, system: 0.26 s, elapsed: 6.13 s
2016-11-09 20:15:20 UTC [44026] LOG: finished writing run 577 to tape 75: CPU: user: 3993.81 s, system: 153.42 s, elapsed: 4198.52 s
2016-11-09 20:15:20 UTC [44026] LOG: using 249594 KB of memory for read buffers among 501 input tapes
2016-11-09 20:21:19 UTC [44026] LOG: finished 77-way merge step: CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 20:21:19 UTC [44026] LOG: performsort done (except 501-way final merge): CPU: user: 4329.50 s, system: 160.67 s, elapsed: 4557.22 s
2016-11-09 21:38:12 UTC [44026] LOG: external sort ended, 6255484 disk blocks used: CPU: user: 8848.81 s, system: 182.64 s, elapsed: 9170.62 s

And this one, according to psql:

Time: 9170629.597 ms (02:32:50.630)

That looks very good. On a test that runs for almost 3 hours, we saved more than 20 minutes. The overall runtime improvement is 13% in a case where we would not expect this patch to do particularly well; after all, without limiting the number of tapes, we are able to complete the sort with a single merge pass, whereas when we reduce the number of tapes, we now require a polyphase merge. Nevertheless, we come out way ahead, because the final merge pass gets way faster, presumably because there are fewer tapes involved. The first test does a 616-way final merge and takes 6184.34 seconds to do it. The second test does a 501-way final merge and takes 4519.31 seconds to do it. This increased final merge speed accounts for practically all of the speedup, and the reason it's faster pretty much has to be that it's merging fewer tapes.
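Working the headline numbers out explicitly (this is just arithmetic on the figures already quoted above):

$$ \text{overall: } \frac{10538.06 - 9170.62}{10538.06} \approx 13\%, \qquad \text{final merge: } \frac{6184.34 - 4519.31}{6184.34} \approx 27\% $$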
That, in turn, happens for two reasons. First, because limiting the number of tapes slightly increases the memory available for storing the tuples belonging to each run, we end up with fewer runs in the first place. The number of runs drops from 616 to 577, about a 7% reduction. Second, because we have more runs than tapes in the second case, it does a 77-way merge prior to the final merge. Because of that 77-way merge, the time at which the second test starts producing tuples is slightly later. Instead of producing the first tuple at 70:47.71, we have to wait until 75:57.22.

That's a small disadvantage in this case, because it's hypothetically possible that a query like this could have a LIMIT and we'd end up worse off overall. However, that's pretty unlikely, for three reasons. Number one, LIMIT isn't likely to be used on queries of this type in the first place. Number two, if it were used, we'd probably end up with a bounded sort plan, which would be way faster anyway. Number three, if somehow we still sorted the whole data set, we'd still win in this case if the limit were more than about 20% of the total number of tuples. Possibly needing to wait a little longer for the first tuple is a small price to pay for the much faster run time to produce the whole data set.

Admittedly, this is only one test, and some other test might show a different result. However, I believe that there aren't likely to be many losing cases. If the reduced number of tapes doesn't force a polyphase merge, we're almost certain to win, because in that case the only thing that changes is that we have more memory with which to produce each run. On small sorts, this may not help much, but it won't hurt.

Even if the reduced number of tapes *does* force a polyphase merge, the reduction in the number of initial runs and/or the reduction in the number of runs in any single merge may add up to a win, as in this example. In fact, it may well be the case that the optimal number of tapes is significantly less than 501. It's hard to tell for sure, but it sure looks like that 77-way non-final merge is significantly more efficient than the final merge.

So, I'm now feeling pretty bullish about this patch, except for one thing, which is that I think the comments are way off-base. Peter writes:

    When allowedMem is significantly lower than what is required for an
    internal sort, it is unlikely that there are benefits to increasing
    the number of tapes beyond Knuth's "sweet spot" of 7.

I'm pretty sure that's totally wrong, first of all because commit df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing precisely the thing which this comment says we shouldn't, secondly because 501 is most definitely significantly higher than 7, so the code and the comment don't even match, and thirdly because, as the comment added in that commit says, each extra tape doesn't really cost that much. In this example, going from 501 tapes up to 937 tapes only reduces the memory available for tuples by about 7%, even though the number of tapes has almost doubled. If we had a sort with, say, 30 runs, do we really want to cut all the way back to 7 tapes and do a polyphase merge just to get a sub-1% increase in the amount of memory per run? I doubt it.
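Taking the figures just quoted at face value, and assuming the per-tape memory cost is roughly linear, the back-of-the-envelope version of that argument is:

$$ \frac{937}{501} \approx 1.87 \text{ (tape count nearly doubles)}, \qquad \frac{7\%}{937 - 501} \approx 0.016\% \text{ of tuple memory per extra tape} $$

so cutting a 30-run sort down to 7 tapes would claw back only on the order of (30 - 7) x 0.016% ~ 0.4%, well under 1%.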
Given all that, what I'm inclined to do is rewrite the comment to say, basically, that even though we can afford lots of tapes, it's better not to allow too ridiculously many, because (1) that eats away at the amount of memory available for tuples in each initial run and (2) very high-order final merges are not very efficient. And then commit that. If somebody wants to fine-tune the tape limit later after more extensive testing, or replace it with some other system that is better, great. Sound OK?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment