Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAM3SWZQX-ZThSg_avkMe_vsqYD_F-damRd=ks4vUD5pU_Zc7PQ@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
           Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers

On Wed, Nov 9, 2016 at 4:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I gather that 0001, which puts a cap on the number of tapes, is not
> actually related to the subject of this thread; it's an independent
> change that you think is a good idea. I reviewed the previous
> discussion on this topic upthread, between you and Heikki, which seems
> to me to contain more heat than light.

FWIW, I don't remember it that way. Heikki seemed to be uncomfortable
with the quasi-arbitrary choice of constant, rather than disagreeing
with the general idea of a cap. Or, maybe he thought I didn't go far
enough, and that I should have completely removed polyphase merge. I
think that removing polyphase merge would be an orthogonal change to
this, though.

> Now, on the other hand, as far as I can see, the actual amount of
> evidence that 0001 is a good idea which has been presented in this
> forum is pretty near zero. You've argued for it on theoretical
> grounds several times, but theoretical arguments are not a substitute
> for test results.

See the illustration in TAOCP, vol. III, page 273 in the second
edition -- "Fig. 70. Efficiency of Polyphase merge using Algorithm D".
I think that it's actually a real-world benchmark.

I guess I felt that no one ever argued that using as many tapes as
possible was sound on any grounds, even theoretical, and so I didn't
feel obligated to test it until asked to do so. I think that the
reason a cap like this didn't go in around the time that the growth
logic went in (2006) was simply that nobody followed up on it. If you
look at the archives, there is plenty of discussion of a cap like this
at the time.
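
To be concrete about what 0001 actually proposes, here is a minimal
sketch of the clamping idea. The names and the per-tape memory model
are simplifying assumptions -- this is not the actual tuplesort.c
code, nor the patch itself:

    /*
     * Illustrative only: derive merge order from the memory budget as
     * before, but clamp it to a fixed cap rather than letting it grow
     * without bound.  All names and constants here are assumptions.
     */
    #include <stdio.h>

    #define MIN_MERGE_ORDER  6              /* assumed floor on merge order */
    #define MERGE_ORDER_CAP  500            /* the cap under discussion */
    #define TAPE_BUFFER_SIZE (256L * 1024)  /* assumed memory per input tape */

    static int
    merge_order(long allowed_mem)
    {
        long    order = allowed_mem / TAPE_BUFFER_SIZE;

        if (order < MIN_MERGE_ORDER)
            order = MIN_MERGE_ORDER;
        if (order > MERGE_ORDER_CAP)
            order = MERGE_ORDER_CAP;        /* stop adding tapes past the cap */
        return (int) order;
    }

    int
    main(void)
    {
        /* With a 1GB budget, the uncapped order would be 4096 tapes. */
        printf("merge order: %d\n", merge_order(1024L * 1024 * 1024));
        return 0;
    }

The question is only where the cap should sit, not whether the merge
order should keep scaling with the memory budget forever.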
> That looks very good. On a test that runs for almost 3 hours, we
> saved more than 20 minutes. The overall runtime improvement is 23% in
> a case where we would not expect this patch to do particularly well;
> after all, without limiting the number of runs, we are able to
> complete the sort with a single merge pass, whereas when we reduce the
> number of runs, we now require a polyphase merge. Nevertheless, we
> come out way ahead, because the final merge pass gets way faster,
> presumably because there are fewer tapes involved. The first test
> does a 616-way final merge and takes 6184.34 seconds to do it. The
> second test does a 501-way final merge and takes 4519.31 seconds to
> do it. This increased final merge speed accounts for practically all
> of the speedup, and the reason it's faster pretty much has to be that
> it's merging fewer tapes.

It's CPU cache efficiency -- it has to be.

> That, in turn, happens for two reasons. First, because limiting the
> number of tapes slightly increases the memory available for storing
> the tuples belonging to each run, we end up with fewer runs in the
> first place. The number of runs drops from 616 to 577, about a
> 7% reduction. Second, because we have more runs than tapes in the
> second case, it does a 77-way merge prior to the final merge. Because
> of that 77-way merge, the time at which the second test starts
> producing tuples is slightly later. Instead of producing the first
> tuple at 70:47.71, we have to wait until 75:72.22. That's a small
> disadvantage in this case, because it's hypothetically possible that a
> query like this could have a LIMIT and we'd end up worse off overall.
> However, that's pretty unlikely, for three reasons. Number one, LIMIT
> isn't likely to be used on queries of this type in the first place.
> Number two, if it were used, we'd probably end up with a bounded sort
> plan which would be way faster anyway. Number three, if somehow we
> still sorted the data set, we'd still win in this case if the limit
> were more than about 20% of the total number of tuples. The much
> faster run time to produce the whole data set is a small price to pay
> for possibly needing to wait a little longer for the first tuple.

Cool.

> So, I'm now feeling pretty bullish about this patch, except for one
> thing, which is that I think the comments are way off-base. Peter
> writes: "When allowedMem is significantly lower than what is required
> for an internal sort, it is unlikely that there are benefits to
> increasing the number of tapes beyond Knuth's 'sweet spot' of 7."
> I'm pretty sure that's totally wrong, first of all because commit
> df700e6b40195d28dc764e0c694ac8cef90d4638 improved performance by doing
> precisely the thing which this comment says we shouldn't.

It's more complicated than that. As I said, I think that Knuth
basically had it right with his sweet spot of 7. I think that commit
df700e6b40195d28dc764e0c694ac8cef90d4638 was effective in large part
because a one-pass merge avoided certain overheads that are not
inherent to polyphase merge, like all that memory accounting stuff,
extra palloc() traffic, etc. The expanded use of per-tape buffering
that we have even in multi-pass cases likely makes that much less true
for us these days.

The reason I haven't gone right back down to 7 with this cap is that
it's possible that the added I/O costs would outweigh the saved CPU
costs in extreme cases, even though I think that polyphase merge
doesn't have all that much to do with I/O costs, even from its 1970s
perspective. Knuth doesn't say much about I/O costs -- it's more about
using an extremely small amount of memory effectively (minimizing CPU
costs with very little available main memory).

Furthermore, not limiting ourselves to 7 tapes and seeing a benefit
(benefiting from a few dozen or a few hundred tapes instead) seems
more plausible with the improved merge heap maintenance logic added
recently, where perhaps hundreds of runs could be merged with very low
CPU cost in the event of presorted input (or input that is inversely
logically/physically correlated). That would be true because we'd only
examine the top of the heap throughout, and so I/O costs may matter
much more.

Depending on the exact details, I bet you could see a benefit with
only 7 tapes due to CPU cache efficiency in a case like the one you
describe -- perhaps when sorting integers, but not when sorting
collated text. There are many competing considerations, which I've
tried my best to balance here with a merge order of 500.

> Sound OK?

I'm fine with not mentioning Knuth's sweet spot once more. I guess
it's not of much practical value that he was on to something with
that. I realize, on reflection, that my understanding of what's going
on is very nuanced.

Thanks

--
Peter Geoghegan
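
P.S. The "top of the heap" point is easy to see in a toy k-way merge.
Here is a self-contained sketch (illustrative C only, not tuplesort's
actual heap code): after the minimum is emitted, the heap root is
replaced with the next tuple from the same run and sifted down. With
presorted input the replacement usually still belongs at the root, so
the sift-down stops after at most two comparisons, and the rest of the
heap -- however many runs are being merged -- is never touched.

    /*
     * Toy k-way merge over sorted runs using a binary min-heap of run
     * cursors.  Illustrative sketch only, not PostgreSQL code.
     */
    #include <stdio.h>

    #define NRUNS 4

    typedef struct
    {
        const int  *vals;
        int         pos;
        int         len;
    } RunCursor;

    static RunCursor runs[NRUNS];
    static int heap[NRUNS];     /* run indexes, ordered by current value */
    static int heap_size;

    static int
    cur(int run)
    {
        return runs[run].vals[runs[run].pos];
    }

    static void
    sift_down(int i)
    {
        for (;;)
        {
            int smallest = i;
            int l = 2 * i + 1;
            int r = 2 * i + 2;

            if (l < heap_size && cur(heap[l]) < cur(heap[smallest]))
                smallest = l;
            if (r < heap_size && cur(heap[r]) < cur(heap[smallest]))
                smallest = r;
            if (smallest == i)
                return;         /* presorted input: usually exits right here */

            int tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    int
    main(void)
    {
        /* Runs chosen so that each run "keeps winning" for a while. */
        static const int a[] = {1, 2, 3};
        static const int b[] = {10, 11, 12};
        static const int c[] = {20, 21, 22};
        static const int d[] = {30, 31, 32};

        runs[0] = (RunCursor) {a, 0, 3};
        runs[1] = (RunCursor) {b, 0, 3};
        runs[2] = (RunCursor) {c, 0, 3};
        runs[3] = (RunCursor) {d, 0, 3};

        heap_size = NRUNS;
        for (int i = 0; i < NRUNS; i++)
            heap[i] = i;
        for (int i = NRUNS / 2 - 1; i >= 0; i--)
            sift_down(i);

        while (heap_size > 0)
        {
            int run = heap[0];

            printf("%d ", cur(run));
            if (++runs[run].pos < runs[run].len)
                sift_down(0);   /* replace root with run's next tuple */
            else
            {
                heap[0] = heap[--heap_size];    /* run exhausted: shrink */
                sift_down(0);
            }
        }
        putchar('\n');
        return 0;
    }

When runs contribute tuples round-robin instead, each sift-down walks
a log2(nruns)-deep path and touches that many more cache lines, which
is where a smaller merge order can pay off.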