Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Using quicksort for every external sort run
Msg-id: CAM3SWZQG4jJHH7AXCXTN7iHKdEh4eqgn+dB30_qA_yLtwR2-4A@mail.gmail.com
In response to: Re: Using quicksort for every external sort run (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Dec 22, 2015 at 2:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 4:37 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> That's not fair. DEFAULT_EQ_SEL, DEFAULT_RANGE_INEQ_SEL, and
>> DEFAULT_NUM_DISTINCT are each about as arbitrary. We have to do
>> something, though.
>
> Sure, there are arbitrary numbers all over the code, driven by
> empirical observations about what factors are important to model. But
> this is not that. You don't have a thing called seq_page_cost and a
> thing called cpu_tuple_cost and then say, well, empirically the ratio
> is about 100:1, so let's make the former 1 and the latter 0.01. You
> just have some numbers, and it's not clear what, if anything, they
> actually represent.

What I find difficult to accept about what you say here is that at *this* level, something like cost_sort() has little to recommend it. It costs a sort of a text attribute at the same level as a sort of the same tuples on an int4 attribute (based on the default cpu_operator_cost for C functions, without any attempt to differentiate text and int4). Prior to 9.5, sorting text took about 5 - 10 times longer than a similar int4 sort. That's a pretty big difference, and yet I recall no complaints. The cost of a comparison in a sort can hardly be considered in isolation anyway -- cache efficiency is at least as important.

Of course, the point is that the goal of a cost model is not to simulate reality as closely as possible -- it's to produce a good outcome for performance purposes under realistic assumptions. Realistic assumptions include that you can't hope to account for certain differences in cost. Avoiding a terrible outcome is very important, but the worst case for useselection() is no worse than today's behavior (or a lost opportunity to do better than today's behavior).
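To make the cost_sort() complaint concrete, here is a rough sketch (illustrative Python, not the planner's actual code) of how the default model charges CPU cost for an in-memory sort: a flat per-comparison cost, conventionally twice cpu_operator_cost, times N log2(N) comparisons. Nothing in the charge distinguishes the datatype being compared.

```python
import math

CPU_OPERATOR_COST = 0.0025  # the planner's default cpu_operator_cost

def sort_cpu_cost(tuples, comparison_cost=2.0 * CPU_OPERATOR_COST):
    """CPU cost of an in-memory sort, roughly as cost_sort() charges it:
    a flat per-comparison cost times N * log2(N) comparisons."""
    if tuples < 2:
        return 0.0
    return comparison_cost * tuples * math.log2(tuples)

# The estimate is identical whether the sort key is int4 or (pre-9.5)
# text, even though the text comparator was really 5-10x more expensive:
print(sort_cpu_cost(1_000_000))
```

The point being that the model already ignores comparator-cost differences an order of magnitude larger than anything at issue here.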
Recently, the paper that was posted to the list about the Postgres optimizer stated formally what I had a good intuitive sense of for a long time: that better selectivity estimates are much more important than better cost models in practice. The "empirical observations" driving something like DEFAULT_EQ_SEL are very weak -- but what are you gonna do?

> The crossover point is clamped to a minimum of 40% [constant #1] and a
> maximum of 85% [constant #2] when the size of the SortTuple array is
> no more than MaxAllocSize. Between those bounds, the crossover point
> is 90% [constant #3] minus 7.5% [constant #4] per 32-byte increment
> [constant #5] of estimated average tuple size. On the other hand,
> when the estimated average tuple size exceeds MaxAllocSize, the
> crossover point is either 90% [constant #6] or 95% [constant #7]
> depending on whether the average tuple size is greater than 32 bytes
> [constant #8]. But if the row count hint is less than 50% [constant
> #9] of the rows we've already seen, then we ignore it and do not use
> selection.
>
> You make no attempt to justify why any of these numbers are correct,
> or what underlying physical reality they represent.

Just like selfuncs.h for the most part, then.

> The comment which describes the manner in which the crossover point is
> computed for SortTuple volumes under 1GB says "Starting from a
> threshold of 90%, refund 7.5% per 32 byte average-size-increment."
> That is a precise restatement of what the code does, but it doesn't
> attempt to explain why it's a good idea. Perhaps the reader should
> infer that the crossover point drops as the tuples get bigger, except
> that in the over-1GB case, a larger tuple size causes the crossover
> point to go *up* while in the under-1GB case, a larger tuple size
> causes the crossover point to go *down*. Concretely, if we're sorting
> 44,739,242 224-byte tuples, the estimated crossover point is 40%.
> If we're sorting 44,739,243 244-byte tuples, the estimated crossover
> point is 95%. That's an extremely sharp discontinuity, and it seems
> very unlikely that any real system behaves that way.

Again, the goal of the cost model is not to model reality as such. This cost model is conservative about using replacement selection. That makes sense when you consider that there tend to be far fewer external sorts on a realistic workload -- if we can cut that number in half, which seems quite possible, that's pretty good, especially from a DBA's practical perspective. I want to buffer DBAs against suddenly incurring more I/O, but not at the risk of having a far longer sort for the first run -- or at least with minimal exposure to that risk. The cost model weighs the cost of the hint being wrong to some degree (which is indeed novel). I think that makes sense in light of the costs and benefits in this case, although I will add that I'm not entirely comfortable with it. I just don't imagine that there is a solution that I will be fully comfortable with. There may be one that superficially looks correct, but I see little point in that.

> I'm prepared to concede that constant #9 - ignoring the input row
> estimate if we've already seen twice that many rows - probably doesn't
> need a whole lot of justification here, and what justification it does
> need is provided by the fact that (we think) replacement selection
> only wins when there are going to be less than 2 quicksorted runs.
> But the other 8 constants here have to have reasons why they exist,
> what they represent, and why they have the values they do, and that
> explanation needs to be something that can be understood by people
> besides you. The overall cost model needs some explanation of the
> theory of operation, too.

The cost model is extremely fudged. I think that the greatest problem it has is that it isn't explicit enough about that. But yes, let me concede more clearly: the cost model is based on frobbing.
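For the record, here is a sketch of what the quoted crossover rules compute, taken at face value. This is a hypothetical reconstruction from the constants Robert lists, not the patch itself; the mapping of the 90%/95% constants onto the 32-byte test is inferred from the 244-byte example, and I read "average tuple size exceeds MaxAllocSize" as the SortTuple array exceeding MaxAllocSize.

```python
MAX_ALLOC_SIZE = 0x3fffffff  # PostgreSQL's MaxAllocSize (1GB - 1 byte)

def crossover(avg_tuple_size, memtuples_bytes):
    """Fraction of the input past which replacement selection would be
    abandoned -- reconstructed from the constants quoted above."""
    if memtuples_bytes <= MAX_ALLOC_SIZE:
        # Start at 90% and refund 7.5% per 32-byte increment of
        # estimated average tuple size, clamped to the [40%, 85%] range.
        point = 0.90 - 0.075 * (avg_tuple_size / 32.0)
        return min(0.85, max(0.40, point))
    # SortTuple array over MaxAllocSize: a flat 95% for tuples wider
    # than 32 bytes, else 90% (mapping inferred from the example above).
    return 0.95 if avg_tuple_size > 32 else 0.90

# The sharp discontinuity Robert describes (at 24 bytes per SortTuple,
# one extra tuple pushes the array past MaxAllocSize):
print(crossover(224, 44_739_242 * 24))  # 0.4
print(crossover(244, 44_739_243 * 24))  # 0.95
```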
But at least the cost model is relatively honest about being fudged, and it is relatively simple. I think it might be possible to make it simpler, but I have a feeling that anything we come up with will basically have the same quality that you so dislike. I don't know how to do better. Frankly, I'd rather be roughly correct than exactly wrong.

> In my opinion, reasoning in terms of a crossover point is a strange
> way of approaching the problem. What would be more typical, at least
> in our code, and I suspect in general, is to do a cost estimate of
> using selection and a cost estimate of not using selection and compare
> them. Replacement selection has a CPU cost and an I/O cost, each of
> which is estimable based on the tuple count, chosen comparator, and
> expected I/O volume. Quicksort has those same costs, in different
> amounts. If those respective costs are accurately estimated, then you
> can pick the strategy with the lower cost and expect to win.

If you instrument the number of comparisons, I expect you'll find that master is very competitive with the patch in terms of the number of comparisons performed in total. I think it might even win (Knuth specifically addresses this, actually). Where does that leave your theory of how to build a cost model? Also, the disadvantage of replacement selection's heap is smaller with smaller work_mem settings -- this has been shown many times to make a *huge* difference. Can the alternative cost model be reasonably expected to incorporate that, too? Heap sort isn't cache oblivious, which is why we see these weird effects, so don't forget to have CPU cache size as an input into your cost model (or maybe use a magic value based on something like MaxAllocSize!). How do you propose to weigh the distributed cost of a lost opportunity to reduce I/O against the distributed cost of heapsort wasting system memory bandwidth? And so on, and so on... believe me, I could go on.
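On the instrumentation point, a quick way to count comparisons is to wrap the sort key in a counting comparator. The sketch below contrasts draining a binary heap (the structure replacement selection's run formation is built on) with a single in-memory sort. Python's sort is Timsort rather than quicksort, so this only illustrates the instrumentation technique and the rough comparability of total comparison counts, not the patch itself.

```python
import heapq
import random

class Key:
    """Sort key wrapper that counts every comparison made through it."""
    count = 0
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Key.count += 1
        return self.v < other.v

random.seed(42)
data = [random.random() for _ in range(10_000)]

# Drain a binary heap, as replacement selection's run formation would.
Key.count = 0
heap = [Key(v) for v in data]
heapq.heapify(heap)
heap_out = [heapq.heappop(heap).v for _ in range(len(data))]
heap_cmps = Key.count

# One in-memory sort of the same input.
Key.count = 0
sort_out = [k.v for k in sorted(Key(v) for v in data)]
sort_cmps = Key.count

assert heap_out == sort_out        # identical ordering either way
print(heap_cmps, sort_cmps)        # total comparisons for each approach
```

Both approaches are O(n log n) in comparisons; the decisive difference in practice is cache behavior, which a comparison count does not capture at all.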
By the way, I think that there needs to be a little work done on cost_sort() too, which so far I've avoided.

>> However, where replacement selection can still help is avoiding I/O
>> *entirely*. If we can avoid spilling 95% of tuples in the first place,
>> and quicksort the remaining (heapified) tuples that were not spilled,
>> and merge an in-memory run with an on-tape run, then we can win big.
>
> That's pretty much what I was trying to say, except that I'm curious
> to know whether replacement selection can win when it manages to
> generate a vastly longer run than what we get from quicksorting. Say
> quicksorting produces 10, or 100, or 1000 tapes, and replacement
> selection produces 1 due to a favorable data distribution.

I believe the answer is probably no, but if there is a counterexample, it probably isn't worth pursuing. To repeat myself, I started out with exactly the same intuition as you on that question, but changed my mind when my efforts to experimentally verify the intuition were not successful.

> I agree, but that's not what I proposed. You don't want to keep
> re-sorting to incorporate new tuples into the run, but if you've got
> 1010 tuples and you can fit 1000 tuples in memory, you can (a)
> quicksort the first 1000 tuples, (b) read in 10 more tuples, dumping
> the first 10 tuples from run 0 to disk, (c) quicksort the last 10
> tuples to create run 1, and then (d) merge run 0 [which is mostly in
> memory] with run 1 [which is entirely in memory]. In other words, yes,
> quicksorting doesn't let you add things to the sort incrementally, but
> you can still write out the run incrementally, writing only as many
> tuples as you need to dump to get the rest of the input data into
> memory.

Merging is still sorting. The 10 tuples are not very cheap to merge against the 1000 tuples, because you'll probably still end up reading most of the 1000 tuples to do so.
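A toy version of that point, using Robert's 1010-tuple example with hypothetical, evenly interleaved keys: even though run 1 holds only 10 tuples, the merge must touch nearly all of run 0 before run 1 is exhausted.

```python
# run 0: the 1000 quicksorted in-memory tuples; run 1: the 10 overflow
# tuples. The keys here are hypothetical and evenly interleaved.
run0 = [float(i) for i in range(1000)]
run1 = [i * 100 + 0.5 for i in range(10)]   # 0.5, 100.5, ..., 900.5

# Every run-0 tuple up to run 1's largest key must be compared and
# emitted before the 10-tuple run runs dry:
consumed = sum(1 for x in run0 if x <= max(run1))
print(consumed)   # 901 of the 1000 tuples in run 0
```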
Perhaps you anticipate that there will be roughly disjoint ranges of values in each run due to a logical/physical correlation, and so you won't have to read that many of the 1000 tuples, but this approach has no ability to buffer even one outlier value (unlike replacement selection, and in particular my approach within mergememruns()). The cost of heapifying 1.01 million tuples to spill 0.01 million tuples is pretty low (relative to the cost of sorting them, in particular). The only difference between what you say here and what I actually do is that the remaining tuples are heapified rather than sorted, and I quicksort everything together to "merge run 1 and run 0" rather than doing two quicksorts and a merge. I believe that this can be demonstrated to be cheaper.

Another factor is that the heap could be useful for other stuff in the future. As Simon Riggs pointed out, for deduplicating values as they're read in by tuplesort. (Okay, that's really the only other thing, but it's a good one.)

--
Peter Geoghegan