Re: Memory usage during sorting - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Memory usage during sorting |
Date | |
Msg-id | CA+Tgmob-KrZoH88d4W0C-FEcjaF=WFKxPg=TfZ1HZZpvHddokg@mail.gmail.com Whole thread Raw |
In response to | Re: Memory usage during sorting (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Memory usage during sorting
Re: Memory usage during sorting Re: Memory usage during sorting |
List | pgsql-hackers |
On Sun, Mar 18, 2012 at 11:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So has somebody found a hole in the n log n lower bound on the cost of > comparison-based sorting? I thought that had been proven pretty > rigorously. There's not much danger of anyone finding a way around that bound since the proof is quite trivial, but keep in mind that it's a worst-case bound. For example, if you have reason to believe that the input is likely to already be sorted, you can check for that case in using just n-1 comparisons. If it turns out you're right, you can "sort" the data in linear time. Heck, you can also check for reverse-sorted input at the same time, if you like, and handle that special case too. Of course, when the data is unordered, such special-case checks hurt rather than help, even though it's still O(n lg n) overall; those pesky constant factors do matter quite a bit. Still, sometimes it's worth it: our quicksort algorithm checks for sorted input on each recursive invocation, since quicksort degrades rather badly in that scenario; but our heapsort doesn't contain a similar check, probably because heapsort typically works fine in that scenario. One thing that seems inefficient to me about our current algorithm is the use of the run number as a leading column in the sort key. There's no real reason why the tuples destined for the next run need to be maintained in heap order; we could just store them unordered and heapify the whole lot of them when it's time to start the next run. That ought to save comparison cycles for the current run, since the heap will be less deep, and there are other possible savings as well - in particular, an unordered array of tuples can be heapify'd in linear time, whereas our current algorithm is O(n lg n). However, when I actually implemented this, I found that doing it this way was a loser.In fact, excluding the next-run tuples from the heapfor the current run was a BIG loser even before you add in the time required to re-heapify. This confused the daylights out of me for a while, because it's hard to understand how insert and siftup can be slower on a small heap than a large one. My working theory is that, even though we must necessarily do more comparisons when manipulating a larger heap, many of those comparisons are resolved by comparing run numbers, and that's a lot cheaper than invoking the real comparator. For example, if we insert into a heap whose rightmost child is in the next run, and the new tuple goes into the current run, and the current size of the heap is such that the initial position of the new element is a descendent of the right node, it will very quickly crash through all of the next-run tuples and only need one REAL comparison - against the root. Either the new element ends up as the root, or it ends up as the direct child of the root; now we remove the root and, perhaps, replace it with its rightmost child, meaning that the next element we read in can do the exact same thing all over again. If we exclude the next-run elements from the heap, then the average depth of the heap when inserting a new element is smaller, but all the elements are in the same-run, and we have to invoke the real comparator every time. In other words, those next-run tuples act as dummies which effectively create a heap of uneven depth, and the average depth encountered when inserting tuples turns out to be less than what we get by pulling out the dummies and making the depth uniform. This makes me kind of nervous, because I don't understand why things should work out so well as all that. If throwing some dummy elements into the heap makes things work better, then maybe we should throw in some more. Or maybe it would work better to take some but not all of them out. There certainly doesn't seem to be any indication in the code that this is an anticipated effect, and it seems an awfully providential accident. It may (or may not) be related to the effect that Greg Stark mentions in a nearby post: increasing work_mem makes the heap bigger, so heap maintenance gets more expensive, sometimes without any corresponding benefit. For example, if the input happens to already be sorted, then inserting into the heap is cheap, because each new element is already greater than its parent, and life is good. But removing from the heap is painfully expensive, because to remove element from the heap we stick the last element in the heap into the whole left by the departing element and then sift it down. However, that requires 2*heap_depth comparisons, all of which involve really invoking the comparison function since everything's going to end up going into a single run, and that gets more and more expensive as work_mem increases. So a heap sort of already-sorted data seems to get slower and slower the more memory you give it to work with. For example, a sort of 5 million random strings of length 30, all in shared_buffers, takes 22.03 seconds on my MacBook Pro with work_mem = 128MB, but just 17.08 seconds with work_mem = 1MB. It turns out that it's possible to reduce the number of comparisons required to reinsert the last heap element from 2*heap_depth to heap_depth+lg(heap_depth). At each node, compare the two children, and then let the current node be the lesser of those. When you reach a leaf node, you've walked a path of length heap_depth which is known to be sorted (since we've got a heap), so you can find the correct reinsertion position by binary searching that path. In the case of the sorted data mentioned above, this reduces the time to 19.45 seconds with work_mem = 128MB and 16.12 seconds with work_mem = 1MB. However, in other cases, it seems not to help at all, or even be a slight regression. An obvious weakness of this algorithm is that the linear search we use now can terminate early, if it turns out that the element we're reinserting doesn't need to percolate downward very far, whereas this can't: so the 2*heap_depth bound is worst-case, but the heap_depth+lg(heap_depth) bound is exact. The problem of some comparisons being much cheaper than other also acts to muddy the waters - if there's just one element in the heap destined for the next run, reinserting it will be only about half as expensive, since we'll compare the descendents of each node to each other, but the comparison against the next-run element will be very cheap, and if all the tuples are the same size we may end up reinserting that particular node very frequently. Heikki wrote a patch implementing another "smart" reinsertion algorithm (what we call tuplesort_heap_siftup) which I won't attempt to explain that seems to work even a little bit better than the one mentioned above, but it seems to fall prey to some of the same issues - there are cases where it helps, but there are also cases where it doesn't do much or even regresses slightly, and I don't think either of us understand exactly how to tickle the various cases, let alone what causes them. What does seem clear is that the great bulk of the cost of a heap sort seems to be attributable to tuplesort_heap_siftup, and if we can find something that reduces the number of comparisons required for that operation we will more-or-less proportional speedup in heap-sorting overall. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: