Re: Memory usage during sorting - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Memory usage during sorting
Date:
Msg-id: CA+TgmoZy-sYZSbHFdNAE6ZN44FGcaPb4mBuP1fJ-2R0xOghTkQ@mail.gmail.com
In response to: Re: Memory usage during sorting (Peter Geoghegan <peter@2ndquadrant.com>)
Responses: Re: Memory usage during sorting
List: pgsql-hackers
On Fri, Apr 13, 2012 at 1:15 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 13 April 2012 17:42, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> One insight that I had at the time was that text comparisons where
>> the C locale isn't used are really rather expensive, and I doubt
>> that there is much that can be done to address that directly.
>> However, if we were to support timsort, that could substantially cut
>> down on the number of comparisons for text columns, which could be a
>> real win. Maybe that would happen through some explicit mechanism,
>> or maybe the planner could actually infer that timsort was likely to
>> be optimal based on a high statistical correlation between the
>> physical row ordering and the logical ordering of a key column's
>> values.
>
> Further thoughts:
>
> At the time we committed our own quicksort implementation, based on
> the NetBSD one, we eschewed the optimisation of using insertion sort
> when n is fairly low. This happens to be a very common optimisation,
> so I'm not really super-confident that that was a good decision.
> However, we also added our own optimisation, which is to attempt,
> regardless of the size of n, to ascertain that the array is
> pre-sorted, in which case we don't quicksort at all.

We do use insertion sort for partitions of size < 7. I assume you are
referring to something else.

One thing that has struck me as odd is that we check for presorted
arrays at every level of recursion. That is, if we start out with 100
elements, we'll check whether the input is presorted. Then, if we
partition it into a block of 60 elements and a block of 40 elements,
we'll check each of those partitions over again to see whether they
are presorted. There is definitely some potential upside here: if the
input is mostly sorted, it may contain runs where everything is in
order, and we'll detect that and avoid some work. But it's also kind
of expensive. I've wondered whether we should, for example, move the
test for presorted input down a bit, so that it only happens in the
n > 40 case. Another idea is to arrange things so that we remember the
offset at which the last presort check failed. When we tail-recurse
into the left half of the array, there's no need to recheck: if the
presort check fell out after our partition boundary, the left half is
presorted; if it fell out before the partition boundary, it isn't.

> So if we attempt to quicksort an array which is almost pre-sorted,
> but, say, only has its very last element out of order, we'll do
> fairly horribly, because we waste the effort of almost an entire
> "bubble sort iteration". So almost-sorted data would seem to be a
> compelling thing to optimise for, both for that reason and for the
> simple observation that it isn't exactly uncommon in a relational
> database.

Heap-sorting could also benefit from some optimization in this area.
Right now, if you heap-sort data that's already in order, it gets
progressively slower as you crank up work_mem, because the heap gets
deeper, and so extraction and siftup get more expensive, for no real
benefit. We could do something like check, every time we use up our
available memory, whether the heap is already entirely in order. If
so, we could dump all but one tuple to the current run and then start
reading tuples again. Or maybe dump some percentage of the array,
though that might require a bit more bookkeeping. Or maybe there's a
smarter algorithm that would also cater to mostly-sorted data.
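A few toy sketches to make the above concrete. These use plain ints
and made-up names; none of it is actual PostgreSQL code.

First, the timsort point: the heart of what it buys on almost-sorted
input is natural-run detection, so fully ordered input costs only
n - 1 comparisons and no moves. A minimal sketch (ascending runs
only; the real algorithm also reverses descending runs and enforces a
minimum run length):

    #include <stddef.h>

    /*
     * Length of the maximal ascending run starting at index lo.
     * Timsort finds such runs and merges them, so input that is
     * already sorted is recognized in a single linear pass.
     */
    static size_t
    ascending_run_len(const int *a, size_t lo, size_t n)
    {
        size_t  hi;

        for (hi = lo + 1; hi < n && a[hi - 1] <= a[hi]; hi++)
            ;
        return hi - lo;
    }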
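Second, the qsort structure under discussion. This toy version shows
the insertion-sort cutoff for small partitions and the presort scan
that currently runs at every recursion level; the comments mark where
the "only in the n > 40 case" variant would move the scan:

    #include <stdbool.h>
    #include <stddef.h>

    static bool
    is_presorted(const int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++)
            if (a[i - 1] > a[i])
                return false;
        return true;
    }

    static void
    insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++)
        {
            int     key = a[i];
            size_t  j;

            for (j = i; j > 0 && a[j - 1] > key; j--)
                a[j] = a[j - 1];
            a[j] = key;
        }
    }

    static void
    toy_qsort(int *a, size_t n)
    {
        /* Partitions of size < 7 fall back to insertion sort. */
        if (n < 7)
        {
            insertion_sort(a, n);
            return;
        }

        /* Today: presort scan at every level of recursion. */
        if (is_presorted(a, n))
            return;

        /*
         * Proposed alternative: skip the scan above and do it only
         * when n > 40, alongside the more expensive pivot selection
         * already reserved for large partitions.
         */

        /* ... pivot choice, partitioning, recursion elided ... */
    }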
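Third, the heap check. Since the heap lives in a plain array,
"entirely in order" is a single linear scan, and a sorted array
trivially satisfies the min-heap invariant, so everything up to the
last element could go straight to the current run with no siftup
work. The scan is O(n) and would run only once per memory fill, so
it's cheap next to the heap maintenance it can save:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * True if the array backing the heap happens to be fully sorted.
     * If so, all but the final element can be dumped to the current
     * run directly, with no extraction or siftup at all.
     */
    static bool
    heap_is_fully_ordered(const int *heap, size_t n)
    {
        for (size_t i = 1; i < n; i++)
            if (heap[i - 1] > heap[i])
                return false;
        return true;
    }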
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company