Thread: Why do we still perform a check for pre-sorted input within qsort variants?
From
Peter Geoghegan
Date:
In the past, Robert and I have criticised the fact that our qsort implementation (and the various specialisations thereof) each perform a check for pre-sorted input. This check does not appear in the original NetBSD qsort that we lifted our implementation from, and presumably isn't described by the document 'Qsort routine from Bentley & McIlroy's "Engineering a Sort Function"' that that implementation is based on. Note that the presorted check occurs on each and every iteration, and not just the first, which was felt to be the really objectionable point [1]. As noted within qsort_arg.c:

 *	Remove ill-considered "swap_cnt" switch to insertion sort,
 *	in favor of a simple check for presorted input.

Looking at the archives from around the period that this was committed (October 2006), there doesn't seem to have been any discussion of this particular alteration.

Our qsort implementation seems to have rather bad worst-case performance [2] that I do not believe can be accounted for by the usual ways in which quicksort is known to "go quadratic" - we already take various measures against those that make them very unlikely.

It looks like Tom was quite right about the swap_cnt optimisation (it's ill-considered), because the NetBSD people recognised this fact in 2009 [3], almost 3 years later. They didn't decide to do something else instead, though.

I hope to avoid starting an epic thread on this at this point in the cycle. I'd just like to enquire if it's something that's been given any thought recently.

[1] http://www.postgresql.org/message-id/CA+TgmoaGCyGKUN1a6BfYoORbanw5pZZUFv-VASWBumD8T5fdEw@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAEYLb_W++UhrcWprzG9TyBVF7Sn-c1s9oLbABvAvPGdeP2DFSQ@mail.gmail.com
    (A contrived test case sees qsort_arg perform 2,247,314 comparisons to Timsort's 60,351)
[3] http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/stdlib/qsort.c?rev=1.20&content-type=text/x-cvsweb-markup&only_with_tag=MAIN

-- 
Regards,
Peter Geoghegan
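For readers following along, the check under discussion amounts to something like the sketch below: a single linear scan at the top of each qsort() invocation, including every recursive one, that returns early if the partition is already ordered. This is illustrative only; `partition_is_presorted` is a hypothetical name, not the actual PostgreSQL code.

```c
#include <stddef.h>

/*
 * Sketch of the kind of presorted check being discussed: one linear
 * scan at the top of each qsort() invocation, including every
 * recursive one.  Illustrative names, not the actual PostgreSQL code.
 */
static int
partition_is_presorted(const void *a, size_t n, size_t es,
                       int (*cmp) (const void *, const void *))
{
    const char *base = a;
    const char *pm;

    for (pm = base + es; pm < base + n * es; pm += es)
        if (cmp(pm - es, pm) > 0)
            return 0;           /* out of order: do the real sort */
    return 1;                   /* sorted: caller can return at once */
}
```

The scan itself is cheap (O(n) against an O(n*log(n)) sort); the objection raised in this thread is to repeating it at every recursion level.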
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> In the past, Robert and I have criticised the fact that our qsort
> implementation (and the various specialisations thereof) each perform
> a check for pre-sorted input. This check does not appear in the
> original NetBSD qsort that we lifted our implementation from, and
> presumably isn't described by the document 'Qsort routine from Bentley
> & McIlroy's "Engineering a Sort Function"' that that implementation is
> based on.

FWIW, I've been suspicious of that pre-sorted check since the day it went in.  Bentley was my faculty adviser for awhile in grad school, and I know him to be *way* too smart to have missed anything as simple as that.  But I didn't have hard evidence on which to object to it at the time, and indeed testing seemed to say it was a good idea:
http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us

If you can provide a refutation I will gladly see it out of there.

			regards, tom lane
I wrote:
> FWIW, I've been suspicious of that pre-sorted check since the day it
> went in.  Bentley was my faculty adviser for awhile in grad school,
> and I know him to be *way* too smart to have missed anything as simple
> as that.  But I didn't have hard evidence on which to object to it
> at the time, and indeed testing seemed to say it was a good idea:
> http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us

BTW, after further review --- one thing that seems a little fishy is that that test scaffolding made glibc's qsort look pretty good; which was at variance with our previous experience, in which our version of qsort seemed to dominate glibc's even before we took out the dubious "swap_cnt" code, cf thread at
http://www.postgresql.org/message-id/Pine.LNX.4.58.0512121138080.18520@eon.cs

So there is definitely some room to argue that B&M's test scaffolding doesn't match up with our real-world workloads.  But before tinkering too much with that code, it'd be good to understand why not, and to have a test case we trust more.

			regards, tom lane
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Robert Haas
Date:
On Mon, Feb 25, 2013 at 5:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
>> In the past, Robert and I have criticised the fact that our qsort
>> implementation (and the various specialisations thereof) each perform
>> a check for pre-sorted input. This check does not appear in the
>> original NetBSD qsort that we lifted our implementation from, and
>> presumably isn't described by the document 'Qsort routine from Bentley
>> & McIlroy's "Engineering a Sort Function"' that that implementation is
>> based on.
>
> FWIW, I've been suspicious of that pre-sorted check since the day it
> went in.  Bentley was my faculty adviser for awhile in grad school,
> and I know him to be *way* too smart to have missed anything as simple
> as that.  But I didn't have hard evidence on which to object to it
> at the time, and indeed testing seemed to say it was a good idea:
> http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us
>
> If you can provide a refutation I will gladly see it out of there.

The thing that bothers me about that check (which may be different than the thing that bothers Peter) is that we do it recursively at every level.  I have a sneaking suspicion that it only makes sense to do that check for subpartitions above a certain size.  I'm perfectly willing to grant that presorted data is a sufficiently important special case that we ought to check for it once, and I also suspect it's a good idea to check large partitions so as to speed up the sorting of almost-ordered data.  But I'm skeptical that it makes sense to check it at every recursive level.

I did attempt to do some tinkering with this while I was playing with it, but I didn't come up with anything really compelling.  You can reduce the number of comparisons on particular workloads by tinkering with the algorithm, but then somebody else ends up doing more comparisons, so it's hard to say whether you've really made things better.  Or at least I found it so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Peter Geoghegan
Date:
On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
> I did attempt to do some tinkering with this while I was playing with
> it, but I didn't come up with anything really compelling.  You can
> reduce the number of comparisons on particular workloads by tinkering
> with the algorithm, but then somebody else ends up doing more
> comparisons, so it's hard to say whether you've really made things
> better.  Or at least I found it so.

Right.

To be honest, the real reason that it bothers me is that everything else that our qsort routine does that differs from classic quicksort (mostly quadratic insurance, like the median-of-medians pivot selection, but also the fallback to insertion sort when n < 7) is very well supported by peer reviewed research.  Like Tom, I find it implausible that Sedgewick and others missed a trick, where we did not, particularly with something so simple.

-- 
Regards,
Peter Geoghegan
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Bruce Momjian
Date:
On Mon, Feb 25, 2013 at 02:31:21PM +0000, Peter Geoghegan wrote:
> On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
> > I did attempt to do some tinkering with this while I was playing with
> > it, but I didn't come up with anything really compelling.  You can
> > reduce the number of comparisons on particular workloads by tinkering
> > with the algorithm, but then somebody else ends up doing more
> > comparisons, so it's hard to say whether you've really made things
> > better.  Or at least I found it so.
>
> Right.
>
> To be honest, the real reason that it bothers me is that everything
> else that our qsort routine does that differs from classic quicksort
> (mostly quadratic insurance, like the median-of-medians pivot
> selection, but also the fallback to insertion sort when n < 7) is very
> well supported by peer reviewed research.  Like Tom, I find it
> implausible that Sedgewick and others missed a trick, where we did
> not, particularly with something so simple.

Perhaps we are more likely to be fed sorted data than a typical qsort use case.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
>-----Original Message-----
>From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
>Sent: Friday, March 08, 2013 11:22 AM
>To: Peter Geoghegan
>Cc: Robert Haas; Tom Lane; PG Hackers
>Subject: Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Mon, Feb 25, 2013 at 02:31:21PM +0000, Peter Geoghegan wrote:
>> On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
>> > I did attempt to do some tinkering with this while I was playing
>> > with it, but I didn't come up with anything really compelling.  You
>> > can reduce the number of comparisons on particular workloads by
>> > tinkering with the algorithm, but then somebody else ends up doing
>> > more comparisons, so it's hard to say whether you've really made
>> > things better.  Or at least I found it so.
>>
>> Right.
>>
>> To be honest, the real reason that it bothers me is that everything
>> else that our qsort routine does that differs from classic quicksort
>> (mostly quadratic insurance, like the median-of-medians pivot
>> selection, but also the fallback to insertion sort when n < 7) is very
>> well supported by peer reviewed research.  Like Tom, I find it
>> implausible that Sedgewick and others missed a trick, where we did
>> not, particularly with something so simple.
>
>Perhaps we are more likely to be fed sorted data than a typical qsort use case.

Checking for pre-sorted input will not make the routine faster on average.  However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12 operations when the bad condition does occur.

Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but you sure are glad you have it when an accident occurs.

Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.
Switching to insertion sort for small partitions is almost always a good idea.  This idea was invented by Richard Singleton as ACM algorithm 347.

A different sort of safeguard, as compared to checking for pre-ordered data, is to watch recursion depth.  If we find that we have already gone past a depth of log2(n) levels and we are not ready to shift to insertion sort, then there is some sort of perverse data set.  A traditional fix is to switch to heap sort at this point.  Since heap sort is also O(n*log(n)) {though the constant multiplier is larger than for quicksort on average} you never get O(n*n) behavior.  This method is called introspective sort.

There are, of course, other clever things that can be tried.  The relaxed weak heapsort of Edelkamp and Steigler, for instance, or using tries as in "Cache-Efficient String Sorting Using Copying" by Ranjan Sinha.  There is also possibly significant benefit to using radix sort for fixed width data that can be easily binned.

I seem to recall that a year or two back some study was done on quicksort methodology as used in PostgreSQL.  As I recall, the algorithm used in PostgreSQL fared well in the tests.
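The depth-limit safeguard described above can be sketched as follows.  This is a minimal illustration specialised to ints, with illustrative names (`intro_sort`, `heapsort_ints`); Musser's introsort and production implementations are more refined.

```c
#include <stddef.h>

/* Sift a[root] down within a max-heap of size n. */
static void
sift_down(int *a, size_t root, size_t n)
{
    int v = a[root];
    size_t j = root, k;

    while ((k = 2 * j + 1) < n) {
        if (k + 1 < n && a[k + 1] > a[k])
            k++;                    /* pick the larger child */
        if (a[k] <= v)
            break;
        a[j] = a[k];
        j = k;
    }
    a[j] = v;
}

/* O(n*log(n)) worst-case fallback. */
static void
heapsort_ints(int *a, size_t n)
{
    size_t i;

    if (n < 2)
        return;
    for (i = n / 2; i-- > 0; )
        sift_down(a, i, n);
    for (i = n; i-- > 1; ) {
        int t = a[i]; a[i] = a[0]; a[0] = t;
        sift_down(a, 0, i);
    }
}

static void
intro_rec(int *a, size_t n, int depth)
{
    size_t i, j, m;
    int pivot, t;

    while (n > 16) {
        if (depth-- <= 0) {
            /* past the depth budget: some sort of perverse data set */
            heapsort_ints(a, n);
            return;
        }
        /* median-of-three pivot */
        m = n / 2;
        if (a[m] < a[0])     { t = a[m];   a[m] = a[0];     a[0] = t; }
        if (a[n-1] < a[0])   { t = a[n-1]; a[n-1] = a[0];   a[0] = t; }
        if (a[n-1] < a[m])   { t = a[n-1]; a[n-1] = a[m];   a[m] = t; }
        pivot = a[m];

        /* Hoare partition */
        i = 0;
        j = n - 1;
        for (;;) {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i >= j)
                break;
            t = a[i]; a[i] = a[j]; a[j] = t;
            i++; j--;
        }
        /* recurse into the smaller side, loop on the larger */
        if (j + 1 <= n - (j + 1)) {
            intro_rec(a, j + 1, depth);
            a += j + 1;
            n -= j + 1;
        } else {
            intro_rec(a + j + 1, n - (j + 1), depth);
            n = j + 1;
        }
    }
    /* insertion sort for small partitions */
    for (i = 1; i < n; i++) {
        int v = a[i];
        for (j = i; j > 0 && a[j-1] > v; j--)
            a[j] = a[j-1];
        a[j] = v;
    }
}

void
intro_sort(int *a, size_t n)
{
    int depth = 0;
    size_t m;

    for (m = n; m > 1; m >>= 1)
        depth += 2;             /* budget of roughly 2*log2(n) levels */
    intro_rec(a, n, depth);
}
```

The quicksort path carries the usual average-case constant factor, while the heapsort fallback caps the worst case at O(n*log(n)).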
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Bruce Momjian
Date:
On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
> I seem to recall that a year or two back some study was done on
> quicksort methodology as used in PostgreSQL.  As I recall, the
> algorithm used in PostgreSQL fared well in the tests.

Well, that's good to hear.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Peter Geoghegan
Date:
On 8 March 2013 11:48, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
>> I seem to recall that a year or two back some study was done on
>> quicksort methodology as used in PostgreSQL.  As I recall, the
>> algorithm used in PostgreSQL fared well in the tests.
>
> Well, that's good to hear.

I wouldn't mind taking a look at that.  We're not the only ones that use (more or less) that same algorithm.  I noticed that Redis does so too, just for example (though they didn't get wise to the problems with the "swap_cnt" pessimisation).

-- 
Regards,
Peter Geoghegan
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
-----Original Message-----
From: Peter Geoghegan [mailto:peter.geoghegan86@gmail.com]
Sent: Friday, March 08, 2013 12:00 PM
To: Bruce Momjian
Cc: Dann Corbit; Robert Haas; Tom Lane; PG Hackers
Subject: Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?

On 8 March 2013 11:48, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
>> I seem to recall that a year or two back some study was done on
>> quicksort methodology as used in PostgreSQL.  As I recall, the
>> algorithm used in PostgreSQL fared well in the tests.
>
> Well, that's good to hear.

I wouldn't mind taking a look at that.  We're not the only ones that use (more or less) that same algorithm.  I noticed that Redis does so too, just for example (though they didn't get wise to the problems with the "swap_cnt" pessimisation).

>>
Turns out, it was a lot longer ago than I thought (2005!).  The thread was called "Which qsort is used", but I also seem to recall that the topic was revisited a few times.

See, for instance:
http://www.postgresql.org/message-id/Pine.LNX.4.58.0512121138080.18520@eon.cs
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
<<
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Greg Stark
Date:
On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Checking for pre-sorted input will not make the routine faster on average.
> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12 operations when the bad condition does occur.
> Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but you sure are glad you have it when an accident occurs.
> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.

Well, pre-sorted inputs are not the only quadratic case.  If we really wanted to eliminate quadratic cases we could implement the pivot choice algorithm that guarantees n*log(n) worst-case.  But that will increase the average run-time.  If we're not doing that then I think your whole argument falls apart.  We do care about the average case as well as the worst-case.

There's been a *ton* of research on sorting.  I find it hard to believe there isn't a pretty solid consensus on which of these defense mechanisms is the best trade-off.

-- 
greg
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
>-----Original Message-----
>From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
>Sent: Friday, March 08, 2013 4:59 PM
>To: Dann Corbit
>Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
>Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> Checking for pre-sorted input will not make the routine faster on average.
>> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12 operations when the bad condition does occur.
>> Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but you sure are glad you have it when an accident occurs.
>> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.
>
>Well, pre-sorted inputs are not the only quadratic case.  If we really wanted to eliminate quadratic cases we could implement the pivot choice algorithm that guarantees n*log(n) worst-case.

There are three things that give qsort fits:
1.  Large ascending partitions.
2.  Large descending partitions.
3.  Bad choice of the median.

If you examine the link in the previous article I gave:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
It has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/

In that folder is a file called qsortpdq.c.  In file qsortpdq.c is a function:
void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *));
which is an interface that checks for ascending partitions, descending partitions, and does a median of 3 choice for the pivot.

Now, even this routine will have very rare inputs that can cause it to go quadratic.  The probability of one of these particular inputs is very low.
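The three guards listed above can be sketched like so.  This is illustrative only, specialised to ints; `prepass` and `med3` are hypothetical names, not the actual contents of qsortpdq.c (which is generic and lives at the link above).

```c
#include <stddef.h>

/*
 * Sketch of the three guards described above, specialised to ints.
 * Returns nonzero if the partition was fully handled by guard 1 or 2
 * and needs no further sorting.  Illustrative names, not qsortpdq.c.
 */
static int
prepass(int *a, size_t n)
{
    size_t i;

    if (n < 2)
        return 1;
    /* guard 1: large ascending partition - already sorted */
    for (i = 1; i < n && a[i - 1] <= a[i]; i++)
        ;
    if (i == n)
        return 1;
    /* guard 2: large descending partition - reverse it in O(n) */
    for (i = 1; i < n && a[i - 1] >= a[i]; i++)
        ;
    if (i == n) {
        size_t lo = 0, hi = n - 1;
        while (lo < hi) {
            int t = a[lo];
            a[lo++] = a[hi];
            a[hi--] = t;
        }
        return 1;
    }
    return 0;                   /* fall through to quicksort proper */
}

/* guard 3: median-of-3 pivot choice for the quicksort proper */
static int
med3(int x, int y, int z)
{
    return x < y ? (y < z ? y : (x < z ? z : x))
                 : (y > z ? y : (x > z ? z : x));
}
```

A quicksort would run `prepass` on a partition and, if it returns 0, pick its pivot with `med3` and partition as usual.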
If you want 100% guarantee against quadratic behavior, then qsortpdq can be wrapped in heapsort with a recursion depth check.  That sort would never go quadratic.  That method is called introspective sort.

Here are some links that describe introspective sort:
http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html
And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps

There is no such thing as a quicksort that never goes quadratic.  It was formally proven.  I cannot find the proof that I once read, but the article "A Killer Adversary for Quicksort" by M. D. McIlroy answers pretty nearly.  Here is the summary from that paper:

SUMMARY
Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of items compared.  The technique is illustrated by a specific adversary for the standard C qsort function.  The general method works against any implementation of quicksort - even a randomizing one - that satisfies certain very mild and realistic assumptions.

>But that will increase the average run-time.  If we're not doing that then I think your whole argument falls apart.  We do care about the average case as well as the worst-case.

The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
will build the sort system and test it.

The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
will build the simpler sort system and test it.

I would suggest, in addition, the use of this routine to create a tough data set for robustness:
http://www.cs.dartmouth.edu/~doug/aqsort.c

>There's been a *ton* of research on sorting.  I find it hard to believe there isn't a pretty solid consensus on which of these defense mechanisms is the best trade-off.

I'm partial to chapter 13 of "C Unleashed" in that regard.  As far as the trade-offs go, the problem is that they really are trade-offs.
However, the cost is quite small (benchmark and see) and the risk of not performing the checks is enormous.  Mostly sorted data happens all the time in real life, as do other perverse distributions that cause problems for sorting algorithms.

My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth check as well).  That's what we do here (I work at a database tools company).
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Greg Stark
Date:
On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

-- 
greg
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

>>
No.  It does make O(n*n) far less probable, but it does not eliminate it.  If it were possible, then introspective sort would be totally without purpose.  And yet introspective sort is the algorithm of choice for the ANSI/ISO C++ standard directly because it will never go quadratic.

Bentley and McIlroy's "Engineering a Sort Function" says this about their final qsort source code model chosen for their implementation in the footnote:

"Of course, quadratic behavior is still possible.  One can generate fiendish inputs by bugging Quicksort: Consider key values to be unknown initially.  In the code for selecting a partition element, assign values in increasing order as unknown keys are encountered.  In the partitioning code, make unknown keys compare high."

Interestingly, some standard library qsort routines will go quadratic if simply given a set that contains random 0 and 1 values for the input set.

It has been formally proven that no matter what method is used to choose the median (other than calculating the exact median of every partition, which would make quicksort slower than heapsort) there exists an "adversary" distribution that makes quicksort go quadratic.  (Unfortunately, I can't find the proof or I would give you the citation.)  Median selection that is randomized does not select the true median unless it operates over all the elements of the entire partition.
With some small probability, it makes the worst possible choice for the median.  It also does not matter if you choose the first, last and middle elements for a median of three (or other evenly selected points for larger sample sets) or if you select the sample with randomization.  Both methods have adversaries.  Now, the quickselect algorithm can select the median in O(n) but that is again an average.  There are also adversary distributions for quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable {even probabilities approaching zero}.  There is a tradeoff with more and more complicated safeguards.  If I choose large subsets of a partition and calculate the true median of the subset, it will make the routine safer from quadratic behavior but it will also make it slower.  At some point, you end up with a routine no faster than heapsort, while also being much more complex and therefore more difficult to maintain.  Hence, the necessity of introspective sort.

On the other hand, there are also data specific sorts that have very special properties.  There are cache conscious versions of burstsort that can sort strings much faster than quicksort or even radix sort according to measurements.  There is a special version of LSD radix sort that will sort an array in one counting pass and N distribution passes, where N is the number of radix-sized digits in the key.  So, for instance, if you are sorting 4 byte integers and your radix is 256 {one byte} then you can sort in 5 passes.  One pass to count the bytes in each integer by position, and 4 passes to distribute the data.  This can also be done in place.  MSD radix sort can also be used as an outer sort container which sorts by distribution until the partitions are small enough and then switches to introspective sort (which eventually switches to insertion sort).  Radix sorts can also be made totally generic via callback routines like quicksort.
Instead of a comparison operator, the callback gets a request for the radix bits for a specific bin number and then returns the radix count bits for that specific bin.  For unsigned char and radix 256, this is trivially character [i] from the string for bin [i], for example.

I guess improving sort routines all boils down to how much energy you want to put into it.  Donald Knuth once stated that according to measurement, about 40% of business related mainframe computer cycles are involved in ordering data.  So it is clearly an important problem.
<<
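The LSD radix sort variant described above (one counting pass that histograms every byte position at once, then one distribution pass per byte) can be sketched for uint32_t as follows.  This uses a scratch buffer rather than the in-place variant mentioned in the mail, and the names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of the LSD radix sort variant described above, for uint32_t:
 * one counting pass builds histograms for all four byte positions,
 * then four stable distribution passes.  'scratch' must hold n items.
 */
static void
radix_sort_u32(uint32_t *a, size_t n, uint32_t *scratch)
{
    size_t count[4][256] = {{0}};
    size_t i, pos;
    uint32_t *src = a, *dst = scratch;

    /* single counting pass over all byte positions */
    for (i = 0; i < n; i++)
        for (pos = 0; pos < 4; pos++)
            count[pos][(a[i] >> (8 * pos)) & 0xFF]++;

    for (pos = 0; pos < 4; pos++) {
        /* turn counts into starting offsets (prefix sums) */
        size_t offs[256], sum = 0, b;

        for (b = 0; b < 256; b++) {
            offs[b] = sum;
            sum += count[pos][b];
        }
        /* stable distribution pass on byte 'pos' */
        for (i = 0; i < n; i++)
            dst[offs[(src[i] >> (8 * pos)) & 0xFF]++] = src[i];
        /* ping-pong between the two buffers */
        { uint32_t *t = src; src = dst; dst = t; }
    }
    /* four passes = even number of swaps, so the result is back in a */
}
```

Because the pass count is even, the sorted data ends up back in the caller's array; an odd number of digit passes would need a final copy.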
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
"A Machine-Checked Proof of the Average-Case Complexity of Quicksort in Coq" By Eelis van der Weegen and James McKinna Institute for Computing and Information Sciences Radboud University Nijmegen Heijendaalseweg 135, 6525 AJ Nijmegen, The Netherlands Contains a formal proof, validated by machine logic, that quicksort is quadratic in the worst case and also a formal proofof the average runtime efficiency.
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Greg Stark
Date:
On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> Median of medians selection of the pivot gives you O(n*log(n)).
>
> No.  It does make O(n*n) far less probable, but it does not eliminate it.  If it were possible, then introspective sort would be totally without purpose.

No really, quicksort with median of medians pivot selection is most definitely O(n*log(n)) worst case.  This is textbook stuff.  In fact even the introspective sort paper mentions it as one of the options to fail over to if the partition size isn't decreasing rapidly enough.

The problem is that median of medians is O(n) rather than O(1).  That doesn't change the O() growth rate since there will be log(n) iterations.  But it means it contributes to the constant factor and the end result ends up being a constant factor larger than heap sort or merge sort.

That also explains how your reference on the quicksort adversary doesn't apply.  It works by ordering elements you haven't compared yet and assumes that all but O(1) elements will still be eligible for reordering.

In any case I think you're basically right.  What we have is basically a broken introspective sort that does more work than necessary and then handles fewer cases than it should.  Implementing a better introspection that detects all perverse cases and does so with a lower overhead than the current check is a fine idea.

-- 
greg
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of medians median selection algorithm (which it definitely isn't).

I was not aware of a true linear time selection of the median algorithm (which is what median of medians accomplishes).  The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average.

I think that your analysis is correct, at any rate.

I also think I will enjoy learning and experimenting with the median of medians algorithm.  I found a link about it here:
http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm

-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 1:21 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> Median of medians selection of the pivot gives you O(n*log(n)).
>
> No.  It does make O(n*n) far less probable, but it does not eliminate it.  If it were possible, then introspective sort would be totally without purpose.

No really, quicksort with median of medians pivot selection is most definitely O(n*log(n)) worst case.  This is textbook stuff.  In fact even the introspective sort paper mentions it as one of the options to fail over to if the partition size isn't decreasing rapidly enough.

The problem is that median of medians is O(n) rather than O(1).  That doesn't change the O() growth rate since there will be log(n) iterations.  But it means it contributes to the constant factor and the end result ends up being a constant factor larger than heap sort or merge sort.  That also explains how your reference on the quicksort adversary doesn't apply.
It works by ordering elements you haven't compared yet and assumes that all but O(1) elements will still be eligible for reordering.

In any case I think you're basically right.  What we have is basically a broken introspective sort that does more work than necessary and then handles fewer cases than it should.  Implementing a better introspection that detects all perverse cases and does so with a lower overhead than the current check is a fine idea.

-- 
greg
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Greg Stark
Date:
On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of medians median selection algorithm (which it definitely isn't).
> I was not aware of a true linear time selection of the median algorithm (which is what median of medians accomplishes).  The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average.
> I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page.  I was referring to the recursive median of 5s used as the pivot selection as "median of medians".  And I still called Quicksort or Quickselect using that pivot Quicksort or Quickselect with that specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect (Median of Medians on Wikipedia) is O(n).  But the constant factor becomes larger than if the pivot choice algorithm is O(1).  I suppose it's more interesting in the case of Quickselect since there's no other alternative algorithms that could be used that have better constant factors whereas for sorting we have other options.

I wonder if it makes sense to use timsort as the fallback for quicksort if the partition sizes are skewed.  Timsort is specifically designed to handle presorted inputs well.  On the other hand it is itself a hybrid sort so it might be getting overly complex to make it part of a hybrid algorithm.

-- 
greg
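The textbook construction Greg describes (the recursive median-of-5s of Blum, Floyd, Pratt, Rivest and Tarjan, feeding a quickselect) can be sketched as a selection routine like this.  It is specialised to ints with illustrative names, and uses a three-way partition so that duplicate-heavy inputs stay linear.

```c
#include <stddef.h>

static int select_kth(int *a, size_t n, size_t k);

static void
insertion_sort(int *a, size_t n)
{
    size_t i, j;

    for (i = 1; i < n; i++) {
        int v = a[i];
        for (j = i; j > 0 && a[j - 1] > v; j--)
            a[j] = a[j - 1];
        a[j] = v;
    }
}

/* Gather each 5-element group's median at the front, then recurse. */
static int
median_of_medians(int *a, size_t n)
{
    size_t i, g = 0;
    int t;

    for (i = 0; i + 5 <= n; i += 5, g++) {
        insertion_sort(a + i, 5);
        t = a[g]; a[g] = a[i + 2]; a[i + 2] = t;
    }
    if (i < n) {                /* leftover group of n % 5 elements */
        insertion_sort(a + i, n - i);
        t = a[g]; a[g] = a[i + (n - i) / 2]; a[i + (n - i) / 2] = t;
        g++;
    }
    return select_kth(a, g, g / 2);     /* median of the medians */
}

/* Return the k-th smallest (0-based); permutes a[]. */
static int
select_kth(int *a, size_t n, size_t k)
{
    for (;;) {
        int pivot;
        size_t lt, i, gt;

        if (n <= 10) {
            insertion_sort(a, n);
            return a[k];
        }
        pivot = median_of_medians(a, n);
        /* three-way partition: [< pivot][== pivot][> pivot] */
        lt = 0; i = 0; gt = n;
        while (i < gt) {
            if (a[i] < pivot) {
                int t = a[lt]; a[lt++] = a[i]; a[i++] = t;
            } else if (a[i] > pivot) {
                int t = a[--gt]; a[gt] = a[i]; a[i] = t;
            } else
                i++;
        }
        if (k < lt)
            n = lt;                     /* answer is left of the pivots */
        else if (k < gt)
            return pivot;               /* k lands in the pivot run */
        else {
            a += gt; k -= gt; n -= gt;  /* answer is right of the pivots */
        }
    }
}
```

Using `select_kth(a, n, n / 2)` to pick the quicksort pivot is what yields the O(n*log(n)) worst-case bound at the price of a larger constant factor, per Greg's point above.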
Re: Why do we still perform a check for pre-sorted input within qsort variants?
From
Dann Corbit
Date:
-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 5:16 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of medians median selection algorithm (which it definitely isn't).
> I was not aware of a true linear time selection of the median algorithm (which is what median of medians accomplishes).  The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average.
> I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page.  I was referring to the recursive median of 5s used as the pivot selection as "median of medians".  And I still called Quicksort or Quickselect using that pivot Quicksort or Quickselect with that specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect (Median of Medians on Wikipedia) is O(n).  But the constant factor becomes larger than if the pivot choice algorithm is O(1).  I suppose it's more interesting in the case of Quickselect since there's no other alternative algorithms that could be used that have better constant factors whereas for sorting we have other options.

I wonder if it makes sense to use timsort as the fallback for quicksort if the partition sizes are skewed.  Timsort is specifically designed to handle presorted inputs well.  On the other hand it is itself a hybrid sort so it might be getting overly complex to make it part of a hybrid algorithm.

--

>>
My opinion (and it is no better than anyone else's) is that for a database you have to be very defensive in programming.
Since database systems can contain any possible distribution (and over time they likely will encounter almost every possibility) it is important to prevent unusual inputs from causing disaster.

The reason we added introspection to the sort here is that we already had quicksort with ordered partition check along with a median of three sample from three medians of three (not to mention the standard recursion of the smallest partition first and switch to insertion at small enough partition size).  Sure enough, there was a customer who had a data distribution that caused bad behavior.

We have customers with mainframe data where a single table can be 24 GB (I know this offhand, there are sure to be some that are larger), so a bad behavior on a sort *will* be noticed.

Sorry about the oddball quoting.  I will look at fixing it so that my posts are not so hard to grok.
<<