Re: Why do we still perform a check for pre-sorted input within qsort variants? - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: Why do we still perform a check for pre-sorted input within qsort variants? |
Date | |
Msg-id | 87F42982BF2B434F831FCEF4C45FC33E5BD367BB@EXCHANGE.corporate.connx.com Whole thread Raw |
In response to | Re: Why do we still perform a check for pre-sorted input within qsort variants? (Greg Stark <stark@mit.edu>) |
Responses |
Re: Why do we still perform a check for pre-sorted input within qsort variants?
|
List | pgsql-hackers |
----Original Message----- >From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark >Sent: Friday, March 08, 2013 4:59 PM >To: Dann Corbit >Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers >Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants? > >On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote: >> Checking for pre-sorted input will not make the routine faster on average. >> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12thoperations when the bad condition does occur. >> Checking for pre-sorted data can be thought of as an insurance policy. It's kind of a pain to pay the premiums, but yousure are glad you have it when an accident occurs. >> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant. > >Well pre-sorted inputs are not the only quadratic case. If we really wanted to eliminate quadratic cases we could implementthe pivot choice algorithm that guarantees n*log(n) worst-case. There are three things that give qsort fits: 1. Large ascending partitions. 2. Large descending partitions. 3. Bad choice of the median. If you examine the link in the previous article I gave: http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html It has a subfolder located by following this link: http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/ In that folder is a file called qsortpdq.c In file qsortpdq.c is a function: void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *)); which is an interface that checks for ascending partitions, descending partitions, and does a median of 3 choice for thepivot. Now, even this routine will have very rare inputs that can cause it to go quadratic. The probability of one of these particularinputs is very low. If you want 100% guarantee against quadratic behavior, then qsortpdq can be wrapped in heapsort with a recursion depth check. That sort would never go quadratic. That method is called introspective sort. Here are some links that describe introspective sort: http://en.wikipedia.org/wiki/Introsort https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html And this is the real thing, right from the horse's mouth: http://www.cs.rpi.edu/~musser/gp/introsort.ps There is no such thing as a quicksort that never goes quadratic. It was formally proven. I cannot find the proof that Ionce read, but the article "A Killer Adversary for Quicksort" by M. D. MCILROY answers pretty nearly. Here is the summary from that paper: SUMMARY Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of items compared. The technique is illustrated by a specific adversary for the standard C qsort function. The generalmethod works against any implementation of quicksort-even a randomizing one-that satisfies certain very mild and realistic assumptions. >But that will increase the average run-time. If we're not doing that then I think your whole argument falls apart. We docare about the average case as well as the worst-case. The makefile in folder: http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/ will build the sort system and test it. The makefile in folder: http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html will build the simpler sort system and test it. I would suggest, in addition, the use of this routine to create a tough data set for robustness: http://www.cs.dartmouth.edu/~doug/aqsort.c >There's been a *ton* of research on sorting. I find it hard to believe there isn't a pretty solid consensus on which ofthese defense mechanisms is the best trade-off. I'm partial to chapter 13 of "C Unleashed" in that regard. As far as the trade-offs to, the problem is that they reallyare trade-offs. However, the cost is quite small (benchmark and see) and the risk of not performing the checks isenormous. Mostly sorted data happens all the time in real life, as do other perverse distributions that cause problemsfor sorting algorithms.to believe there isn't a pretty solid consensus on which of these defense mechanisms is thebest trade-off. My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth checkas well). That's what we do here (I work at a database tools company).
pgsql-hackers by date: