Re: Why do we still perform a check for pre-sorted input within qsort variants? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date
Msg-id 87F42982BF2B434F831FCEF4C45FC33E5BD367BB@EXCHANGE.corporate.connx.com
Whole thread Raw
In response to Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Greg Stark <stark@mit.edu>)
Responses Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
-----Original Message-----
>From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
>Sent: Friday, March 08, 2013 4:59 PM
>To: Dann Corbit
>Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
>Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> Checking for pre-sorted input will not make the routine faster on average.
>> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take K*10^12 operations when the bad condition does occur.
>> Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but you sure are glad you have it when an accident occurs.
>> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.
>
>Well pre-sorted inputs are not the only quadratic case. If we really wanted to eliminate quadratic cases we could implement the pivot choice algorithm that guarantees n*log(n) worst-case.

There are three things that give qsort fits:
1.  Large ascending partitions.
2.  Large descending partitions.
3.  Bad choice of the median.
If you examine the link I gave in the previous article:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
It has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
In that folder is a file called qsortpdq.c
In file qsortpdq.c is a function:
void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *));
which is an interface that checks for ascending partitions, checks for descending partitions, and does a median-of-3 choice for the pivot.
Now, even this routine will have very rare inputs that can cause it to go quadratic.  The probability of one of these particular inputs is very low.
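
To make the shape of those checks concrete, here is a minimal sketch of the idea (hypothetical helper names; this is not the qsortpdq.c source): an O(n) scan for an already ascending run, an O(n) scan for a descending run that the caller can simply reverse in place, and a median-of-3 pivot choice over the first, middle, and last elements.

#include <stddef.h>

/* Returns 1 if a[0..n-1] is already in ascending order.  O(n). */
static int is_ascending(const char *a, size_t n, size_t es,
                        int (*cmp) (const void *, const void *))
{
    size_t i;
    for (i = 1; i < n; i++)
        if (cmp(a + (i - 1) * es, a + i * es) > 0)
            return 0;
    return 1;
}

/* Returns 1 if a[0..n-1] is in descending (non-ascending) order; the
 * caller can then reverse the partition in place in O(n). */
static int is_descending(const char *a, size_t n, size_t es,
                         int (*cmp) (const void *, const void *))
{
    size_t i;
    for (i = 1; i < n; i++)
        if (cmp(a + (i - 1) * es, a + i * es) < 0)
            return 0;
    return 1;
}

/* Median-of-3: returns a pointer to the median of the first, middle,
 * and last elements, to be used as the pivot. */
static const char *median3(const char *a, size_t n, size_t es,
                           int (*cmp) (const void *, const void *))
{
    const char *lo = a;
    const char *mid = a + (n / 2) * es;
    const char *hi = a + (n - 1) * es;

    if (cmp(lo, mid) > 0) { const char *t = lo; lo = mid; mid = t; }
    if (cmp(mid, hi) > 0)
        mid = (cmp(lo, hi) > 0) ? lo : hi;
    return mid;
}

The two scans cost at most one extra pass over the partition, which is the "insurance premium" described above.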

If you want a 100% guarantee against quadratic behavior, then qsortpdq can be wrapped with a recursion depth check that falls back to heapsort.  That sort would never go quadratic.  That method is called introspective sort.
Here are some links that describe introspective sort:
http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html
And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps
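
To show what the depth check looks like, here is a small self-contained sketch on plain int arrays (made-up names and a deliberately naive Lomuto partition; this is not PostgreSQL code or anyone's production qsort): give the quicksort a recursion budget of roughly 2*log2(n) levels and switch to heapsort for any partition that exhausts it.

#include <stddef.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Restore the max-heap property below `root` in a heap of size n. */
static void sift_down(int *a, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n)
            break;
        if (child + 1 < n && a[child] < a[child + 1])
            child++;
        if (a[root] >= a[child])
            break;
        swap_int(&a[root], &a[child]);
        root = child;
    }
}

/* Textbook heapsort: O(n*log(n)) in the worst case, used as the fallback. */
static void heapsort_int(int *a, size_t n)
{
    size_t start, end;
    if (n < 2)
        return;
    for (start = n / 2; start-- > 0;)
        sift_down(a, start, n);
    for (end = n - 1; end > 0; end--) {
        swap_int(&a[0], &a[end]);
        sift_down(a, 0, end);
    }
}

/* Naive Lomuto partition around the last element; quadratic-prone on
 * purpose, to show that the depth check alone bounds the damage. */
static size_t partition_int(int *a, size_t n)
{
    int pivot = a[n - 1];
    size_t i = 0, j;
    for (j = 0; j + 1 < n; j++)
        if (a[j] < pivot)
            swap_int(&a[i++], &a[j]);
    swap_int(&a[i], &a[n - 1]);
    return i;
}

static void introsort_rec(int *a, size_t n, int depth_limit)
{
    size_t p;
    if (n < 2)
        return;
    if (depth_limit == 0) {          /* recursion too deep: bail out to heapsort */
        heapsort_int(a, n);
        return;
    }
    p = partition_int(a, n);
    introsort_rec(a, p, depth_limit - 1);
    introsort_rec(a + p + 1, n - p - 1, depth_limit - 1);
}

void introsort_int(int *a, size_t n)
{
    int depth_limit = 0;
    size_t m;
    for (m = n; m > 1; m >>= 1)      /* roughly 2 * floor(log2(n)) */
        depth_limit += 2;
    introsort_rec(a, n, depth_limit);
}

The heapsort fallback is what makes the n*log(n) bound unconditional; the quicksort path still handles typical inputs with its usual constant factors.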

There is no such thing as a quicksort that never goes quadratic.  This has been formally proven.  I cannot find the proof that I once read, but the article "A Killer Adversary for Quicksort" by M. D. McIlroy comes pretty close.

Here is the summary from that paper:

SUMMARY
Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of
items compared. The technique is illustrated by a specific adversary for the standard C qsort function.
The general method works against any implementation of quicksort, even a randomizing one, that satisfies
certain very mild and realistic assumptions.

>But that will increase the average run-time. If we're not doing that then I think your whole argument falls apart. We do care about the average case as well as the worst-case.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
will build the sort system and test it.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
will build the simpler sort system and test it.

I would suggest, in addition, the use of this routine to create a tough data set for robustness:
http://www.cs.dartmouth.edu/~doug/aqsort.c
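
The gist of that adversary, sketched below from the summary quoted above (hypothetical names; see the real aqsort.c at the link for the genuine article), is that every key starts out as undecided "gas"; whenever the sort compares two gas keys, one of them is frozen to the next small solid value, so whatever pivot the sort just chose keeps looking like a poor one, and the comparator always answers from the values committed so far.

#include <stdio.h>
#include <stdlib.h>

static int *val;        /* committed value of each key, or `gas` */
static int  gas;        /* sentinel larger than every committed value */
static int  nsolid;     /* next solid value to hand out */
static int  candidate;  /* gas key most recently seen in a comparison */
static long ncmp;       /* comparisons performed so far */

/* Comparator that builds the worst case on the fly. */
static int adversary_cmp(const void *px, const void *py)
{
    int x = *(const int *) px;
    int y = *(const int *) py;

    ncmp++;
    if (val[x] == gas && val[y] == gas)
        val[x == candidate ? x : y] = nsolid++;   /* freeze one of them */
    if (val[x] == gas)
        candidate = x;
    else if (val[y] == gas)
        candidate = y;
    return val[x] - val[y];
}

/* Run qsort against the adversary on n keys and report the comparison count. */
static long run_adversary(int n)
{
    int i;
    int *keys = malloc(n * sizeof *keys);

    val = malloc(n * sizeof *val);
    if (keys == NULL || val == NULL)
        abort();
    gas = n;                /* larger than any value we will ever freeze */
    nsolid = candidate = 0;
    ncmp = 0;
    for (i = 0; i < n; i++) {
        keys[i] = i;        /* qsort permutes these indices */
        val[i] = gas;
    }
    qsort(keys, n, sizeof *keys, adversary_cmp);
    free(keys);
    free(val);
    return ncmp;
}

int main(void)
{
    int n;
    for (n = 1000; n <= 16000; n *= 2)
        printf("n=%6d  comparisons=%ld\n", n, run_adversary(n));
    return 0;
}

Whether a given qsort implementation actually goes quadratic against this depends on how it is written (some libraries use a mergesort internally), which is exactly what makes it a useful robustness test.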

>There's been a *ton* of research on sorting. I find it hard to believe there isn't a pretty solid consensus on which of these defense mechanisms is the best trade-off.
I'm partial to chapter 13 of "C Unleashed" in that regard.  As far as the trade-offs go, the problem is that they really are trade-offs.  However, the cost is quite small (benchmark and see) and the risk of not performing the checks is enormous.  Mostly sorted data happens all the time in real life, as do other perverse distributions that cause problems for sorting algorithms.

My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth check as well).
That's what we do here (I work at a database tools company).



