Re: Why do we still perform a check for pre-sorted input within qsort variants? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date
Msg-id 87F42982BF2B434F831FCEF4C45FC33E5BD369DF@EXCHANGE.corporate.connx.com
Whole thread Raw
In response to Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Greg Stark <stark@mit.edu>)
Responses Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Dann Corbit <DCorbit@connx.com>)
Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

>>
No.  It does make O(n*n)  far less probable, but it does not eliminate it.  If it were possible, then introspective
sortwould be totally without purpose. And yet introspective sort is the algorithm of choice for the ANSI/ISO C++
standarddirectly because it will never go quadratic. 
Bentley and Mcilroy's  "Engineering a Sort Function" says this about their final qsort source code model chosen for
theirimplementation in the footnote: 
"Of course, quadratic behavior is still possible. One can generate fiendish inputs by bugging Quicksort: Consider
key values to be unknown initially. In the code for selecting a partition element, assign values in increasing
order as unknown keys are encountered. In the partitioning code, make unknown keys compare high."

Interestingly, some standard library qsort routines will go quadratic if simply given a set that contains random 0 and
1values for the input set. 

It has been formally proven that no matter what method used to choose the median (other than calculating the exact
medianof every partition which would make quicksort slower than heapsort) there exists an "adversary" distribution that
makesquicksort go quadratic. (unfortunately, I can't find the proof or I would give you the citation).  Median
selectionthat is randomized does not select the true median unless it operates over all the elements of the entire
partition. With some small probability, it makes the worst possible choice for the median.  It also does not matter if
youchoose the first, last and middle elements for a median of three (or other evenly selected points for larger sample
sets)or if you select the sample with randomization.  Both methods have adversaries.  Now, the quickselect algorithm
canselect the median in O(n) but that is again an average.  There are also adversary distributions for quickselect. 

On the other hand, various safeguards can make O(n*n) extremely improbable {even probabilities approaching zero}.
Thereis a tradeoff with more and more complicated safeguards.  If I choose large subsets of a partition and calculate
thetrue median of the subset, it will make the routine safer from quadratic behavior but it will also make it slower.
Atsome point, you end up with a routine no faster than heapsort, while also being much more complex and therefore more
difficultto maintain.  Hence, the necessity of introspective sort. 

On the other hand, there are also data specific sorts that have very special properties.  There are cache conscious
versionsof burstsort that can sort strings much faster than quicksort or even radix sort according to measurements.
Thereis a special version of LSD radix sort that will sort an array in one counting pass and N distribution passes,
whereN is the radix.  So, for instance, if you are sorting 4 byte integers and your radix is 256 {one byte} then you
cansort in 5 passes.  One pass to count the bytes in each integer by position, and 4 passes to distribute the data.
Thiscan also be done in place.  MSD radix sort can also be used as an outer sort container which sorts by distribution
untilthe partitions are small enough and then switches to introspective sort (which eventually switches to insertion
sort). Radix sorts can also be made totally generic via callback routines like quicksort.  Instead of a comparison
operator,the callback gets request for the radix bits for a specific bin number and then returns the radix count bits
forthat specific bin.  For unsigned char and radix 256, this is trivially character [i] from the string for bin [i],
forexample. 

I guess improving sort routines all boils down to how much energy you want to put into it.  Donald Knuth once stated
thataccording to measurement, about 40% of business related mainframe computer cycles are involved in ordering data.
Soit is clearly an important problem. 
<<



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Support for REINDEX CONCURRENTLY
Next
From: Dann Corbit
Date:
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?