Re: Why do we still perform a check for pre-sorted input within qsort variants? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date
Msg-id 87F42982BF2B434F831FCEF4C45FC33E5BD36AE2@EXCHANGE.corporate.connx.com
Whole thread Raw
In response to Re: Why do we still perform a check for pre-sorted input within qsort variants?  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 5:16 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of
mediansmedian selection algorithm (which it definitely isn't). 
> I was not aware of a true linear time selection of the median algorithm {which is what median of medians
accomplishes). The fastest median selection algorithm that I was aware of was quickselect, which is only linear on
average.
> I think that you analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page. I was referring to the recursive median of 5s used
asthe pivot selection as "median of medians". And I still called Quicksort or Quickselect using that pivot Quicksort or
Quickselectwith that specific pivot choice algorithm. 

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect (Median of Medians on Wikipedia) is O(n). But the
constantfactor becomes larger than if the pivot choice algorithm is O(1). I suppose it's more interesting in the case
ofQuickselect since there's no other alternative algorithms that could be used that have better constant factors
whereasfor sorting we have other options. 


I wonder if it makes sense to use timsort as the fallback for quicksort if the partition sizes are skewed. Timsort is
specificallydesigned to handle presorted inputs well.  On the other hand it is itself a hybrid sort so it might be
gettingoverly complex to make it part of a hybrid algorithm. 

--
>>
My opinion (and it is no better than anyone else's) is that for a database you have to be very defensive in
programming.
Since database systems can contain any possible distribution (and over time they likely will encounter almost every
possibility)it is important to prevent unusual inputs from causing disaster. 

The reason we added introspection to the sort here is that we already had quicksort with ordered partition check along
witha median of three sample from three medians of three (not to mention the standard recursion of the smallest
partitionfirst and switch to insertion at small enough partition size).  Sure enough, there was a customer who had a
datadistribution that caused bad behavior. 

We have customers with mainframe data where a single table can be 24 GB (I know this offhand, there are sure to be some
thatare larger), so a bad behavior on a sort *will* be noticed. 

Sorry about the oddball quoting.  I will look at fixing it so that my posts are not so hard to grok.
<<



pgsql-hackers by date:

Previous
From: Greg Stark
Date:
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
Next
From: Jonathan Rogers
Date:
Subject: Re: Btrfs clone WIP patch