Thread: Why do we still perform a check for pre-sorted input within qsort variants?

Why do we still perform a check for pre-sorted input within qsort variants?

From
Peter Geoghegan
Date:
In the past, Robert and I have criticised the fact that our qsort
implementation (and the various specialisations thereof) each perform
a check for pre-sorted input. This check does not appear in the
original NetBSD qsort that we lifted our implementation from, and
presumably isn't described by the document 'Qsort routine from Bentley
& McIlroy's "Engineering a Sort Function"' that that implementation is
based on. Note that the presorted check occurs on each and every
iteration, and not just the first, which was felt to be the really
objectionable point [1]. As noted within qsort_arg.c:
 *      Remove ill-considered "swap_cnt" switch to insertion sort,
 *      in favor of a simple check for presorted input.
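For reference, the check under discussion looks roughly like this. This is a simplified sketch over plain ints with a naive pivot, not the actual qsort_arg code; only the placement of the presorted scan at the top of each recursive call is the point:

```c
#include <stddef.h>

/*
 * Simplified sketch of the presorted check as it sits at the top of each
 * recursive call in PostgreSQL's qsort_arg.  Element type, pivot choice,
 * and the small-n insertion sort cutoff are all simplified away.
 */
static void
sketch_qsort(int *a, size_t n)
{
    size_t  i;
    int     presorted = 1;

    if (n < 2)
        return;

    /* The contested check: one linear scan on EVERY recursion level. */
    for (i = 1; i < n; i++)
    {
        if (a[i - 1] > a[i])
        {
            presorted = 0;
            break;
        }
    }
    if (presorted)
        return;

    /* Naive Lomuto partition around the middle element, for brevity;
     * the real code uses a med3-of-med3s ("ninther") pivot. */
    {
        size_t  mid = n / 2, store = 0;
        int     pivot = a[mid], tmp;

        tmp = a[mid]; a[mid] = a[n - 1]; a[n - 1] = tmp;
        for (i = 0; i < n - 1; i++)
        {
            if (a[i] < pivot)
            {
                tmp = a[i]; a[i] = a[store]; a[store] = tmp;
                store++;
            }
        }
        tmp = a[store]; a[store] = a[n - 1]; a[n - 1] = tmp;

        sketch_qsort(a, store);
        sketch_qsort(a + store + 1, n - store - 1);
    }
}
```

On fully sorted input the scan costs exactly n-1 comparisons and the function returns immediately; the complaint is that the same scan also runs, and usually fails, at every level of recursion on unsorted input.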

Looking at the archives from around the period that this was committed
(October 2006), there doesn't seem to have been any discussion of this
particular alteration. Our qsort implementation seems to have rather
bad worst-case performance [2] that I do not believe can be accounted
for by the usual ways in which quicksort is known to "go quadratic" -
we already take various measures against those that make them very
unlikely.

It looks like Tom was quite right about the swap_cnt optimisation
(it's ill-considered), because the NetBSD people recognised this fact
in 2009 [3], almost 3 years later. They didn't decide to do something
else instead, though.

I hope to avoid starting an epic thread on this at this point in the
cycle. I'd just like to enquire if it's something that's been given
any thought recently.

[1] http://www.postgresql.org/message-id/CA+TgmoaGCyGKUN1a6BfYoORbanw5pZZUFv-VASWBumD8T5fdEw@mail.gmail.com

[2] http://www.postgresql.org/message-id/CAEYLb_W++UhrcWprzG9TyBVF7Sn-c1s9oLbABvAvPGdeP2DFSQ@mail.gmail.com
(A contrived test-case sees qsort_arg perform 2,247,314 comparisons to
Timsort's 60,351)

[3]
http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/stdlib/qsort.c?rev=1.20&content-type=text/x-cvsweb-markup&only_with_tag=MAIN

-- 
Regards,
Peter Geoghegan



Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> In the past, Robert and I have criticised the fact that our qsort
> implementation (and the various specialisations thereof) each perform
> a check for pre-sorted input. This check does not appear in the
> original NetBSD qsort that we lifted our implementation from, and
> presumably isn't described by the document 'Qsort routine from Bentley
> & McIlroy's "Engineering a Sort Function"' that that implementation is
> based on.

FWIW, I've been suspicious of that pre-sorted check since the day it
went in.  Bentley was my faculty adviser for awhile in grad school,
and I know him to be *way* too smart to have missed anything as simple
as that.  But I didn't have hard evidence on which to object to it
at the time, and indeed testing seemed to say it was a good idea:
http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us

If you can provide a refutation I will gladly see it out of there.
        regards, tom lane



I wrote:
> FWIW, I've been suspicious of that pre-sorted check since the day it
> went in.  Bentley was my faculty adviser for awhile in grad school,
> and I know him to be *way* too smart to have missed anything as simple
> as that.  But I didn't have hard evidence on which to object to it
> at the time, and indeed testing seemed to say it was a good idea:
> http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us

BTW, after further review --- one thing that seems a little fishy is
that that test scaffolding made glibc's qsort look pretty good; which
was at variance with our previous experience, in which our version of
qsort seemed to dominate glibc's even before we took out the dubious
"swap_cnt" code, cf thread at
http://www.postgresql.org/message-id/Pine.LNX.4.58.0512121138080.18520@eon.cs
So there is definitely some room to argue that B&M's test scaffolding
doesn't match up with our real-world workloads.  But before tinkering
too much with that code, it'd be good to understand why not, and to
have a test case we trust more.
        regards, tom lane



Re: Why do we still perform a check for pre-sorted input within qsort variants?

From
Robert Haas
Date:
On Mon, Feb 25, 2013 at 5:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
>> In the past, Robert and I have criticised the fact that our qsort
>> implementation (and the various specialisations thereof) each perform
>> a check for pre-sorted input. This check does not appear in the
>> original NetBSD qsort that we lifted our implementation from, and
>> presumably isn't described by the document 'Qsort routine from Bentley
>> & McIlroy's "Engineering a Sort Function"' that that implementation is
>> based on.
>
> FWIW, I've been suspicious of that pre-sorted check since the day it
> went in.  Bentley was my faculty adviser for awhile in grad school,
> and I know him to be *way* too smart to have missed anything as simple
> as that.  But I didn't have hard evidence on which to object to it
> at the time, and indeed testing seemed to say it was a good idea:
> http://www.postgresql.org/message-id/18732.1142967137@sss.pgh.pa.us
>
> If you can provide a refutation I will gladly see it out of there.

The thing that bothers me about that check (which may be different
than the thing that bothers Peter) is that we do it recursively at
every level.  I have a sneaking suspicion that it only makes sense to
do that check for subpartitions above a certain size.  I'm perfectly
willing to grant that presorted data is a sufficiently important
special case that we ought to check for it once, and I also suspect
it's a good idea to check large partitions so as to speed up the
sorting of almost-ordered data.  But I'm skeptical that it makes sense
to check it at every recursive level.
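One hypothetical shape for that gating (CHECK_MIN_SIZE is an invented tuning constant, not anything from the PostgreSQL source):

```c
#include <stddef.h>

/*
 * Illustrative only: gate the presorted scan on partition size, per the
 * suggestion above.  CHECK_MIN_SIZE is a made-up tuning constant, not
 * anything from the PostgreSQL source.
 */
#define CHECK_MIN_SIZE 64

static int
maybe_check_presorted(const int *a, size_t n)
{
    size_t  i;

    if (n < CHECK_MIN_SIZE)
        return 0;               /* small partition: skip the scan */
    for (i = 1; i < n; i++)
    {
        if (a[i - 1] > a[i])
            return 0;           /* not sorted */
    }
    return 1;                   /* sorted: caller can return early */
}
```

A caller would invoke this at the top of each partition step and return early on 1; partitions below the threshold pay nothing, while large (almost-)ordered ranges still get caught.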

I did attempt to do some tinkering with this while I was playing with
it, but I didn't come up with anything really compelling.  You can
reduce the number of comparisons on particular workloads by tinkering
with the algorithm, but then somebody else ends up doing more
comparisons, so it's hard to say whether you've really made things
better.  Or at least I found it so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why do we still perform a check for pre-sorted input within qsort variants?

From
Peter Geoghegan
Date:
On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
> I did attempt to do some tinkering with this while I was playing with
> it, but I didn't come up with anything really compelling.  You can
> reduce the number of comparisons on particular workloads by tinkering
> with the algorithm, but then somebody else ends up doing more
> comparisons, so it's hard to say whether you've really made things
> better.  Or at least I found it so.

Right.

To be honest, the real reason that it bothers me is that everything
else that our qsort routine does that differs from classic quicksort
(mostly quadratic insurance, like the median-of-medians pivot
selection, but also the fallback to insertion sort when n < 7) is very
well supported by peer reviewed research. Like Tom, I find it
implausible that Sedgewick and others missed a trick, where we did
not, particularly with something so simple.


-- 
Regards,
Peter Geoghegan



On Mon, Feb 25, 2013 at 02:31:21PM +0000, Peter Geoghegan wrote:
> On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
> > I did attempt to do some tinkering with this while I was playing with
> > it, but I didn't come up with anything really compelling.  You can
> > reduce the number of comparisons on particular workloads by tinkering
> > with the algorithm, but then somebody else ends up doing more
> > comparisons, so it's hard to say whether you've really made things
> > better.  Or at least I found it so.
> 
> Right.
> 
> To be honest, the real reason that it bothers me is that everything
> else that our qsort routine does that differs from classic quicksort
> (mostly quadratic insurance, like the median-of-medians pivot
> selection, but also the fallback to insertion sort when n < 7) is very
> well supported by peer reviewed research. Like Tom, I find it
> implausible that Sedgewick and others missed a trick, where we did
> not, particularly with something so simple.

Perhaps we are more likely to be fed sorted data than a typical qsort
use case.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



>-----Original Message-----
>From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Bruce Momjian
>Sent: Friday, March 08, 2013 11:22 AM
>To: Peter Geoghegan
>Cc: Robert Haas; Tom Lane; PG Hackers
>Subject: Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Mon, Feb 25, 2013 at 02:31:21PM +0000, Peter Geoghegan wrote:
>> On 25 February 2013 11:49, Robert Haas <robertmhaas@gmail.com> wrote:
>> > I did attempt to do some tinkering with this while I was playing
>> > with it, but I didn't come up with anything really compelling.  You
>> > can reduce the number of comparisons on particular workloads by
>> > tinkering with the algorithm, but then somebody else ends up doing
>> > more comparisons, so it's hard to say whether you've really made
>> > things better.  Or at least I found it so.
>>
>> Right.
>>
>> To be honest, the real reason that it bothers me is that everything
>> else that our qsort routine does that differs from classic quicksort
>> (mostly quadratic insurance, like the median-of-medians pivot
>> selection, but also the fallback to insertion sort when n < 7) is very
>> well supported by peer reviewed research. Like Tom, I find it
>> implausible that Sedgewick and others missed a trick, where we did
>> not, particularly with something so simple.
>
>Perhaps we are more likely to be fed sorted data than a typical qsort use case.
>
Checking for pre-sorted input will not make the routine faster on average.
However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take
K*10^12 operations when the bad condition does occur.
Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but
you sure are glad you have it when an accident occurs.
Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.

Switching to insertion sort for small partitions is almost always a good idea.  This idea was invented by Richard
Singleton as ACM algorithm 347.

A different sort of safeguard, as compared to checking for pre-ordered data, is to watch recursion depth.  If we find
that we have already gone past a depth of log2(n) levels and we are not ready to shift to insertion sort, then there is
some sort of perverse data set.  A traditional fix is to switch to heap sort at this point.  Since heap sort is also
O(n*log(n)) {though the constant multiplier is larger than for quicksort on average} you never get O(n*n) behavior.
This method is called introspective sort.
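A minimal sketch of that depth guard, simplified to int arrays with a naive middle-element pivot; the constants and structure are illustrative rather than taken from any production introsort:

```c
#include <stddef.h>

/* Binary max-heap sift-down over a[start..end). */
static void
sift_down(int *a, size_t start, size_t end)
{
    size_t  root = start;

    while (root * 2 + 1 < end)
    {
        size_t  child = root * 2 + 1;
        int     tmp;

        if (child + 1 < end && a[child] < a[child + 1])
            child++;
        if (a[root] >= a[child])
            return;
        tmp = a[root]; a[root] = a[child]; a[child] = tmp;
        root = child;
    }
}

static void
heapsort_ints(int *a, size_t n)
{
    size_t  i;
    int     tmp;

    for (i = n / 2; i-- > 0;)
        sift_down(a, i, n);
    for (i = n; i-- > 1;)
    {
        tmp = a[0]; a[0] = a[i]; a[i] = tmp;
        sift_down(a, 0, i);
    }
}

static void
introsort_rec(int *a, size_t n, int depth_left)
{
    size_t  i, store, mid;
    int     pivot, tmp;

    if (n < 2)
        return;
    if (depth_left == 0)
    {
        heapsort_ints(a, n);    /* perverse input: bail out safely */
        return;
    }

    /* Naive middle-element pivot, Lomuto partition, for brevity. */
    mid = n / 2;
    tmp = a[mid]; a[mid] = a[n - 1]; a[n - 1] = tmp;
    pivot = a[n - 1];
    for (i = 0, store = 0; i < n - 1; i++)
    {
        if (a[i] < pivot)
        {
            tmp = a[i]; a[i] = a[store]; a[store] = tmp;
            store++;
        }
    }
    tmp = a[store]; a[store] = a[n - 1]; a[n - 1] = tmp;

    introsort_rec(a, store, depth_left - 1);
    introsort_rec(a + store + 1, n - store - 1, depth_left - 1);
}

static void
introsort(int *a, size_t n)
{
    int     depth = 0;
    size_t  m;

    for (m = n; m > 1; m >>= 1)
        depth += 2;             /* allow roughly 2*log2(n) levels */
    introsort_rec(a, n, depth);
}
```

Because the heapsort fallback is itself O(n*log(n)), the depth cap bounds the whole sort at O(n*log(n)) regardless of how the pivots behave.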

There are, of course, other clever things that can be tried.  The relaxed weak heapsort of Edelkamp and Stiegler, for
instance, or using tries as in "Cache-Efficient String Sorting Using Copying" by Ranjan Sinha.  There is also possibly
significant benefit to using radix sort for fixed-width data that can be easily binned.

I seem to recall that a year or two back some study was done on quicksort methodology as used in PostgreSQL.  As I
recall, the algorithm used in PostgreSQL fared well in the tests.





Re: Why do we still perform a check for pre-sorted input within qsort variants?

From
'Bruce Momjian'
Date:
On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
> I seem to recall that a year or two back some study was done on
> quicksort methodology as used in PostgreSQL.  As I recall, the
> algorithm used in PostgreSQL fared well in the tests.

Well, that's good to hear.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Why do we still perform a check for pre-sorted input within qsort variants?

From
Peter Geoghegan
Date:
On 8 March 2013 11:48, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
>> I seem to recall that a year or two back some study was done on
>> quicksort methodology as used in PostgreSQL.  As I recall, the
>> algorithm used in PostgreSQL fared well in the tests.
>
> Well, that's good to hear.

I wouldn't mind taking a look at that. We're not the only ones that
use (more or less) that same algorithm. I noticed that Redis does so
too, just for example (though they didn't get wise to the problems
with the "swap_cnt" pessimisation).

-- 
Regards,
Peter Geoghegan



-----Original Message-----
From: Peter Geoghegan [mailto:peter.geoghegan86@gmail.com]
Sent: Friday, March 08, 2013 12:00 PM
To: Bruce Momjian
Cc: Dann Corbit; Robert Haas; Tom Lane; PG Hackers
Subject: Re: [HACKERS] Why do we still perform a check for pre-sorted input within qsort variants?

On 8 March 2013 11:48, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 8, 2013 at 07:43:10PM +0000, Dann Corbit wrote:
>> I seem to recall that a year or two back some study was done on
>> quicksort methodology as used in PostgreSQL.  As I recall, the
>> algorithm used in PostgreSQL fared well in the tests.
>
> Well, that's good to hear.

I wouldn't mind taking a look at that. We're not the only ones that use (more or less) that same algorithm. I noticed
that Redis does so too, just for example (though they didn't get wise to the problems with the "swap_cnt"
pessimisation).

>>
Turns out, it was a lot longer ago than I thought (2005!).  The thread was called "Which qsort is used", but I also
seem to recall that the topic was revisited a few times.  See, for instance:

http://www.postgresql.org/message-id/Pine.LNX.4.58.0512121138080.18520@eon.cs
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html

<<



On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Checking for pre-sorted input will not make the routine faster on average.
> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take
> K*10^12 operations when the bad condition does occur.
>
> Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but
> you sure are glad you have it when an accident occurs.
 
> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.

Well, pre-sorted inputs are not the only quadratic case. If we really
wanted to eliminate quadratic cases we could implement the pivot
choice algorithm that guarantees n*log(n) worst-case. But that will
increase the average run-time. If we're not doing that then I think
your whole argument falls apart. We do care about the average case as
well as the worst-case.

There's been a *ton* of research on sorting. I find it hard to believe
there isn't a pretty solid consensus on which of these defense
mechanisms is the best trade-off.


-- 
greg



-----Original Message-----
>From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
>Sent: Friday, March 08, 2013 4:59 PM
>To: Dann Corbit
>Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
>Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
>
>On Fri, Mar 8, 2013 at 7:43 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> Checking for pre-sorted input will not make the routine faster on average.
>> However, it prevents a common perverse case where the runtime goes quadratic, so sorting 10^6 elements will take
>> K*10^12 operations when the bad condition does occur.
>> Checking for pre-sorted data can be thought of as an insurance policy.  It's kind of a pain to pay the premiums, but
>> you sure are glad you have it when an accident occurs.
>> Because the check is linear, and the algorithm is O(n*log(n)), the cost is not terribly significant.
>
>Well pre-sorted inputs are not the only quadratic case. If we really wanted to eliminate quadratic cases we could
>implement the pivot choice algorithm that guarantees n*log(n) worst-case.

There are three things that give qsort fits:
1.  Large ascending partitions.
2.  Large descending partitions.
3.  Bad choice of the median.
If you examine the link in the previous article I gave:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
It has a subfolder located by following this link:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
In that folder is a file called qsortpdq.c
In file qsortpdq.c is a function:
void qsortPDQ(void *a, size_t n, size_t es, int (*cmp) (const void *, const void *));
which is an interface that checks for ascending partitions, descending partitions, and does a median of 3 choice for
the pivot.
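For illustration, those two ingredients might look like this; the names are mine, not taken from qsortpdq.c:

```c
#include <stddef.h>

/*
 * Illustrative sketches of two of the safeguards described above: a
 * median-of-3 pivot choice and a check for ascending/descending runs.
 * Names are mine, not taken from qsortpdq.c.
 */
static int
med3(int a, int b, int c)
{
    if (a < b)
        return b < c ? b : (a < c ? c : a);
    return a < c ? a : (b < c ? c : b);
}

/* Returns 1 for an ascending run, -1 for a strictly descending run,
 * 0 for mixed order.  On 1 a caller can return early; on -1 it can
 * reverse the range in O(n) instead of sorting it. */
static int
run_direction(const int *a, size_t n)
{
    size_t  i;
    int     asc = 1, desc = 1;

    for (i = 1; i < n && (asc || desc); i++)
    {
        if (a[i - 1] > a[i])
            asc = 0;
        if (a[i - 1] < a[i])
            desc = 0;
    }
    return asc ? 1 : (desc ? -1 : 0);
}
```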
Now, even this routine will have very rare inputs that can cause it to go quadratic.  The probability of one of these
particular inputs is very low.

If you want 100% guarantee against quadratic behavior, then qsortpdq can be wrapped in heapsort with a recursion depth
check. That sort would never go quadratic.  That method is called introspective sort. 
Here are some links that describe introspective sort:
http://en.wikipedia.org/wiki/Introsort
https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/heapsort/pages/introspect.html
And this is the real thing, right from the horse's mouth:
http://www.cs.rpi.edu/~musser/gp/introsort.ps

There is no such thing as a quicksort that never goes quadratic.  It was formally proven.  I cannot find the proof that
I once read, but the article "A Killer Adversary for Quicksort" by M. D. McIlroy answers pretty nearly.

Here is the summary from that paper:

SUMMARY
Quicksort can be made to go quadratic by constructing input on the fly in response to the sequence of
items compared. The technique is illustrated by a specific adversary for the standard C qsort function.
The general method works against any implementation of quicksort, even a randomizing one, that satisfies
certain very mild and realistic assumptions.

>But that will increase the average run-time. If we're not doing that then I think your whole argument falls apart. We
>do care about the average case as well as the worst-case.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/dann/
will build the sort system and test it.
The makefile in folder:
http://www.cs.toronto.edu/~zhouqq/postgresql/sort/sort.html
will build the simpler sort system and test it.

I would suggest, in addition, the use of this routine to create a tough data set for robustness:
http://www.cs.dartmouth.edu/~doug/aqsort.c
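A condensed sketch of the adversary technique behind that routine, adapted from McIlroy's paper; the globals and names here are mine:

```c
#include <stdlib.h>

/*
 * Condensed sketch of McIlroy's adversary ("A Killer Adversary for
 * Quicksort").  Keys start out as undifferentiated "gas" and are frozen
 * to concrete values only when the comparator is forced to commit, so
 * whatever the qsort under test does, the frozen array left behind is a
 * pessimal input for re-running that same qsort.  Names are mine.
 */
static int *val;        /* keys, frozen on demand */
static int  gas;        /* sentinel meaning "not yet frozen" */
static int  nsolid;     /* number of keys frozen so far */
static int  candidate;  /* index currently suspected to be the pivot */
static long ncmp;       /* comparisons observed */

static void
freeze(int x)
{
    val[x] = nsolid++;
}

static int
anti_cmp(const void *px, const void *py)
{
    int     x = *(const int *) px;
    int     y = *(const int *) py;

    ncmp++;
    if (val[x] == gas && val[y] == gas)
    {
        /* Both unknown: freeze the one we suspect is the pivot. */
        if (x == candidate)
            freeze(x);
        else
            freeze(y);
    }
    if (val[x] == gas)
        candidate = x;
    else if (val[y] == gas)
        candidate = y;
    return val[x] - val[y];
}

/* Run the adversary against libc qsort on n indices; val[0..n-1] then
 * holds the constructed input, and ncmp what that run cost. */
static long
run_adversary(int n)
{
    int    *ptr = malloc(n * sizeof(int));
    int     i;

    val = malloc(n * sizeof(int));
    gas = n - 1;                /* gas compares high vs. frozen keys */
    nsolid = 0;
    candidate = 0;
    ncmp = 0;
    for (i = 0; i < n; i++)
    {
        ptr[i] = i;
        val[i] = gas;
    }
    qsort(ptr, n, sizeof(int), anti_cmp);
    free(ptr);
    return ncmp;
}
```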

>There's been a *ton* of research on sorting. I find it hard to believe there isn't a pretty solid consensus on which
>of these defense mechanisms is the best trade-off.
I'm partial to chapter 13 of "C Unleashed" in that regard.  As far as the trade-offs go, the problem is that they
really are trade-offs.  However, the cost is quite small (benchmark and see) and the risk of not performing the checks
is enormous.  Mostly sorted data happens all the time in real life, as do other perverse distributions that cause
problems for sorting algorithms.

My own personal recommendation would be to include every single safeguard (all three data checks and a recursion depth
check as well).
That's what we do here (I work at a database tools company).




On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

-- 
greg



-----Original Message-----
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 11:39 AM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:32 AM, Dann Corbit <DCorbit@connx.com> wrote:
> There is no such thing as a quicksort that never goes quadratic.  It was formally proven

The median of medians selection of the pivot gives you O(n*log(n)).

>>
No.  It does make O(n*n) far less probable, but it does not eliminate it.  If it were possible, then introspective
sort would be totally without purpose.  And yet introspective sort is the algorithm of choice for the ANSI/ISO C++
standard directly because it will never go quadratic.
Bentley and McIlroy's "Engineering a Sort Function" says this about their final qsort source code model chosen for
their implementation in the footnote:
"Of course, quadratic behavior is still possible. One can generate fiendish inputs by bugging Quicksort: Consider
key values to be unknown initially. In the code for selecting a partition element, assign values in increasing
order as unknown keys are encountered. In the partitioning code, make unknown keys compare high."

Interestingly, some standard library qsort routines will go quadratic if simply given a set that contains random 0 and
1 values for the input set.

It has been formally proven that no matter what method used to choose the median (other than calculating the exact
median of every partition, which would make quicksort slower than heapsort) there exists an "adversary" distribution
that makes quicksort go quadratic (unfortunately, I can't find the proof or I would give you the citation).  Median
selection that is randomized does not select the true median unless it operates over all the elements of the entire
partition.  With some small probability, it makes the worst possible choice for the median.  It also does not matter if
you choose the first, last and middle elements for a median of three (or other evenly selected points for larger sample
sets) or if you select the sample with randomization.  Both methods have adversaries.  Now, the quickselect algorithm
can select the median in O(n) but that is again an average.  There are also adversary distributions for quickselect.

On the other hand, various safeguards can make O(n*n) extremely improbable {even probabilities approaching zero}.
There is a tradeoff with more and more complicated safeguards.  If I choose large subsets of a partition and calculate
the true median of the subset, it will make the routine safer from quadratic behavior but it will also make it slower.
At some point, you end up with a routine no faster than heapsort, while also being much more complex and therefore more
difficult to maintain.  Hence, the necessity of introspective sort.

On the other hand, there are also data specific sorts that have very special properties.  There are cache conscious
versions of burstsort that can sort strings much faster than quicksort or even radix sort according to measurements.
There is a special version of LSD radix sort that will sort an array in one counting pass and N distribution passes,
where N is the number of radix-sized digits in a key.  So, for instance, if you are sorting 4 byte integers and your
radix is 256 {one byte} then you can sort in 5 passes.  One pass to count the bytes in each integer by position, and 4
passes to distribute the data.  This can also be done in place.  MSD radix sort can also be used as an outer sort
container which sorts by distribution until the partitions are small enough and then switches to introspective sort
(which eventually switches to insertion sort).  Radix sorts can also be made totally generic via callback routines like
quicksort.  Instead of a comparison operator, the callback gets a request for the radix bits for a specific bin number
and then returns the radix count bits for that specific bin.  For unsigned char and radix 256, this is trivially
character [i] from the string for bin [i], for example.
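A sketch of the one-counting-pass scheme just described, assuming uint32 keys and radix 256; it uses an O(n) scratch buffer rather than the in-place variant mentioned above:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * LSD radix sort over uint32 keys, radix 256: one counting pass builds
 * all four histograms at once, then four stable distribution passes
 * ping-pong between the array and a scratch buffer.  (The in-place
 * variant mentioned above is trickier; this sketch spends O(n) scratch.)
 */
static void
radix_sort_u32(uint32_t *a, size_t n)
{
    size_t      count[4][256] = {{0}};
    uint32_t   *buf = malloc(n * sizeof(uint32_t));
    uint32_t   *src = a;
    uint32_t   *dst = buf;
    size_t      i;
    int         d;

    /* One counting pass: histogram every byte position at once. */
    for (i = 0; i < n; i++)
        for (d = 0; d < 4; d++)
            count[d][(a[i] >> (8 * d)) & 0xFF]++;

    /* Four distribution passes, least significant byte first. */
    for (d = 0; d < 4; d++)
    {
        size_t      pos[256];
        size_t      sum = 0;
        uint32_t   *tmp;
        int         b;

        for (b = 0; b < 256; b++)   /* prefix sums -> start offsets */
        {
            pos[b] = sum;
            sum += count[d][b];
        }
        for (i = 0; i < n; i++)     /* stable scatter by byte d */
            dst[pos[(src[i] >> (8 * d)) & 0xFF]++] = src[i];
        tmp = src; src = dst; dst = tmp;
    }
    /* An even number of swaps, so the sorted data ends up back in a. */
    free(buf);
}
```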

I guess improving sort routines all boils down to how much energy you want to put into it.  Donald Knuth once stated
that according to measurement, about 40% of business related mainframe computer cycles are involved in ordering data.
So it is clearly an important problem.
<<



"A Machine-Checked Proof of the Average-Case Complexity of Quicksort in Coq"
By Eelis van der Weegen and James McKinna
Institute for Computing and Information Sciences
Radboud University Nijmegen
Heijendaalseweg 135, 6525 AJ Nijmegen, The Netherlands

Contains a formal proof, validated by machine logic, that quicksort is quadratic in the worst case and also a formal
proof of the average runtime efficiency.




On Sat, Mar 9, 2013 at 8:52 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Median of medians selection of the pivot gives you O(n*log(n)).
>
> No.  It does make O(n*n) far less probable, but it does not eliminate it.  If it were possible, then introspective
> sort would be totally without purpose.

No really, quicksort with median of medians pivot selection is most
definitely O(n*log(n)) worst case. This is textbook stuff. In fact even the
introspective sort paper mentions it as one of the options to fail
over to if the partition size isn't decreasing rapidly enough.

The problem is that median of medians is O(n) rather than O(1). That
doesn't change the O() growth rate since there will be log(n)
iterations. But it means it contributes to the constant factor and the
end result ends up being a constant factor larger than heap sort or
merge sort. That also explains how your reference on the quicksort
adversary doesn't apply. It works by ordering elements you haven't
compared yet and assumes that all but O(1) elements will still be
eligible for reordering.

In any case I think you're basically right. What we have is basically
a broken introspective sort that does more work than necessary and
then handles fewer cases than it should. Implementing a better
introspection that detects all perverse cases and does so with a lower
overhead than the current check is a fine idea.


-- 
greg



Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of
medians median selection algorithm (which it definitely isn't).
I was not aware of a true linear time selection of the median algorithm {which is what median of medians accomplishes}.
The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average.
I think that your analysis is correct, at any rate.

I also think I will enjoy learning and experimenting with the median of medians algorithm.
I found a link about it here:
http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm




On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit <DCorbit@connx.com> wrote:
> Yes, you are right.  I knew of a median of medians technique for pivot selection and I mistook that for the median of
> medians median selection algorithm (which it definitely isn't).
> I was not aware of a true linear time selection of the median algorithm {which is what median of medians
> accomplishes}.  The fastest median selection algorithm that I was aware of was quickselect, which is only linear on
> average.
> I think that you analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page. I
was referring to the recursive median of 5s used as the pivot
selection as "median of medians". And I still called Quicksort or
Quickselect using that pivot Quicksort or Quickselect with that
specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) and Quickselect
(Median of Medians on Wikipedia) is O(n). But the constant factor
becomes larger than if the pivot choice algorithm is O(1). I suppose
it's more interesting in the case of Quickselect since there's no
other alternative algorithms that could be used that have better
constant factors whereas for sorting we have other options.


I wonder if it makes sense to use timsort as the fallback for
quicksort if the partition sizes are skewed. Timsort is specifically
designed to handle presorted inputs well.  On the other hand it is
itself a hybrid sort so it might be getting overly complex to make it
part of a hybrid algorithm.

-- 
greg



>>
My opinion (and it is no better than anyone else's) is that for a database you have to be very defensive in
programming.
Since database systems can contain any possible distribution (and over time they likely will encounter almost every
possibility) it is important to prevent unusual inputs from causing disaster.

The reason we added introspection to the sort here is that we already had quicksort with ordered partition check along
with a median of three sample from three medians of three (not to mention the standard recursion of the smallest
partition first and switch to insertion at small enough partition size).  Sure enough, there was a customer who had a
data distribution that caused bad behavior.

We have customers with mainframe data where a single table can be 24 GB (I know this offhand, there are sure to be some
that are larger), so a bad behavior on a sort *will* be noticed.

Sorry about the oddball quoting.  I will look at fixing it so that my posts are not so hard to grok.
<<