Thread: Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

From
"Dann Corbit"
Date:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Wednesday, February 15, 2006 5:22 PM
> To: Ron
> Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index
> behaviour)
>
> Ron <rjpeace@earthlink.net> writes:
> > How are we choosing our pivots?
>
> See qsort.c: it looks like median of nine equally spaced inputs (ie,
> the 1/8th points of the initial input array, plus the end points),
> implemented as two rounds of median-of-three choices.  With half of
the
> data inputs zero, it's not too improbable for two out of the three
> samples to be zeroes in which case I think the med3 result will be
zero
> --- so choosing a pivot of zero is much more probable than one would
> like, and doing so in many levels of recursion causes the problem.

Adding some randomness to the selection of the pivot is a known
technique to fix the oddball partitions problem.  However, Bentley and
Sedgewick proved that every quick sort algorithm has some input set that
makes it go quadratic (hence the recent popularity of introspective
sort, which switches to heapsort if quadratic behavior is detected.  The
C++ template I submitted was an example of introspective sort, but
PostgreSQL does not use C++ so it was not helpful).

> I think.  I'm not too sure if the code isn't just being sloppy about
the
> case where many data values are equal to the pivot --- there's a
special
> case there to switch to insertion sort, and maybe that's getting
invoked
> too soon.

Here are some cases known to make qsort go quadratic:
1. Data already sorted
2. Data reverse sorted
3. Data organ-pipe sorted or ramp
4. Almost all data of the same value

There are probably other cases.  Randomizing the pivot helps some, as
does check for in-order or reverse order partitions.

Imagine if 1/3 of the partitions fall into a category that causes
quadratic behavior (have one of the above formats and have more than
CUTOFF elements in them).

It is doubtful that the switch to insertion sort is causing any sort of
problems.  It is only going to be invoked on tiny sets, for which it has
a fixed cost that is probably less that qsort() function calls on sets
of the same size.

>It'd be useful to get a line-level profile of the behavior of
> this code in the slow cases...

I guess that my in-order or presorted tests [which often arise when
there are very few distinct values] may solve the bad partition
problems.  Don't forget that the algorithm is called recursively.

>             regards, tom lane
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

Re: [HACKERS] qsort again (was Re: Strange Create

From
Ron
Date:
At 08:37 PM 2/15/2006, Dann Corbit wrote:

>Adding some randomness to the selection of the pivot is a known
>technique to fix the oddball partitions problem.

True, but it makes QuickSort slower than say MergeSort because of the
expense of the PRNG being called ~O(lgN) times during a sort.


>However, Bentley and Sedgewick proved that every quick sort
>algorithm has some input set that makes it go quadratic

Yep.  OTOH, that input set can be so specific and so unusual as to
require astronomically unlikely bad luck or hostile hacking in order
for it to actually occur.


>  (hence the recent popularity of introspective sort, which switches
> to heapsort if quadratic behavior is detected.  The C++ template I
> submitted was an example of introspective sort, but PostgreSQL does
> not use C++ so it was not helpful).
...and there are other QuickSort+Other hybrids that address the issue
as well.  MergeSort, RadixExchangeSort, and BucketSort all come to
mind.  See Gonnet and Baeza-Yates, etc.


>Here are some cases known to make qsort go quadratic:
>1. Data already sorted

Only if one element is used to choose the pivot; _and_ only if the
pivot is the first or last element of each pass.
Even just always using the middle element as the pivot avoids this
problem.  See Sedgewick or Knuth.


>2. Data reverse sorted

Ditto above.


>3. Data organ-pipe sorted or ramp

Not sure what this means?  Regardless, median of n partitioning that
includes samples from each of the 1st 1/3, 2nd 1/3, and final 3rd of
the data is usually enough to guarantee O(NlgN) behavior unless the
_specific_ distribution known to be pessimal to that sampling
algorithm is encountered.  The only times I've ever seen it ITRW was
as a result of hostile activity: purposely arranging the data in such
a manner is essentially a DoS attack.


>4. Almost all data of the same value

Well known fixes to inner loop available to avoid this problem.


>There are probably other cases.  Randomizing the pivot helps some,
>as does check for in-order or reverse order partitions.
Randomizing the choice of pivot essentially guarantees O(NlgN)
behavior no matter what the distribution of the data at the price of
increasing the cost of each pass by a constant factor (the generation
of a random number or numbers).


In sum, QuickSort gets all sorts of bad press that is far more FUD
than fact ITRW.
Ron.



Re: [HACKERS] qsort again (was Re: Strange Create Index behaviour)

From
Bruce Momjian
Date:
Added to TODO:

    * Improve port/qsort() to handle sorts with 50% unique and 50% duplicate
      value [qsort]

      This involves choosing better pivot points for the quicksort.


---------------------------------------------------------------------------

Dann Corbit wrote:
>
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> > owner@postgresql.org] On Behalf Of Tom Lane
> > Sent: Wednesday, February 15, 2006 5:22 PM
> > To: Ron
> > Cc: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
> Index
> > behaviour)
> >
> > Ron <rjpeace@earthlink.net> writes:
> > > How are we choosing our pivots?
> >
> > See qsort.c: it looks like median of nine equally spaced inputs (ie,
> > the 1/8th points of the initial input array, plus the end points),
> > implemented as two rounds of median-of-three choices.  With half of
> the
> > data inputs zero, it's not too improbable for two out of the three
> > samples to be zeroes in which case I think the med3 result will be
> zero
> > --- so choosing a pivot of zero is much more probable than one would
> > like, and doing so in many levels of recursion causes the problem.
>
> Adding some randomness to the selection of the pivot is a known
> technique to fix the oddball partitions problem.  However, Bentley and
> Sedgewick proved that every quick sort algorithm has some input set that
> makes it go quadratic (hence the recent popularity of introspective
> sort, which switches to heapsort if quadratic behavior is detected.  The
> C++ template I submitted was an example of introspective sort, but
> PostgreSQL does not use C++ so it was not helpful).
>
> > I think.  I'm not too sure if the code isn't just being sloppy about
> the
> > case where many data values are equal to the pivot --- there's a
> special
> > case there to switch to insertion sort, and maybe that's getting
> invoked
> > too soon.
>
> Here are some cases known to make qsort go quadratic:
> 1. Data already sorted
> 2. Data reverse sorted
> 3. Data organ-pipe sorted or ramp
> 4. Almost all data of the same value
>
> There are probably other cases.  Randomizing the pivot helps some, as
> does check for in-order or reverse order partitions.
>
> Imagine if 1/3 of the partitions fall into a category that causes
> quadratic behavior (have one of the above formats and have more than
> CUTOFF elements in them).
>
> It is doubtful that the switch to insertion sort is causing any sort of
> problems.  It is only going to be invoked on tiny sets, for which it has
> a fixed cost that is probably less that qsort() function calls on sets
> of the same size.
>
> >It'd be useful to get a line-level profile of the behavior of
> > this code in the slow cases...
>
> I guess that my in-order or presorted tests [which often arise when
> there are very few distinct values] may solve the bad partition
> problems.  Don't forget that the algorithm is called recursively.
>
> >             regards, tom lane
> >
> > ---------------------------(end of
> broadcast)---------------------------
> > TIP 3: Have you checked our extensive FAQ?
> >
> >                http://www.postgresql.org/docs/faq
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

--
  Bruce Momjian   http://candle.pha.pa.us
  SRA OSS, Inc.   http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +