Re: No merge sort? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: No merge sort?
Date
Msg-id D90A5A6C612A39408103E6ECDD77B829408ABE@voyager.corporate.connx.com
Whole thread Raw
In response to No merge sort?  (Taral <taral@taral.net>)
List pgsql-hackers
> -----Original Message-----
> From: cbbrowne@cbbrowne.com [mailto:cbbrowne@cbbrowne.com]
> Sent: Monday, April 07, 2003 1:20 PM
> To: Ron Peacetree
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] No merge sort?
>
>
> Ron Peacetree wrote:
> > AFAIK, there are only 3 general purpose internal sorting techniques
> > that have O(n) behavior:
>
> Hum?  NO "general purpose" sorting technique has O(n) behaviour.
>
> The theoretical best scenario, _in general_, is O(n log n).

The ACM has published papers on several techniques that are O(n log log
n)
Radix and counting sort can be totally generalized in the same way that
comparision sorts can by creating a callback function.

If I have strings of 50 characters in length, I can count all RADIX
characters in an array of unsigned long counts[RADIX][50] in a single
pass.  As the key strings become successively larger, radix sorts fall
flat compared to comparison techniques, but a counting pass can also
determine whether distribution or comparision sorting will be better.
> Insertion sort is expected to provide O(n^2) behaviour, and
> radix-like schemes get arbitrarily memory hungry and have bad
> pathological results.

However, the pathological cases can be detected.  A typical pathological
case for real database work is an array where the initial data is
identical for a long distance.  This is not unusual in an ORDER BY/GROUP
BY scenario.
> > All of the above have potentially nasty trade-offs in
> comparision to
> > the standard heavily tuned median-of-three quicksort used
> by the sort
> > unix command.  Nonetheless, I could see some value in
> providing all of
> > these with PostgeSQL (including a decent port of the unix
> sort routine
> > for the Win folks).  I'll note in passing that quicksort's
> Achille's
> > heel is that it's unstable (all of the rest are stable),
> which can be
> > a problem in a DB.

A better sort can make your database dominatingly better than others for
some operations.  It can be an enormous technical advantage.
> Making one sort algorithm work very efficiently is quite
> likely to be a lot more effective than frittering away time
> trying to get some special cases (that you can't regularly
> use) to work acceptably.

I don't agree.  In fact, this is never true.  A decent sorting routine
usually involves several techniques.  Consider the library qsort
routine.  Usually, it will use quicksort with several special tests and
a randomized median selection until the set size becomes small enough.
Then, we switch to insertion sort or shellsort.
Or consider the STL and Plauger's library sort routine.  It is
introspective sort and is a hybrid of three algorithms (quicksort,
heapsort, and insertion sort).
> > All of this seems to imply that instead of mergesort (which is
> > stable), PostgreSQL might be better off with the 4 sorts
> I've listed
> > plus a viciously efficient merge utility for combining
> partial results
> > that do fit into memory into result files on disk that don't.
> >
> > Or am I crazy?
>
> More than likely.  It is highly likely that it will typically
> take more computational effort to figure out that one of the
> 4 sorts provided /any/ improvement than any computational
> effort they would save.

It is highly unlikely that this is true.  In fact, every good sorting
routine does *exactly* this.
> That's a /very/ common problem.  There's also a fair chance,
> seen in practice, that the action of collecting additional
> statistics to improve query optimization will cost more than
> the savings provided by the optimizations.
> --
> (concatenate 'string "cbbrowne" "@acm.org")
> http://www3.sympatico.ca/cbbrowne/wp.html
> When ever in doubt
> consult a song. --JT Fletcher
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
http://archives.postgresql.org



pgsql-hackers by date:

Previous
From: "Jason M. Felice"
Date:
Subject: Re: No merge sort?
Next
From: "Dann Corbit"
Date:
Subject: Re: No merge sort?