Re: No merge sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: No merge sort?
Date
Msg-id rBYja.11707$4P1.1013834@newsread2.prod.itd.earthlink.net
Whole thread Raw
In response to No merge sort?  (Taral <taral@taral.net>)
Responses Re: No merge sort?  (Greg Stark <gsstark@mit.edu>)
Re: No merge sort?  (cbbrowne@cbbrowne.com)
List pgsql-hackers
AFAIK, there are only 3 general purpose internal sorting techniques
that have O(n) behavior:
1= Insertion Sort for "almost sorted" lists.  Since "almost sorted" is
a fuzzy concept, let's make the first approximation definition that no
more than n^(1/2) of the elements can be disordered.  There are better
definitions in the literature, but I don't remember them off the top
of my head.

2= Sorts from the class of Address Calculation, Counting, or
Distribution Sort.  These need to be able to carry out something more
complex than simply a comparison in order to achieve O(n), and
therefore have high constants in their execution.  For large enough n
though, these are the performance kings.

3= Straight Radix Sort where you minimize the number of passes by
using a base much greater than two for the the radix.  Usually octal
or hexidecimal.  On a 32b or 64b system, this approach will let you
sort in 2 passes.

All of the above have potentially nasty trade-offs in comparision to
the standard heavily tuned median-of-three quicksort used by the sort
unix command.  Nonetheless, I could see some value in providing all of
these with PostgeSQL (including a decent port of the unix sort routine
for the Win folks).  I'll note in passing that quicksort's Achille's
heel is that it's unstable (all of the rest are stable), which can be
a problem in a DB.

In the general case there's a few papers out there that state you can
sort in O(n) if you can throw O(n^2) space at the problem.  That
implies to me that for DB's, we are not going to be able to use O(n)
algorithms for most of our needs.


As for external sorts, everything I've ever read says that some sort
of merge technique is used:  balanced, multiway, or polyphase.  In all
cases, I've seen comments to the effect that reading some part of the
data into internal buffers, sorting them, and then merging them with
already sorted data is the best practice.


All of this seems to imply that instead of mergesort (which is
stable), PostgreSQL might be better off with the 4 sorts I've listed
plus a viciously efficient merge utility for combining partial results
that do fit into memory into result files on disk that don't.


Or am I crazy?

Ron Peacetree



pgsql-hackers by date:

Previous
From: "Ron Peacetree"
Date:
Subject: Anyone know why PostgreSQL doesn't support 2 phase execution?
Next
From: Rod Taylor
Date:
Subject: Re: information_schema 7.4