Re: No merge sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: No merge sort?
Date
Msg-id 5Fpka.13367$ey1.1150154@newsread1.prod.itd.earthlink.net
Whole thread Raw
In response to Re: No merge sort?  (cbbrowne@cbbrowne.com)
List pgsql-hackers
<cbbrowne@cbbrowne.com> wrote in message
news:20030407202001.1EC3C58E0D@cbbrowne.com...
> Ron Peacetree wrote:
> > AFAIK, there are only 3 general purpose internal sorting
techniques
> > that have O(n) behavior:
>
> Hum?  NO "general purpose" sorting technique has O(n) behaviour.
>
> The theoretical best scenario, _in general_, is O(n log n).
The O(log(n!)) bound is only for comparison based sorts operating on
arbitarily disordered input.  There are general techniques that can
sort in O(n) time if we can "break" either assumption.  In a DB, we
often are dealing with data that is only slightly disordered, or we
are have enough meta knowledge that we can use more powerful ordering
operators than simple comparisons, or both.


> Insertion sort is expected to provide O(n^2) behaviour, and
radix-like
> schemes get arbitrarily memory hungry and have bad pathological
results.
>
None of these comments is accurate.
The sources for the following discussion are
A= Vol 3 of Knuth 2nd ed, (ISBN 0-201-89685-0)
B= Sedgewick's Algorithms in C, (0-201-51425-7)
C= Handbook of Algorithms and Data Structures 2nd ed by Gonnet and
Baeza-Yates. (0-201-41607-7)

1= Insertion sort is O(n) for "almost sorted" input.
p103 of Sedgewick.  There's also discussion on this in Knuth.

2= Distribution Sort and it's "cousins" which use more powerful
ordering operators than simply comparisons are a= reasonably general,
and b= O(n).
Look in all three references.

3= A proper implementation of straight Radix sort using 8b or 16b at a
time a= is NOT pathological, and b= is O(n).
Sedgewick is the most blunt about it on p142-143, but Knuth discusses
this as well.

All of the above are stable, which quicksort is not.  There are no
"pessimal" inputs for any of the above that will force worst case
behavior.  For quicksort there are (they are =very= unlikely however).
In real world terms, if you can use any of these approaches you should
be able to internally sort your data between 2X and 3X faster.


Unfortunately, most of us do not have the luxury of working with
Memory Resident DB's.  In the real world, disk access is an important
part of our DB sorting efforts, and that changes things.  Very fast
internal sorting routines combined with multidisk merging algorithms
that minimize overall disk I/O while maximizing the sequential nature
of that I/O are the best we can do overall in such a situation.



pgsql-hackers by date:

Previous
From: Andreas Pflug
Date:
Subject: pg_get_viewdef 7.4 et al
Next
From: "Ron Peacetree"
Date:
Subject: Re: Anyone know why PostgreSQL doesn't support 2 phase execution?