Re: qsort again (was Re: [PERFORM] Strange Create Index - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: qsort again (was Re: [PERFORM] Strange Create Index
Date
Msg-id D425483C2C5C9F49B5B7A41F8944154757D551@postal.corporate.connx.com
Whole thread Raw
Responses Re: qsort again (was Re: [PERFORM] Strange Create Index
List pgsql-hackers
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Markus Schaber
> Sent: Thursday, February 16, 2006 5:45 AM
> To: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index
>
> Hi, Ron,
>
> Ron wrote:
>
> > ...and of course if you know enough about the data to be sorted so
as to
> > constrain it appropriately, one should use a non comparison based
O(N)
> > sorting algorithm rather than any of the general comparison based
> > O(NlgN) methods.
>
> Sounds interesting, could you give us some pointers (names, URLs,
> papers) to such algorithms?

He refers to counting sort and radix sort (which comes in most
significant digit and least significant digit format).  These are also
called distribution (as opposed to comparison) sorts.

These sorts are O(n) as a function of the data size, but really they are
O(M*n) where M is the average key length and n is the data set size.
(In the case of MSD radix sort M is the average length to completely
differentiate strings)

So if you have an 80 character key, then 80*log(n) will only be faster
than n*log(n) when you have 2^80th elements -- in other words -- never.

If you have short keys, on the other hand, distribution sorts will be
dramatically faster.  On an unsigned integer, for instance, it requires
4 passes with 8 bit buckets and so 16 elements is the crossover to radix
is faster than an O(n*log(n)) sort.  Of course, there is a fixed
constant of proportionality and so it will really be higher than that,
but for large data sets distribution sorting is the best thing that
there is for small keys.

You could easily have an introspective MSD radix sort.  The nice thing
about MSD radix sort is that you can retain the ordering that has
occurred so far and switch to another algorithm.

An introspective MSD radix sort could call an introspective quick sort
algorithm once it processed a crossover point of buckets of key data.

In order to have distribution sorts that will work with a database
system, for each and every data type you will need a function to return
the bucket of bits of significance for the kth bucket of bits.  For a
character string, you could return key[bucket].  For an unsigned integer
it is the byte of the integer to return will be a function of the
endianness of the CPU.  And for each other distinct data type a bucket
function would be needed or a sort could not be generated for that type
using the distribution method.

pgsql-hackers by date:

Previous
From: Scott Lamb
Date:
Subject: Re: qsort again (was Re: [PERFORM] Strange Create
Next
From: Ron
Date:
Subject: Re: qsort again (was Re: [PERFORM] Strange Create