Thread: Re: qsort again (was Re: [PERFORM] Strange Create Index

Re: qsort again (was Re: [PERFORM] Strange Create Index

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Markus Schaber
> Sent: Thursday, February 16, 2006 5:45 AM
> To: pgsql-performance@postgresql.org; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create
Index
>
> Hi, Ron,
>
> Ron wrote:
>
> > ...and of course if you know enough about the data to be sorted so
as to
> > constrain it appropriately, one should use a non comparison based
O(N)
> > sorting algorithm rather than any of the general comparison based
> > O(NlgN) methods.
>
> Sounds interesting, could you give us some pointers (names, URLs,
> papers) to such algorithms?

He refers to counting sort and radix sort (which comes in most
significant digit and least significant digit format).  These are also
called distribution (as opposed to comparison) sorts.

These sorts are O(n) as a function of the data size, but really they are
O(M*n) where M is the average key length and n is the data set size.
(In the case of MSD radix sort M is the average length to completely
differentiate strings)

So if you have an 80 character key, then 80*log(n) will only be faster
than n*log(n) when you have 2^80th elements -- in other words -- never.

If you have short keys, on the other hand, distribution sorts will be
dramatically faster.  On an unsigned integer, for instance, it requires
4 passes with 8 bit buckets and so 16 elements is the crossover to radix
is faster than an O(n*log(n)) sort.  Of course, there is a fixed
constant of proportionality and so it will really be higher than that,
but for large data sets distribution sorting is the best thing that
there is for small keys.

You could easily have an introspective MSD radix sort.  The nice thing
about MSD radix sort is that you can retain the ordering that has
occurred so far and switch to another algorithm.

An introspective MSD radix sort could call an introspective quick sort
algorithm once it processed a crossover point of buckets of key data.

In order to have distribution sorts that will work with a database
system, for each and every data type you will need a function to return
the bucket of bits of significance for the kth bucket of bits.  For a
character string, you could return key[bucket].  For an unsigned integer
it is the byte of the integer to return will be a function of the
endianness of the CPU.  And for each other distinct data type a bucket
function would be needed or a sort could not be generated for that type
using the distribution method.

Re: qsort again (was Re: [PERFORM] Strange Create Index

From
Jens-Wolfhard Schicke
Date:
--On Donnerstag, Februar 16, 2006 10:39:45 -0800 Dann Corbit
<DCorbit@connx.com> wrote:

> He refers to counting sort and radix sort (which comes in most
> significant digit and least significant digit format).  These are also
> called distribution (as opposed to comparison) sorts.
>
> These sorts are O(n) as a function of the data size, but really they are
> O(M*n) where M is the average key length and n is the data set size.
> (In the case of MSD radix sort M is the average length to completely
> differentiate strings)
>
> So if you have an 80 character key, then 80*log(n) will only be faster
I suppose you meant 80 * n here?

> than n*log(n) when you have 2^80th elements -- in other words -- never.
I think this is wrong. You can easily improve Radix sort by a constant if
you don't take single bytes as the digits but rather k-byte values. At
least 2 byte should be possible without problems. This would give you 40 *
n time, not 80 * n. And you assume that the comparision of an 80-byte wide
value only costs 1, which might (and in many cases will be imho) wrong.
Actually it migh mean to compare 80 bytes as well, but I might be wrong.

What I think as the biggest problem is the digit representation necessary
for Radix-Sort in cases of locales which sort without looking at spaces. I
assume that would be hard to implement. The same goes for the proposed
mapping of string values onto 4/8-byte values.

Mit freundlichem Gruß
Jens Schicke
--
Jens Schicke              j.schicke@asco.de
asco GmbH              http://www.asco.de
Mittelweg 7              Tel 0531/3906-127
38106 Braunschweig          Fax 0531/3906-400


Re: qsort again (was Re: [PERFORM] Strange Create Index

From
Martijn van Oosterhout
Date:
On Fri, Feb 17, 2006 at 09:18:39AM +0100, Jens-Wolfhard Schicke wrote:
> What I think as the biggest problem is the digit representation necessary
> for Radix-Sort in cases of locales which sort without looking at spaces. I
> assume that would be hard to implement. The same goes for the proposed
> mapping of string values onto 4/8-byte values.

Actually, this is easy. The standard C library provides strxfrm() and
other locale toolkits like ICU provide ucol_getSortKey(). Windows
provides LCMapString(). Just pass each string through this and take the
first four bytes of the result to form your integer key.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.