Re: No merge sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: No merge sort?
Date
Msg-id FE_ka.16074$ey1.1385100@newsread1.prod.itd.earthlink.net
Whole thread Raw
In response to Re: No merge sort?  ("Dann Corbit" <DCorbit@connx.com>)
List pgsql-hackers
""Dann Corbit"" <DCorbit@connx.com> wrote in message
news:D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.co
m...
> Distribution sort is not an algorithm.  It is a general technique.
Both
> counting sort and radix sort are distribution sorts.  I think you
are
> talking about counting sort here.  In order to insert a technique
into a
> database, you must solve the general case.
>
Terminology problem.  Please look in the references I posted.


> > 2= For Radix sort, that's iff ("if and only if") you use
> > =characters= as the radix of a radix sort
> > and do a MSB aka partition-exchange sort. The appropriate
> > radix here is a =3Dfield=3D not a character.  Since there are 3
> > fields vs 60 characters the problem becomes 2/3 wasted passes
> > instead of 40/60.
>
> This is lunacy.  If you count the field or use it as a radix, you
will
> need a radix of 40*8 bits= 320 bits.  That means you will have 2^320
> 2.136e96 bins.
>
You only need as many bins as you have unique key values silly ;-)
Remember, at its core Radix sort is just a distribution counting sort
(that's the name I learned for the general technique).  The simplest
implementation uses bits as the atomic unit, but there's nothing
saying you have to...  In a DB, we know all the values of the fields
we currently have stored in the DB.  We can take advantage of that.

As you correctly implied, the goal is to minimize the number of unique
bins you have to, err, distribute, items into.  That and having as few
duplicates as feasible are the important points.

If, for example, we have <= 255 possible values for the each radix
(Using your previous example, let's say you have <= 255 unique values
for the combined field "Company+Division" in the DB and the same for
"Plant" ), and 4 sets of radix to sort over, it doesn't matter if we
are sorting a 32b quantity using a radix of 8b, or a 4 field quantity
using a radix of one field (TBF, we should use key and index
techniques to minimize the amount of actual data we retrieve and
manipulate from disk during the sort.  We want to minimize disk I/O,
particularly seeking disk I/O).  The situations are analogous.

Note TANSTAAFL (as Heinlein would've said):  We have to use extra
space for mapping the radix values to the radix keys, and our "inner
loop" for the sorting operation is considerably more complicated than
that for a quicksort or a mergesort.  Hence the fact that even though
this is an O(n) technique, in real world terms you can expect only a
2-3X performance improvement over say quicksort for realistic amounts
of data.

Oh and FTR, IME for most "interesting" sorts in the DB domain, even
for "internal" sorting techniques, the time to read data from disk and
(possibly) write results back to disk tends to dominate the time spent
actually doing the internal sort...

I learned tricks like this 20 years ago.  I thought this stuff was
part of general lore in the DB field... *shrug*



pgsql-hackers by date:

Previous
From: "Ron Peacetree"
Date:
Subject: Re: Anyone working on better transaction locking?
Next
From: "Ron Peacetree"
Date:
Subject: Re: Anyone working on better transaction locking?