Re: No merge sort? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: No merge sort?
Date
Msg-id D90A5A6C612A39408103E6ECDD77B829408ACB@voyager.corporate.connx.com
Whole thread Raw
In response to No merge sort?  (Taral <taral@taral.net>)
List pgsql-hackers
> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> Sent: Wednesday, April 09, 2003 12:38 PM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] No merge sort?
>
>
> ""Dann Corbit"" <DCorbit@connx.com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.co
> m...
> > Distribution sort is not an algorithm.  It is a general technique.
> Both
> > counting sort and radix sort are distribution sorts.  I think you
> are
> > talking about counting sort here.  In order to insert a technique
> into a
> > database, you must solve the general case.
> >
> Terminology problem.  Please look in the references I posted.
>
>
> > > 2= For Radix sort, that's iff ("if and only if") you use
> > > =characters= as the radix of a radix sort and do a MSB aka
> > > partition-exchange sort. The appropriate radix here is a
> =3Dfield=3D
> > > not a character.  Since there are 3 fields vs 60 characters the
> > > problem becomes 2/3 wasted passes instead of 40/60.
> >
> > This is lunacy.  If you count the field or use it as a radix, you
> will
> > need a radix of 40*8 bits= 320 bits.  That means you will
> have 2^320
> > 2.136e96 bins.
> >
> You only need as many bins as you have unique key values
> silly ;-) Remember, at its core Radix sort is just a
> distribution counting sort (that's the name I learned for the
> general technique).  The simplest implementation uses bits as
> the atomic unit, but there's nothing saying you have to...
> In a DB, we know all the values of the fields we currently
> have stored in the DB.  We can take advantage of that.

By what magic do we know this?  If a database knew a-priori what all the
distinct values were, it would indeed be excellent magic.
> As you correctly implied, the goal is to minimize the number
> of unique bins you have to, err, distribute, items into.
> That and having as few duplicates as feasible are the
> important points.
>
> If, for example, we have <= 255 possible values for the each
> radix (Using your previous example, let's say you have <= 255
> unique values for the combined field "Company+Division" in
> the DB and the same for "Plant" ), and 4 sets of radix to
> sort over, it doesn't matter if we are sorting a 32b quantity
> using a radix of 8b, or a 4 field quantity using a radix of
> one field (TBF, we should use key and index techniques to
> minimize the amount of actual data we retrieve and manipulate
> from disk during the sort.  We want to minimize disk I/O,
> particularly seeking disk I/O).  The situations are analogous.

With 80 bytes, you have 2^320 possible values.  There is no way around
that.  If you are going to count them or use them as a radix, you will
have to classify them.  The only way you will know how many unique
values you have in "Company+Division" is to ... Either sort them or by
some other means discover all that are distinct.  The database does not
know how many there are beforehand.  Indeed, there could be anywhere
from zero to 2^320 (given enough space) distinct values.

I would be interested to see your algorithm that performs a counting or
radix sort on 320 bit bins and that works in the general case without
using extra space.

> Note TANSTAAFL (as Heinlein would've said):  We have to use
> extra space for mapping the radix values to the radix keys,
> and our "inner loop" for the sorting operation is
> considerably more complicated than that for a quicksort or a
> mergesort.  Hence the fact that even though this is an O(n)
> technique, in real world terms you can expect only a 2-3X
> performance improvement over say quicksort for realistic
> amounts of data.
>
> Oh and FTR, IME for most "interesting" sorts in the DB
> domain, even for "internal" sorting techniques, the time to
> read data from disk and
> (possibly) write results back to disk tends to dominate the
> time spent actually doing the internal sort...
>
> I learned tricks like this 20 years ago.  I thought this
> stuff was part of general lore in the DB field... *shrug*

As my grandfather used to say:
"Picklesmoke."

> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>



pgsql-hackers by date:

Previous
From: Sean Chittenden
Date:
Subject: Re: [GENERAL] medical image on postgreSQL?
Next
From: "scott.marlowe"
Date:
Subject: Re: [GENERAL] medical image on postgreSQL?