Re: No merge sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: No merge sort?
Date
Msg-id EHIla.19833$ey1.1702921@newsread1.prod.itd.earthlink.net
Whole thread Raw
In response to Re: No merge sort?  ("Dann Corbit" <DCorbit@connx.com>)
Responses Re: No merge sort?  (cbbrowne@cbbrowne.com)
List pgsql-hackers
""Dann Corbit"" <DCorbit@connx.com> wrote in message
news:D90A5A6C612A39408103E6ECDD77B829408ACB@voyager.corporate.connx.co
m...
> > From: Ron Peacetree [mailto:rjpeace@earthlink.net]=20
> > Sent: Wednesday, April 09, 2003 12:38 PM
> > You only need as many bins as you have unique key values=20
> > silly ;-) Remember, at its core Radix sort is just a=20
> > distribution counting sort (that's the name I learned for the=20
> > general technique).  The simplest implementation uses bits as=20
> > the atomic unit, but there's nothing saying you have to...=20=20
> > In a DB, we know all the values of the fields we currently=20
> > have stored in the DB.  We can take advantage of that.
>
> By what magic do we know this?  If a database knew a-priori what all
the
> distinct values were, it would indeed be excellent magic.
For any table already in the DB, this is self evident.  If you are
talking about sorting data =before= it gets put into the DB (say for a
load), then things are different, and the best you can do is used
comparison based methods (and probably some reasonably sophisticated
external merging routines in addition if the data set to be sorted and
loaded is big enough).  The original question as I understood it was
about a sort as part of a query.  That means everything to be sorted
is in the DB, and we can take advantage of what we know.


> With 80 bytes, you have 2^320 possible values.  There is no way
> around that.  If you are going to count them or use them as a
> radix, you will have to classify them.  The only way you will
> know how many unique values you have in
> "Company+Division" is to ...
> Either sort them or by some means discover all that are distinct
Ummm, Indexes, particularly Primary Key Indexes, anyone?  Finding the
unique values in an index should be perceived as trivial... ...and you
often have the index in memory for other reasons already.


> .  The database does not know how many there are
> beforehand.  Indeed, there could be anywhere
> from zero to 2^320 (given enough space) distinct values.
>
> I would be interested to see your algorithm that
> performs a counting or radix sort on 320 bit bins and that
> works in the general case without using extra space.
>
1= See below.  I clearly stated that we need to use extra space
2= If your definition of "general" is "we know nothing about the
data", then of course any method based on using more sophisticated
ordering operators than comparisons is severely hampered, if not
doomed.  This is not a comparison based method.  You have to know some
things about the data being sorted before you can use it.  I've been
=very= clear about that throughout this.


> > Note TANSTAAFL (as Heinlein would've said):  We have to use=20
> > extra space for mapping the radix values to the radix keys,=20
> > and our "inner loop" for the sorting operation is=20
> > considerably more complicated than that for a quicksort or a=20
> > mergesort.  Hence the fact that even though this is an O(n)=20
> > technique, in real world terms you can expect only a 2-3X=20
> > performance improvement over say quicksort for realistic=20
> > amounts of data.
> >=20
> > Oh and FTR, IME for most "interesting" sorts in the DB=20
> > domain, even for "internal" sorting techniques, the time to=20
> > read data from disk and
> > (possibly) write results back to disk tends to dominate the=20
> > time spent actually doing the internal sort...
> >=20
Since you don't believe me, and have implied you wouldn't believe me
even if I posted results of efforts on my part, go sort some 2GB files
by implementing the algorithms in the sources I've given and as
mentioned in this thread.  (I'm assuming that you have access to
"real" machines that can perform this as an internal sort, or we
wouldn't be bothering to have this discussion). Then come back to the
table if you still think there are open issues to be discussed.



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Recovery tools
Next
From: Stu Krone
Date:
Subject: Re: How do you execute a postgresql function from perl?