Re: No merge sort? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: No merge sort?
Date
Msg-id D90A5A6C612A39408103E6ECDD77B829408ACF@voyager.corporate.connx.com
In response to No merge sort?  (Taral <taral@taral.net>)
List pgsql-hackers
> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> Sent: Friday, April 11, 2003 5:02 PM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] No merge sort?
>
>
> "Dann Corbit" <DCorbit@connx.com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408ACB@voyager.corporate.connx.com...
> > > From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> > > Sent: Wednesday, April 09, 2003 12:38 PM
> > > You only need as many bins as you have unique key values, silly
> > > ;-)  Remember, at its core Radix sort is just a distribution
> > > counting sort (that's the name I learned for the general
> > > technique).  The simplest implementation uses bits as the atomic
> > > unit, but there's nothing saying you have to...  In a DB, we
> > > know all the values of the fields we currently have stored in
> > > the DB.  We can take advantage of that.
> >
> > By what magic do we know this?  If a database knew a-priori what all
> the
> > distinct values were, it would indeed be excellent magic.
> For any table already in the DB, this is self-evident.  If
> you are talking about sorting data =before= it gets put into
> the DB (say for a load), then things are different, and the
> best you can do is use comparison-based methods (and
> probably some reasonably sophisticated external merging
> routines in addition if the data set to be sorted and loaded
> is big enough).  The original question as I understood it was
> about a sort as part of a query.  That means everything to be
> sorted is in the DB, and we can take advantage of what we know.

The database does not know all the distinct values that it contains.
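For concreteness, the "distribution counting sort" being argued over can be sketched as below. This is an illustrative sketch only (no code appears in the thread), and the `key_universe` parameter is the crux of the disagreement: the algorithm only works if the complete set of possible key values is known before the sort begins.

```python
def counting_sort(rows, key, key_universe):
    """Distribute rows into one bucket per known key value, then
    concatenate the buckets in key order.  Comparison-free and O(n),
    but only valid when key_universe is known up front."""
    position = {k: i for i, k in enumerate(sorted(key_universe))}
    buckets = [[] for _ in key_universe]
    for row in rows:
        buckets[position[key(row)]].append(row)   # single O(n) distribution pass
    return [row for bucket in buckets for row in bucket]

print(counting_sort(["b", "a", "c", "a"], key=lambda k: k,
                    key_universe={"a", "b", "c"}))   # ['a', 'a', 'b', 'c']
```

Note that the extra space for `position` and `buckets` is exactly the "TANSTAAFL" cost conceded elsewhere in this exchange.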

> > With 80 bytes, you have 2^320 possible values.  There is no way
> > around that.  If you are going to count them or use them as a
> > radix, you will have to classify them.  The only way you will
> > know how many unique values you have in "Company+Division" is
> > to either sort them or by some means discover all that are
> > distinct.
> Ummm, Indexes, particularly Primary Key Indexes, anyone?
> Finding the unique values in an index should be perceived as
> trivial... ...and you often have the index in memory for
> other reasons already.

If we have a unique clustered primary key index, why bother to sort?
The ORDER BY will be faster without a separate sort step.
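As a toy illustration of that point (hypothetical, not from the thread): if a table is clustered on its primary key, i.e. rows are physically kept in key order on insert, then an ORDER BY on that key is a plain sequential scan with no sort at all:

```python
import bisect

clustered = []  # toy "clustered index": (key, payload) rows kept in key order

for row in [(3, "c"), (1, "a"), (2, "b")]:
    bisect.insort(clustered, row)            # order is maintained at insert time

ordered = [payload for _key, payload in clustered]  # "ORDER BY key" is a free scan
print(ordered)   # ['a', 'b', 'c']
```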

> > The database does not know how many there are beforehand.
> > Indeed, there could be anywhere from zero to 2^320 (given
> > enough space) distinct values.
> >
> > I would be interested to see your algorithm that performs a
> > counting or radix sort on 320 bit bins and that works in the
> > general case without using extra space.
> >
> 1= See below.  I clearly stated that we need to use extra space.
> 2= If your definition of "general" is "we know nothing about the
> data", then of course any method based on using more sophisticated
> ordering operators than comparisons is severely hampered, if not
> doomed.  This is not a comparison based method.  You have to know
> some things about the data being sorted before you can use it.
> I've been =very= clear about that throughout this.
>
>
> > > Note TANSTAAFL (as Heinlein would've said):  We have to use
> > > extra space for mapping the radix values to the radix keys, and
> > > our "inner loop" for the sorting operation is considerably more
> > > complicated than that for a quicksort or a mergesort.  Hence
> > > the fact that even though this is an O(n) technique, in real
> > > world terms you can expect only a 2-3X performance improvement
> > > over, say, quicksort for realistic amounts of data.
> > >
> > > Oh and FTR, IME for most "interesting" sorts in the DB domain,
> > > even for "internal" sorting techniques, the time to read data
> > > from disk and (possibly) write results back to disk tends to
> > > dominate the time spent actually doing the internal sort...
> > >
> Since you don't believe me, and have implied you wouldn't
> believe me even if I posted results of efforts on my part, go
> sort some 2GB files by implementing the algorithms in the
> sources I've given and as mentioned in this thread.  (I'm
> assuming that you have access to "real" machines that can
> perform this as an internal sort, or we wouldn't be bothering
> to have this discussion). Then come back to the table if you
> still think there are open issues to be discussed.

I don't believe you and I am not going to try to code a sort that does
not work [and is demonstrably impossible].  But I will tell you what I
will do.  I will post the binary for a Win32 sorting routine that I
wrote which will sort files of any size.  If you can write a tool that
sorts faster (and does not rely on a primary key, which would render
the task of sorting moot) then I will believe you.
Here is the routine:
ftp://cap.connx.com/chess-engines/new-approach/External.exe

The syntax to sort a text file with it is:
External <input_file> <output_file> <internal_sorting_buffer_size>

So (for instance) to sort a 100 gigabyte file with 1 gig of physical
memory allowed for the sort, the syntax would be:
External 100gig.dat 100gig.sor 1000000000
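For reference, a tool like this would typically be built on a classic external merge sort: sort buffer-sized runs in memory, spill each run to a temporary file, then k-way merge the runs. The sketch below is a hypothetical illustration in that spirit; the actual implementation of External.exe is not shown in the thread.

```python
import heapq
import itertools
import tempfile

def external_sort(infile, outfile, buffer_lines):
    """Sort a text file line by line while holding at most
    buffer_lines lines in memory at once."""
    runs = []
    with open(infile) as f:
        while True:
            chunk = list(itertools.islice(f, buffer_lines))
            if not chunk:
                break
            chunk.sort()                        # in-memory sort of one run
            run = tempfile.TemporaryFile("w+")
            run.writelines(chunk)               # spill the sorted run to disk
            run.seek(0)
            runs.append(run)
    with open(outfile, "w") as out:
        out.writelines(heapq.merge(*runs))      # lazy k-way merge of all runs
    for run in runs:
        run.close()
```

For a 100 gigabyte input, `buffer_lines` would be chosen so one run fits in the 1 gig allowed on the command line above; `heapq.merge` streams the runs, so the merge phase keeps only one line per run in memory.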


