Re: No merge sort? - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: No merge sort? |
Date | |
Msg-id | D90A5A6C612A39408103E6ECDD77B829408ACF@voyager.corporate.connx.com |
In response to | No merge sort? (Taral <taral@taral.net>) |
List | pgsql-hackers |
> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> Sent: Friday, April 11, 2003 5:02 PM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] No merge sort?
>
> "Dann Corbit" <DCorbit@connx.com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408ACB@voyager.corporate.connx.com...
> > > From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> > > Sent: Wednesday, April 09, 2003 12:38 PM
> > > You only need as many bins as you have unique key values, silly
> > > ;-) Remember, at its core Radix sort is just a distribution
> > > counting sort (that's the name I learned for the general
> > > technique). The simplest implementation uses bits as the atomic
> > > unit, but there's nothing saying you have to... In a DB, we
> > > know all the values of the fields we currently have stored in
> > > the DB. We can take advantage of that.
> >
> > By what magic do we know this? If a database knew a-priori what all
> > the distinct values were, it would indeed be excellent magic.
>
> For any table already in the DB, this is self evident. If you are
> talking about sorting data =before= it gets put into the DB (say for
> a load), then things are different, and the best you can do is use
> comparison based methods (and probably some reasonably sophisticated
> external merging routines in addition if the data set to be sorted
> and loaded is big enough). The original question as I understood it
> was about a sort as part of a query. That means everything to be
> sorted is in the DB, and we can take advantage of what we know.

The database does not know all the distinct values that it contains.

> > With 80 bytes, you have 2^320 possible values. There is no way
> > around that. If you are going to count them or use them as a radix,
> > you will have to classify them. The only way you will know how many
> > unique values you have in "Company+Division" is to ...
> > Either sort them or by some means discover all that are distinct.
>
> Ummm, Indexes, particularly Primary Key Indexes, anyone?
> Finding the unique values in an index should be perceived as
> trivial... ...and you often have the index in memory for
> other reasons already.

If we have a unique clustered primary key index, why bother to sort?
The ORDER BY will be faster without sorting.

> > The database does not know how many there are beforehand. Indeed,
> > there could be anywhere from zero to 2^320 (given enough space)
> > distinct values.
> >
> > I would be interested to see your algorithm that performs a
> > counting or radix sort on 320 bit bins and that works in the
> > general case without using extra space.
>
> 1= See below. I clearly stated that we need to use extra space.
> 2= If your definition of "general" is "we know nothing about the
> data", then of course any method based on using more sophisticated
> ordering operators than comparisons is severely hampered, if not
> doomed. This is not a comparison based method. You have to know some
> things about the data being sorted before you can use it. I've been
> =very= clear about that throughout this.
>
> > > Note TANSTAAFL (as Heinlein would've said): We have to use extra
> > > space for mapping the radix values to the radix keys, and our
> > > "inner loop" for the sorting operation is considerably more
> > > complicated than that for a quicksort or a mergesort. Hence the
> > > fact that even though this is an O(n) technique, in real world
> > > terms you can expect only a 2-3X performance improvement over,
> > > say, quicksort for realistic amounts of data.
> > >
> > > Oh and FTR, IME for most "interesting" sorts in the DB domain,
> > > even for "internal" sorting techniques, the time to read data
> > > from disk and (possibly) write results back to disk tends to
> > > dominate the time spent actually doing the internal sort...
> Since you don't believe me, and have implied you wouldn't believe me
> even if I posted results of efforts on my part, go sort some 2GB
> files by implementing the algorithms in the sources I've given and as
> mentioned in this thread. (I'm assuming that you have access to
> "real" machines that can perform this as an internal sort, or we
> wouldn't be bothering to have this discussion). Then come back to the
> table if you still think there are open issues to be discussed.

I don't believe you, and I am not going to try to code a sort that does
not work [and is demonstrably impossible]. But I will tell you what I
will do. I will post the binary for a Win32 sorting routine that I
wrote which will sort files of any size. If you can write a tool that
sorts faster (and does not rely on a primary key, which would render
the task of sorting moot), then I will believe you.

Here is the routine:
ftp://cap.connx.com/chess-engines/new-approach/External.exe

The syntax to sort a text file with it is:
External <input_file> <output_file> <internal_sorting_buffer_size>

So (for instance) to sort a 100 gigabyte file with 1 gig of physical
memory allowed for the sort, the syntax would be:
External 100gig.dat 100gig.sor 1000000000
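For reference, Peacetree's "as many bins as you have unique key values" idea is the classic distribution (counting) sort. A minimal sketch, assuming the set of distinct key values is already known up front (e.g., enumerated from an index); all names here are illustrative, not anything from PostgreSQL:

```python
def distribution_sort(rows, distinct_keys, key=lambda r: r):
    """Distribution (counting) sort: O(n + k) time where k is the
    number of distinct key values, which must be known in advance
    (e.g., enumerated from an index)."""
    # Map each distinct key value to its rank in sorted order.
    rank = {v: i for i, v in enumerate(sorted(distinct_keys))}
    counts = [0] * len(rank)
    for r in rows:                        # first pass: count per bin
        counts[rank[key(r)]] += 1
    offsets = [0] * len(rank)             # prefix sums -> bin start offsets
    for i in range(1, len(rank)):
        offsets[i] = offsets[i - 1] + counts[i - 1]
    out = [None] * len(rows)              # the extra O(n) space
    for r in rows:                        # second pass: place rows (stable)
        b = rank[key(r)]
        out[offsets[b]] = r
        offsets[b] += 1
    return out
```

Note the two requirements debated in the thread: the distinct values must be known beforehand, and an output array's worth of extra space is used.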
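The TANSTAAFL trade-off quoted above (extra space for bins and a heavier inner loop, in exchange for linear passes) shows up directly in a least-significant-digit radix sort over fixed-width keys. A hedged sketch under that fixed-width assumption; the function name is hypothetical:

```python
def lsd_radix_sort(keys, width):
    """LSD radix sort over fixed-width byte strings: one stable pass
    per byte position, O(width * n) time, O(n + 256) extra space --
    the TANSTAAFL cost described in the quoted message."""
    for pos in range(width - 1, -1, -1):  # least significant byte first
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[k[pos]].append(k)     # stable within each bucket
        keys = [k for b in buckets for k in b]
    return keys
```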
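Corbit's External <input_file> <output_file> <internal_sorting_buffer_size> interface matches the classic external merge sort pattern: sort buffer-sized runs in memory, spill them to temporary files, then k-way merge the runs. A rough sketch of that general structure only; this is not the actual External.exe implementation, whose internals aren't shown in the thread:

```python
import heapq
import os
import tempfile

def external_sort(input_file, output_file, buffer_bytes):
    """Sort a text file line by line using roughly buffer_bytes of
    memory: sorted runs are spilled to temp files, then merged."""
    runs = []
    with open(input_file) as f:
        while True:
            chunk, size = [], 0
            for line in f:                # fill the in-memory buffer
                chunk.append(line)
                size += len(line)
                if size >= buffer_bytes:
                    break
            if not chunk:
                break
            chunk.sort()                  # sort one run in memory
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)         # spill the run to disk
            run.seek(0)
            runs.append(run)
    with open(output_file, "w") as out:
        out.writelines(heapq.merge(*runs))  # k-way merge of sorted runs
    for run in runs:
        run.close()
        os.unlink(run.name)
```

A real implementation would add multi-pass merging when the number of runs exceeds available file handles, but the buffer/spill/merge shape is the same one the External command line exposes.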