Re: No merge sort? - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: No merge sort? |
Date | |
Msg-id | D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.com Whole thread Raw |
In response to | No merge sort? (Taral <taral@taral.net>) |
List | pgsql-hackers |
> -----Original Message----- > From: Ron Peacetree [mailto:rjpeace@earthlink.net] > Sent: Tuesday, April 08, 2003 3:07 PM > To: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] No merge sort? > > > ""Dann Corbit"" <DCorbit@connx.com> wrote in message > news:D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.co > m... > > There can be a problem here Distribution & Radix sorts. Imagine the > following query: > > SELECT company, division, plant, sum(revenue) FROM corporate GROUP > BY > > company, division, plant ORDER BY company, division, plant > > > > Let's suppose further that company, division, and plant are all char > 20. > > For a given section of company + division, the first 40 characters > will > > be the same. So a radix or distribution sort will need 40 passes > over > > the identical data before it even gets to the differentiating field > > plant. > > > 1= Distribution sort does not have this problem. Any data > that can be constrained by the "pigeon hole principle" can be > sorted in o(n) time. Distribution sort is not an algorithm. It is a general technique. Both counting sort and radix sort are distribution sorts. I think you are talking about counting sort here. In order to insert a technique into a database, you must solve the general case. >2= For Radis sort, that's iff ("if and > only if") you use =characters= as the radix of a radix sort > and do a MSB aka partition-exchange sort. The appropriate > radix here is a =field= not a character. Since there are 3 > fields vs 60 characters the problem becomes 2/3 wasted passes > instead of 40/60. This is lunacy. If you count the field or use it as a radix, you will need a radix of 40*8 bits = 320 bits. That means you will have 2^320 = 2.136e96 bins. >We can do even better by making one field for > company+division and one for plant. Now we have no wasted passes and > an O(n) sort. Since PostgreSQL has wonderful built in > support for string handling and regexp, this is clearly an > easy to implement and superior approach. You have no idea what you are talking about. > Since a comparison based sorting algorithm would be best > implemented by working on the same atomic units, the fair > comparison becomes the > log(n!) behavior of that sort vs the O(n) behavior of > Distribution or Radix sort on data of that atomic size. The > constants involved in each ordering operation are bigger for > the two O(n) sorts, but despite that you'll find that for > reasonable sizes of real world data, you'll complete the O(n) > sorts 2-3X faster. With 2e96 * 4 space requirement (assuming that 4 billion is the largest row count allowed in the database). If we could build a memory that large, how long would it take to reach the last bin? There are an estimated 10^80 elementary particles in the universe. > Of course, internal sorting performance is vastly dominated > by disk I/O costs if there is even a moderate amount disk I/O > involved. However, there are ways to combine decent internal > sorting with efficient disk merging, getting the best > possible out of both subsystems. You are confusing terms here. An internal sort is performed 100% in memory. It is an external sort that uses disk space. > > If they are fixed width, we could use LSD radix sort and count all > the > > bins in one pass. But that technique will fail with varchar. > > Naturally, most long character fields will be varchar. The basic > upshot > > is that radix and distribution sorts work best with short keys. > > > More accurate would be say that such searches work best with > keys that have as little in common as possible to each other. > Length isn't the problem, =the number of unique keys= and > having few equal keys is. Large quantities of equal keys can > trash the performance of many sorting algorithms, including > quicksort, unless steps are taken to either modify the > algorithm to handle it or the data to minimize such an occurrence. I would guess that you have never done any sorting work. > ---------------------------(end of > broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to > majordomo@postgresql.org >
pgsql-hackers by date: