Re: No merge sort? - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: No merge sort?
Date
Msg-id D90A5A6C612A39408103E6ECDD77B829408AC6@voyager.corporate.connx.com
Whole thread Raw
In response to No merge sort?  (Taral <taral@taral.net>)
List pgsql-hackers
> -----Original Message-----
> From: Ron Peacetree [mailto:rjpeace@earthlink.net]
> Sent: Tuesday, April 08, 2003 3:07 PM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] No merge sort?
>
>
> ""Dann Corbit"" <DCorbit@connx.com> wrote in message
> news:D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.co
> m...
> > There can be a problem here Distribution & Radix sorts.  Imagine the
> following query:
> > SELECT company, division, plant, sum(revenue) FROM corporate GROUP
> BY
> > company, division, plant ORDER BY company, division, plant
> >
> > Let's suppose further that company, division, and plant are all char
> 20.
> > For a given section of company + division, the first 40 characters
> will
> > be the same.  So a radix or distribution sort will need 40 passes
> over
> > the identical data before it even gets to the differentiating field
> > plant.
> >
> 1= Distribution sort does not have this problem.  Any data
> that can be constrained by the "pigeon hole principle" can be
> sorted in o(n) time.

Distribution sort is not an algorithm.  It is a general technique.  Both
counting sort and radix sort are distribution sorts.  I think you are
talking about counting sort here.  In order to insert a technique into a
database, you must solve the general case.

>2= For Radis sort, that's iff ("if and
> only if") you use =characters= as the radix of a radix sort
> and do a MSB aka partition-exchange sort. The appropriate
> radix here is a =field= not a character.  Since there are 3
> fields vs 60 characters the problem becomes 2/3 wasted passes
> instead of 40/60.

This is lunacy.  If you count the field or use it as a radix, you will
need a radix of 40*8 bits = 320 bits.  That means you will have 2^320 =
2.136e96 bins.

>We can do even better by making one field for
> company+division and one for plant.  Now we have no wasted passes and
> an O(n) sort.  Since PostgreSQL has wonderful built in
> support for string handling and regexp, this is clearly an
> easy to implement and superior approach.

You have no idea what you are talking about.
> Since a comparison based sorting algorithm would be best
> implemented by working on the same atomic units, the fair
> comparison becomes the
> log(n!) behavior of that sort vs the O(n) behavior of
> Distribution or Radix sort on data of that atomic size.  The
> constants involved in each ordering operation are bigger for
> the two O(n) sorts, but despite that you'll find that for
> reasonable sizes of real world data, you'll complete the O(n)
> sorts 2-3X faster.

With 2e96 * 4 space requirement (assuming that 4 billion is the largest
row count allowed in the database).  If we could build a memory that
large, how long would it take to reach the last bin?  There are an
estimated 10^80 elementary particles in the universe.
> Of course, internal sorting performance is vastly dominated
> by disk I/O costs if there is even a moderate amount disk I/O
> involved.  However, there are ways to combine decent internal
> sorting with efficient disk merging, getting the best
> possible out of both subsystems.

You are confusing terms here.  An internal sort is performed 100% in
memory.  It is an external sort that uses disk space.

> > If they are fixed width, we could use LSD radix sort and count all
> the
> > bins in one pass.  But that technique will fail with varchar.
> > Naturally, most long character fields will be varchar.  The basic
> upshot
> > is that radix and distribution sorts work best with short keys.
> >
> More accurate would be say that such searches work best with
> keys that have as little in common as possible to each other.
>  Length isn't the problem, =the number of unique keys= and
> having few equal keys is. Large quantities of equal keys can
> trash the performance of many sorting algorithms, including
> quicksort, unless steps are taken to either modify the
> algorithm to handle it or the data to minimize such an occurrence.

I would guess that you have never done any sorting work.
> ---------------------------(end of
> broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to
> majordomo@postgresql.org
>



pgsql-hackers by date:

Previous
From: cbbrowne@cbbrowne.com
Date:
Subject: Re: Anyone working on better transaction locking?
Next
From: Tom Lane
Date:
Subject: More thoughts about FE/BE protocol