Re: No merge sort? - Mailing list pgsql-hackers

From Ron Peacetree
Subject Re: No merge sort?
Date
Msg-id zKHka.15370$4P1.1343655@newsread2.prod.itd.earthlink.net
In response to Re: No merge sort?  ("Dann Corbit" <DCorbit@connx.com>)
List pgsql-hackers
""Dann Corbit"" <DCorbit@connx.com> wrote in message
news:D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.co
m...
> There can be a problem here with Distribution & Radix sorts.  Imagine
> the following query:
>
> SELECT company, division, plant, sum(revenue) FROM corporate
> GROUP BY company, division, plant ORDER BY company, division, plant
>
> Let's suppose further that company, division, and plant are all char 20.
> For a given section of company + division, the first 40 characters will
> be the same.  So a radix or distribution sort will need 40 passes over
> the identical data before it even gets to the differentiating field
> plant.
>
1= Distribution sort does not have this problem.  Any data that can be
constrained by the "pigeon hole principle" can be sorted in O(n) time.
2= For Radix sort, that's true iff ("if and only if") you use =characters=
as the radix and do an MSB, aka partition-exchange, sort.  The
appropriate radix here is a =field=, not a character.  Since there are 3
fields vs 60 characters, the problem becomes 2 wasted passes out of 3
instead of 40 out of 60.  We can do even better by making one field for
company+division and one for plant.  Now we have no wasted passes and an
O(n) sort (see the sketch below).  Since PostgreSQL has wonderful
built-in support for string handling and regexps, this is clearly an
easy-to-implement and superior approach.
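
A minimal sketch of what I mean, assuming each char(20) field has already
been mapped to a small integer code by a dictionary pass (that pass isn't
shown, and all the names here are made up for illustration):

#include <stdlib.h>
#include <string.h>

#define NCODES 1024             /* assumed bound on distinct values per field */

typedef struct
{
    int     company, division, plant;   /* codes standing in for the char(20) fields */
    double  revenue;
} Row;

/* One stable counting-sort pass over rows[0..n-1], keyed by key(). */
static void
counting_pass(Row *rows, Row *tmp, size_t n, int (*key) (const Row *))
{
    size_t  count[NCODES + 1] = {0};

    for (size_t i = 0; i < n; i++)
        count[key(&rows[i]) + 1]++;
    for (int c = 0; c < NCODES; c++)    /* prefix sums -> starting offsets */
        count[c + 1] += count[c];
    for (size_t i = 0; i < n; i++)      /* stable: preserves prior field order */
        tmp[count[key(&rows[i])]++] = rows[i];
    memcpy(rows, tmp, n * sizeof(Row));
}

static int by_company(const Row *r)  { return r->company; }
static int by_division(const Row *r) { return r->division; }
static int by_plant(const Row *r)    { return r->plant; }

void
sort_rows(Row *rows, size_t n)
{
    Row    *tmp = malloc(n * sizeof(Row));

    if (tmp == NULL)
        return;                         /* error handling skipped for brevity */
    counting_pass(rows, tmp, n, by_plant);      /* least significant field first */
    counting_pass(rows, tmp, n, by_division);
    counting_pass(rows, tmp, n, by_company);    /* most significant field last */
    free(tmp);
}

Each pass is stable, so after the company pass the rows end up ordered by
(company, division, plant) having touched the data only three times.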

Since a comparison-based sorting algorithm would be best implemented by
working on the same atomic units, the fair comparison becomes the
log(n!) behavior of that sort vs the O(n) behavior of Distribution or
Radix sort on data of that atomic size.  The constants involved in each
ordering operation are bigger for the two O(n) sorts, but despite that,
for reasonable sizes of real-world data you'll complete the O(n) sorts
2-3X faster.
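
To put rough numbers on that (illustrative arithmetic, not measurements):
for n = 1,000,000 rows, log2(n!) works out to roughly 18.5 million
comparisons, each over keys of up to 60 bytes, while the field-as-radix
approach makes 3 passes that each touch the 1 million rows once, on the
order of 3 million bucket operations.  Even with a noticeably larger
constant per bucket operation, that gap in operation count is where the
2-3X comes from.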

Of course, the internal sort's performance advantage is vastly dominated
by disk I/O costs if there is even a moderate amount of disk I/O
involved.  However, there are ways to combine decent internal sorting
with efficient disk merging, getting the best possible out of both
subsystems.
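
For illustration, here's a bare-bones sketch of the merge side of that
combination.  It assumes the sorted runs already exist; in the real
external case each run would be a buffered file produced by the internal
sort rather than an in-memory array, and every name here is invented:

#include <stdio.h>
#include <string.h>

typedef struct
{
    const char **rows;          /* run contents, already sorted */
    size_t      len, pos;       /* run length and current read position */
} Run;

/* Repeatedly emit the smallest current row among all runs. */
void
merge_runs(Run *runs, int nruns, FILE *out)
{
    for (;;)
    {
        int     best = -1;

        for (int i = 0; i < nruns; i++)
        {
            if (runs[i].pos == runs[i].len)
                continue;       /* this run is exhausted */
            if (best < 0 ||
                strcmp(runs[i].rows[runs[i].pos],
                       runs[best].rows[runs[best].pos]) < 0)
                best = i;
        }
        if (best < 0)
            break;              /* all runs exhausted */
        fprintf(out, "%s\n", runs[best].rows[runs[best].pos++]);
    }
}

A linear scan like this is fine for a handful of runs; once the number of
runs grows, a heap (priority queue) takes over the job of picking the
next row.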


> If they are fixed width, we could use LSD radix sort and count all the
> bins in one pass.  But that technique will fail with varchar.
> Naturally, most long character fields will be varchar.  The basic
> upshot is that radix and distribution sorts work best with short keys.
>
More accurate would be to say that such sorts work best with keys that
have as little in common with each other as possible.  Length isn't the
problem; =the number of unique keys= is, together with having few equal
keys.  Large quantities of equal keys can trash the performance of many
sorting algorithms, including quicksort, unless steps are taken either
to modify the algorithm to handle them or the data to minimize such an
occurrence.
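
One standard step of that kind is three-way ("fat pivot") partitioning,
which groups the keys equal to the pivot in the middle so they are never
looked at again.  A minimal sketch on integer keys (illustrative only,
not PostgreSQL code):

#include <stddef.h>

static void
swap_int(int *a, int *b)
{
    int     t = *a;

    *a = *b;
    *b = t;
}

/* Quicksort with three-way partitioning; sorts a[0..n-1]. */
void
quicksort3(int *a, size_t n)
{
    if (n < 2)
        return;

    int     pivot = a[0];
    size_t  lt = 0, i = 1, gt = n - 1;

    while (i <= gt)
    {
        if (a[i] < pivot)
            swap_int(&a[i++], &a[lt++]);
        else if (a[i] > pivot)
            swap_int(&a[i], &a[gt--]);
        else
            i++;
    }
    /* now a[0..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..n-1] > pivot */
    quicksort3(a, lt);
    quicksort3(a + gt + 1, n - gt - 1);
}

With this change a column that is almost entirely one value sorts in
close to linear time instead of going quadratic.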


