Re: No merge sort? - Mailing list pgsql-hackers
From | Ron Peacetree |
Subject | Re: No merge sort? |
Date | |
Msg-id | zKHka.15370$4P1.1343655@newsread2.prod.itd.earthlink.net |
In response to | Re: No merge sort? ("Dann Corbit" <DCorbit@connx.com>) |
List | pgsql-hackers |
""Dann Corbit"" <DCorbit@connx.com> wrote in message news:D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.co m... > There can be a problem here Distribution & Radix sorts. Imagine the following query: > SELECT company, division, plant, sum(revenue) FROM corporate GROUP BY > company, division, plant ORDER BY company, division, plant > > Let's suppose further that company, division, and plant are all char 20. > For a given section of company + division, the first 40 characters will > be the same. So a radix or distribution sort will need 40 passes over > the identical data before it even gets to the differentiating field > plant. > 1= Distribution sort does not have this problem. Any data that can be constrained by the "pigeon hole principle" can be sorted in o(n) time. 2= For Radis sort, that's iff ("if and only if") you use =characters= as the radix of a radix sort and do a MSB aka partition-exchange sort. The appropriate radix here is a =field= not a character. Since there are 3 fields vs 60 characters the problem becomes 2/3 wasted passes instead of 40/60. We can do even better by making one field for company+division and one for plant. Now we have no wasted passes and an O(n) sort. Since PostgreSQL has wonderful built in support for string handling and regexp, this is clearly an easy to implement and superior approach. Since a comparison based sorting algorithm would be best implemented by working on the same atomic units, the fair comparison becomes the log(n!) behavior of that sort vs the O(n) behavior of Distribution or Radix sort on data of that atomic size. The constants involved in each ordering operation are bigger for the two O(n) sorts, but despite that you'll find that for reasonable sizes of real world data, you'll complete the O(n) sorts 2-3X faster. Of course, internal sorting performance is vastly dominated by disk I/O costs if there is even a moderate amount disk I/O involved. However, there are ways to combine decent internal sorting with efficient disk merging, getting the best possible out of both subsystems. > If they are fixed width, we could use LSD radix sort and count all the > bins in one pass. But that technique will fail with varchar. > Naturally, most long character fields will be varchar. The basic upshot > is that radix and distribution sorts work best with short keys. > More accurate would be say that such searches work best with keys that have as little in common as possible to each other. Length isn't the problem, =the number of unique keys= and having few equal keys is. Large quantities of equal keys can trash the performance of many sorting algorithms, including quicksort, unless steps are taken to either modify the algorithm to handle it or the data to minimize such an occurrence.