Re: No merge sort? - Mailing list pgsql-hackers
From | Dann Corbit |
---|---|
Subject | Re: No merge sort? |
Date | |
Msg-id | D90A5A6C612A39408103E6ECDD77B829408AC4@voyager.corporate.connx.com Whole thread Raw |
In response to | No merge sort? (Taral <taral@taral.net>) |
List | pgsql-hackers |
> -----Original Message----- > From: Ron Peacetree [mailto:rjpeace@earthlink.net] > Sent: Monday, April 07, 2003 6:33 PM > To: pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] No merge sort? > > > <cbbrowne@cbbrowne.com> wrote in message > news:20030407202001.1EC3C58E0D@cbbrowne.com... > > Ron Peacetree wrote: > > > AFAIK, there are only 3 general purpose internal sorting > techniques > > > that have O(n) behavior: > > > > Hum? NO "general purpose" sorting technique has O(n) behaviour. > > > > The theoretical best scenario, _in general_, is O(n log n). > The O(log(n!)) bound is only for comparison based sorts > operating on arbitarily disordered input. There are general > techniques that can sort in O(n) time if we can "break" > either assumption. In a DB, we often are dealing with data > that is only slightly disordered, or we are have enough meta > knowledge that we can use more powerful ordering operators > than simple comparisons, or both. > > > > Insertion sort is expected to provide O(n^2) behaviour, and > radix-like > > schemes get arbitrarily memory hungry and have bad pathological > results. > > > None of these comments is accurate. > The sources for the following discussion are > A= Vol 3 of Knuth 2nd ed, (ISBN 0-201-89685-0) > B= Sedgewick's Algorithms in C, (0-201-51425-7) > C= Handbook of Algorithms and Data Structures 2nd ed by > Gonnet and Baeza-Yates. (0-201-41607-7) > > 1= Insertion sort is O(n) for "almost sorted" input. > p103 of Sedgewick. There's also discussion on this in Knuth. > > 2= Distribution Sort and it's "cousins" which use more > powerful ordering operators than simply comparisons are a= > reasonably general, and b= O(n). Look in all three references. > > 3= A proper implementation of straight Radix sort using 8b or > 16b at a time a= is NOT pathological, and b= is O(n). > Sedgewick is the most blunt about it on p142-143, but Knuth > discusses this as well. There can be a problem here with 2 & 3. Imagine the following query: SELECT company, division, plant, sum(revenue) FROM corporate GROUP BY company, division, plant ORDER BY company, division, plant Let's suppose further that company, division, and plant are all char 20. For a given section of company + division, the first 40 characters will be the same. So a radix or distribution sort will need 40 passes over the identical data before it even gets to the differentiating field plant. Now, a comparison sort will be O(n*log(n)). The counting passes alone for MSD radix sort will be 40 passes * N. How large will n have to be before Radix sort outperforms a comparison based sort? 2^40 = 1,099,511,627,776 About one trillion items appears to be break even. If they are fixed width, we could use LSD radix sort and count all the bins in one pass. But that technique will fail with varchar. Naturally, most long character fields will be varchar. The basic upshot is that radix and distribution sorts work best with short keys. > All of the above are stable, which quicksort is not. There > are no "pessimal" inputs for any of the above that will force > worst case behavior. For quicksort there are (they are > =very= unlikely however). In real world terms, if you can use > any of these approaches you should be able to internally sort > your data between 2X and 3X faster. > > > Unfortunately, most of us do not have the luxury of working > with Memory Resident DB's. In the real world, disk access is > an important part of our DB sorting efforts, and that changes > things. Very fast internal sorting routines combined with > multidisk merging algorithms that minimize overall disk I/O > while maximizing the sequential nature of that I/O are the > best we can do overall in such a situation. > > > ---------------------------(end of > broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
pgsql-hackers by date: