From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Sep 23, 2005 2:15 PM
Subject: Re: [PERFORM] Releasing memory during External sorting?
>Mark Lewis <mark.lewis@mir3.com> writes:
>> operations != passes. If you were clever, you could probably write a
>> modified bubble-sort algorithm that only made 2 passes. A pass is a
>> disk scan, operations are then performed (hopefully in memory) on what
>> you read from the disk. So there's no theoretical log N lower-bound on
>> the number of disk passes.
>Given infinite memory that might be true, but I don't think I believe it
>for limited memory. If you have room for K tuples in memory then it's
>impossible to perform more than K*N useful comparisons per pass (ie, as
>each tuple comes off the disk you can compare it to all the ones
>currently in memory; anything more is certainly redundant work). So if
>K < logN it's clearly not gonna work.
>
Actually, it's far better than that. I recall a paper I saw in one of the
algorithms journals 15+ years ago that proved that if you knew the range
of the data, regardless of what that range was, and had n^2 space, you
could sort n items in O(n) time.
Turns out that with very modest constraints on the range of the data and
substantially less extra space (about the same as you'd need for
Replacement Selection + External Merge Sort), you can _still_ sort in
O(n) time.
>It's possible that you could design an algorithm that works in a fixed
>number of passes if you are allowed to assume you can hold O(log N)
>tuples in memory --- and in practice that would probably work fine,
>if the constant factor implied by the O() isn't too big. But it's not
>really solving the general external-sort problem.
>
If you know nothing about the data to be sorted and must guard against
the worst possible edge cases, AKA the classic definition of "the general
external sorting problem", then one can't do better than some variant
of Replacement Selection + Unbalanced Multiway Merge.
OTOH, ITRW things are _not_ like that. We know the range of the data
in our DB fields or we can safely assume it to be relatively constrained.
This allows us access to much better external sorting algorithms.
For example Postman Sort (the 2005 winner of the PennySort benchmark)
is basically an IO optimized version of an external Radix Sort.
Ron