On Tue, Mar 20, 2012 at 5:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> No. It does the opposite: it slows it down. This is a highly
> surprising result but it's quite repeatable: removing comparisons
> makes it slower. As previously pontificated, I think this is probably
> because the heap can fill up with next-run tuples that are cheap to
> compare against, and that spares us having to do "real" comparisons
> involving the actual datatype comparators.
Frankly, that analysis didn't make any sense to me at the time.
Comparing integers is fast, sure, but it's still slower than not
having to do any comparison at all.

It's just barely conceivable that it's a cache effect on the *code*,
but even that seems pretty darned unlikely to me. My money currently
is on a measurement error, but that's just a guess since I haven't
tried to replicate any of your results.
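To be concrete about the claim being tested: in replacement selection
the heap is kept ordered on (run number, key), so a comparison between
a current-run tuple and a next-run tuple is settled by a cheap integer
test and never reaches the datatype comparator. Roughly like this
sketch (the names are illustrative, not the actual tuplesort code):

    /* Heap comparator for replacement selection: tuples already
     * assigned to the next run sort after everything in the current
     * run, so cross-run comparisons are decided by an integer test
     * alone.  Illustrative names, not the real tuplesort code. */
    typedef struct
    {
        int   run;      /* run number this tuple is destined for */
        void *datum;    /* the actual sort key */
    } SortItem;

    static int
    heap_compare(const SortItem *a, const SortItem *b,
                 int (*datum_cmp)(const void *, const void *))
    {
        if (a->run != b->run)
            return (a->run < b->run) ? -1 : 1;  /* cheap integer test */
        return datum_cmp(a->datum, b->datum);   /* "real" comparison */
    }

But my objection stands either way: skipping that integer test
entirely ought to be cheaper still than performing it.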
>
>> BTW, there's a link at the bottom of the wikipedia page to a very
>> interesting ACM Queue article, which argues that the binary-tree
>> data structure isn't terribly well suited to virtual memory because
>> it touches random locations in succession. I'm not sure I believe
>> his particular solution, but I'm wondering about B+ trees, ie more
>> than 2 children per node.
>
> I don't think virtual memory locality is the problem. I read
> somewhere that a ternary heap is supposed to be about one-eighth
> faster than a binary heap. That's a comparison-count effect:
> descending a binary heap picks the smallest of three tuples (parent
> plus two children) with two comparisons per level, while a ternary
> heap picks the smallest of four with three comparisons per level,
> and since each ternary level shrinks the problem by a factor of
> three rather than two, it comes out slightly ahead.
The things I was reading suggested that a 4-heap, i.e. a heap with
four children per node that is still representable as a flat array,
is more cache-efficient. There's also the Fibonacci heap, which makes
insertions really fast, but since we're doing exactly as many
insertions as deletions the question is how much work it adds to
deletions. Deletion is still O(log n) amortized, but the constant
factor may be twice as large.
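For concreteness, sift-down in such a flat-array 4-heap looks roughly
like this (sketched over plain int keys rather than actual tuples):

    /* Sift-down in a min 4-heap stored as a flat array: root at index
     * 0, children of node i at 4*i+1 .. 4*i+4.  The four children sit
     * in adjacent slots, so they tend to share a cache line. */
    static void
    heap4_siftdown(int *a, int n, int i)
    {
        for (;;)
        {
            int smallest = i;
            int first = 4 * i + 1;

            for (int c = first; c < first + 4 && c < n; c++)
                if (a[c] < a[smallest])
                    smallest = c;
            if (smallest == i)
                break;
            int tmp = a[i]; a[i] = a[smallest]; a[smallest] = tmp;
            i = smallest;
        }
    }

The comparison counts come out about even (up to four per level at
half the depth, versus two per level in a binary heap), so any win
has to come from cache locality rather than from fewer comparisons.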
Fwiw I think more interesting than improving tapesort would be
implementing wholly different algorithms, like radix sort or repeated
quicksort. Being able to handle algorithms that require a different
API would be the first step towards handling parallel algorithms
using threads or GPUs.
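To illustrate the flavor of the first of those: a minimal LSD radix
sort over fixed-width unsigned keys needs no comparator at all, which
is exactly why it wouldn't fit the current API. (This is only the
core idea; real tuples bring NULLs, multiple keys, and variable-width
types into it.)

    #include <stdint.h>
    #include <string.h>

    /* LSD radix sort of n uint32 keys, one byte per pass, using a
     * caller-supplied scratch array of the same size.  Each pass is a
     * stable counting sort on one byte, least significant first. */
    void
    radix_sort_u32(uint32_t *a, uint32_t *scratch, size_t n)
    {
        for (int shift = 0; shift < 32; shift += 8)
        {
            size_t count[256] = {0};
            size_t pos = 0;

            for (size_t i = 0; i < n; i++)
                count[(a[i] >> shift) & 0xFF]++;
            for (int b = 0; b < 256; b++)    /* counts -> offsets */
            {
                size_t c = count[b];
                count[b] = pos;
                pos += c;
            }
            for (size_t i = 0; i < n; i++)
                scratch[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, scratch, n * sizeof(uint32_t));
        }
    }

Whether that beats a well-tuned quicksort depends on the key width
and on n, but either way it needs a different API than a
comparator-based sort.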
I also wonder whether disabling the space reuse and just generating a
separate file for each run might be a win on copy-on-write
filesystems like ZFS or WAFL, where overwriting blocks is more
expensive than allocating new ones.
--
greg