Re: qsort again (was Re: [PERFORM] Strange Create - Mailing list pgsql-hackers

From Ron
Subject Re: qsort again (was Re: [PERFORM] Strange Create
Date
Msg-id 7.0.1.0.2.20060216100954.035b5ee0@earthlink.net
In response to Re: qsort again (was Re: [PERFORM] Strange Create Index  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
At 09:48 AM 2/16/2006, Martijn van Oosterhout wrote:
>On Thu, Feb 16, 2006 at 08:22:55AM -0500, Ron wrote:
> > 3= Especially in modern systems where the gap between internal CPU
> > bandwidth and memory bandwidth is so great, the overhead of memory
> > accesses for comparisons and moves is the majority of the overhead
> > for both the pivot choosing and the partitioning algorithms within
> > quicksort.  Particularly random memory accesses.  The reason (#GPRs -
> > 1) is a magic constant is that it's the most you can compare and move
> > using only register-to-register operations.
>
>But how much of this applies to us? We're not sorting arrays of
>integers, we're sorting pointers to tuples. So while moves cost very
>little, a comparison costs hundreds, maybe thousands of cycles. A tuple
>can easily be two or three cachelines and you're probably going to
>access all of it, not to mention the Fmgr structures and the Datums
>themselves.
Pointers are simply fixed-size 32-bit or 64-bit quantities.  They are
essentially integers.  Comparing and moving pointers, or fixed-size
keys attached to those pointers, is exactly the same problem as
comparing and moving integers.

Comparing =or= moving the actual data structures is a much more
expensive and variable cost proposition.  I'm sure that pg's sort
functionality is written intelligently enough that the only real data
moves are done in a final pass after the exact desired order has been
found using pointer compares and (re)assignments during the sorting
process.  That's a standard technique for sorting data whose "key" or
pointer is much smaller than the datum itself.
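To illustrate the technique (this is a generic sketch, not pg's actual
tuplesort code; the Record layout and sort_records name are made up for
the example): sort only the pointers, then move each datum once in a
final pass.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  key;
    char payload[60];   /* datum is much larger than its 8-byte pointer */
} Record;

/* Compare records through their pointers: an integer compare plus one
 * dereference; no record is ever moved during the sort itself. */
static int cmp_by_ptr(const void *a, const void *b)
{
    const Record *ra = *(const Record *const *)a;
    const Record *rb = *(const Record *const *)b;
    return (ra->key > rb->key) - (ra->key < rb->key);
}

/* Sort an array of pointers, then do a single final pass of real data moves. */
void sort_records(Record *recs, size_t n)
{
    Record **ptrs = malloc(n * sizeof *ptrs);
    Record  *out  = malloc(n * sizeof *out);
    for (size_t i = 0; i < n; i++)
        ptrs[i] = &recs[i];
    qsort(ptrs, n, sizeof *ptrs, cmp_by_ptr);   /* moves pointers only */
    for (size_t i = 0; i < n; i++)
        out[i] = *ptrs[i];                      /* each datum moved exactly once */
    memcpy(recs, out, n * sizeof *recs);
    free(ptrs);
    free(out);
}
```

The point is that the O(n lg n) move traffic during sorting involves
only pointer-sized quantities; the large payloads incur just n moves at
the end.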

Your cost comment basically agrees with mine regarding the cost of
random memory accesses.  The good news is that the number of datums
to be examined during the pivot choosing process is small enough that
the datums can fit into CPU cache while the pointers to them can be
assigned to registers: making pivot choosing +very+ fast when done correctly.

As you've noted, actual partitioning is going to be more expensive
since it involves accessing enough actual datums that they can't all
fit into CPU cache.  The good news is that QuickSort has a very
sequential access pattern within its inner loop.  So while we must go
to memory for compares, we are at least keeping that cost to a
minimum.  In addition, said access pattern is nice enough to be very
prefetch and CPU cache hierarchy friendly.
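For concreteness, here is a minimal Lomuto-style partition (a textbook
sketch on plain ints, not pg's code; pivot selection is assumed to have
already placed the pivot at a[hi]).  Both the scanning index and the
swap targets only ever advance left to right, which is what makes the
inner loop prefetch friendly:

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition: one strictly sequential sweep over a[lo..hi]. */
static size_t partition(int *a, size_t lo, size_t hi)
{
    int pivot = a[hi];              /* pivot already chosen and parked at a[hi] */
    size_t i = lo;
    for (size_t j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap_int(&a[i++], &a[j]);
    swap_int(&a[i], &a[hi]);
    return i;
}

void quicksort(int *a, size_t lo, size_t hi)
{
    if (lo < hi) {
        size_t p = partition(a, lo, hi);
        if (p > lo)                 /* guard size_t underflow at p == 0 */
            quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    }
}
```

Even when a[lo..hi] is far larger than cache, the hardware prefetcher
can stay ahead of this loop, unlike the scattered accesses a heap-based
sort makes.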


>None of this is cache friendly. The actual tuples themselves could be
>spread all over memory (I don't think any particular effort is expended
>trying to minimize fragmentation).
It probably would be worth it to spend some effort on memory layout
just as we do for HD layout.


>Do these algorithms discuss the case where a comparison is more than
>1000 times the cost of a move?
A move is always more expensive than a compare when the datum is
larger than its pointer or key.  A move is always more expensive than
a compare when it involves memory to memory movement rather than CPU
location to CPU location movement.  A move is especially more
expensive than a compare when it involves both factors.  Most moves
do involve both.

What I suspect you meant is that a key comparison that involves
accessing the data in memory is more expensive than reassigning the
pointers associated with those keys.   That is certainly true.

Yes.  The problem has been extensively studied. ;-)


>Where this does become interesting is where we can convert a datum to
>an integer such that if f(A) > f(B) then A > B. Then we can sort on
>f(X) first with just integer comparisons and then do a full tuple
>comparison only if f(A) = f(B). This would be much more cache-coherent
>and make these algorithms much more applicable in my mind.
In fact we can do better.
Using hash codes or what-not to map datums to keys and then sorting
just the keys and the pointers to those datums followed by an
optional final pass where we do the actual data movement is also a
standard technique for handling large data structures.
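A minimal sketch of the f(X) idea for strings (hypothetical names, not
pg's implementation): pack the first four bytes into a big-endian
integer, so f(A) > f(B) implies A > B, and fall back to the full
comparison only on ties.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t    abbrev;   /* f(X): cache-resident, compared as an integer */
    const char *full;     /* pointer to the full datum, touched only on ties */
} SortKey;

/* Big-endian packing preserves byte-wise (strcmp) order in the prefix. */
uint32_t abbreviate(const char *s)
{
    uint32_t v = 0;
    for (int i = 0; i < 4 && s[i] != '\0'; i++)
        v |= (uint32_t)(unsigned char)s[i] << (24 - 8 * i);
    return v;
}

int cmp_abbrev(const void *pa, const void *pb)
{
    const SortKey *a = pa, *b = pb;
    if (a->abbrev != b->abbrev)          /* cheap: two ints, no pointer chase */
        return a->abbrev < b->abbrev ? -1 : 1;
    return strcmp(a->full, b->full);     /* full compare only when f(A) == f(B) */
}
```

Most comparisons resolve on the integer and never dereference the datum
at all, which is exactly the cache-coherence win described above.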


Regardless of what tweaks beyond the basic algorithms we use, the
algorithms themselves have been well studied and their performance
well established.  QuickSort is the best performing of the O(n lg n)
comparison-based sorts, and it uses fewer resources than HeapSort or MergeSort.

Ron


