Thread: Tuning current tuplesort external sort code for 8.2

Tuning current tuplesort external sort code for 8.2

From
Simon Riggs
Date:
Based upon profiling of the initial stage of external sorting, it seems
that this stage is overall CPU bound, with hotspots in comparetup_*
accounting for around 50% of CPU time; lets just call that too much,
since your exact experience may vary.

Previously, I'd looked through all of the code with respect to the basic
algorithms. These are good, but it seemed very likely that there would
be opportunities to improve on the way things are, so I kept looking.

AFAICS the following opportunities exist, without changing any of the
theoretical algorithms or the flexibility of definable datatypes:

1. tuplesort_heap_siftup and tuplesort_heap_insert make no attempt to
cache the values of keys that have been obtained from *_getattr macros.
The two routines navigate a tournament sort heap, so that on average 50%
of comparisons use at least one immediately preceeding tuple and key
values from that could be cached ready for the next call. Caching would
reduce number of *_getattr calls from 2N to N+1, where N is likely to go
up on average linearly with work_mem. This would reduce the cost of
comparetup_ significantly. The inlined calls to myFunctionCall2 could
also take advantage of that caching to reduce pre-call setup by at least
50% also. (Only the first sort key attr would be cached, since the vast
majority of times only the first sort key will be checked. The sort does
use index number as first sort key at this time, but since this is run-
number, that isn't granular enough to reduce comparisons sufficiently).

All of the remaining ideas relate to NULL handling.

2. In comparetup_ the second attr value is always fetched, even when the
first attr is null. When the first attr is null the value of the second
need never be checked, just whether the second attr is null or not, so
the full cost of the *_getattr need not actually be paid at all. The
relevance of this is not reduced as a result of the caching suggested in
(1).

3. In a great many cases, sorts will be performed on non-nullable attrs,
e.g. PK indexes, many FK indexes, sort-merge joins based upon a FK that
is a subset of the PK (a typical one-many relationship) and groupings
also. In the majority of cases, these attrs are at the start of a tuple.
The *_getattr macros are particularly poor at handling NULLs. When
*_getattr sees *any* NULL is present for a tuple it checks the
nullability of all attrs up to the current attrnum before returning
using the cached offsets. The macro could be altered so that if the
current attrnum < firstNullableAttrnum (which we can set once for the
high level tupleDesc, rather than once per tuple) then we use the cached
offset, whether or not other nulls exist within the tuple. If not, then
we can start testing for nullability from the firstNullableAttrnum.
Currently, if we are *slow* according to nocachegetattr, i.e. there was
a prior NULL value, then we forget which one that was and go and re-
check them all from the start again. When slow, we could start
calculating the offset using the cached value of firstNull and then
working up from there. (All of that relates to the macros in general,
though they aren't actually used anymore apart from in various catalog
fetches and COPY TO)

Also, there is an opportunity to modify the run building with respect to
NULL values. Knuth's algorithm doesn't take into account 3VL at all, so
he might have wanted to do the following, if he could:

4. In an external sort we do a k passes through the total sort file.
During run building, the first comparison will reveal that a value has a
leading NULL sort key attr. NULLs always sort higher/lower than all
other values, yet are equal to each other. Once we know a tuple has a
leading NULL sort key attr, we could divert it to a NULL-holding file to
avoid involving it in many pointless comparisons and read/write I/Os.
Once all tuples have been emitted, the NULL-holding file can itself be
sorted recursively (starting at the second key etc.). At the end, the
NULL-holding file can be either read first or last, according to where
NULLs are placed (hi/lo/asc/desc). That technique might be of use when
we are trying to stay just inside memory, or when we have so many sort
passes that saving time on the NULLs could be a real saving; this would
likely be too much code for too little benefit. (This may be orthogonal
to the idea of using very large numbers of virtual tapes).

Other possibilities exist, most notable of which is the > 6 tape merge
already mentioned by Tom. None of the above conflicts with that change.

Assuming these sound good to all, I'll be starting to write up these
ideas in a couple of weeks, though comments on these specific code
suggestions are welcome now.

It may be possible to improve upon the basic theoretical algorithms, but
I'm not looking to try that right now. We've got enough ideas here to
make some good progress over next few months.

Best Regards, Simon Riggs



Re: Tuning current tuplesort external sort code for 8.2

From
Martijn van Oosterhout
Date:
On Mon, Oct 03, 2005 at 09:35:30PM +0100, Simon Riggs wrote:
> Based upon profiling of the initial stage of external sorting, it seems
> that this stage is overall CPU bound, with hotspots in comparetup_*
> accounting for around 50% of CPU time; lets just call that too much,
> since your exact experience may vary.

Indeed, however as I pointed out, if you arrange for
inlineApplySortFunction() actually be inlined, you can cut costs,
especially in the index creation case.

<snip>
> values from that could be cached ready for the next call. Caching would
> reduce number of *_getattr calls from 2N to N+1, where N is likely to go

My profiling indicates that the second getattr is half the cost of the
first, gcc optimisation at work. Note that setting CFLAGS=-pg for
configure disables optimisation, I missed that the first time.
Ofcourse, every call saved is time saved.

> 2. In comparetup_ the second attr value is always fetched, even when the
> first attr is null. When the first attr is null the value of the second
> need never be checked, just whether the second attr is null or not, so
> the full cost of the *_getattr need not actually be paid at all. The
> relevance of this is not reduced as a result of the caching suggested in
> (1).

Actually, attribute is null is the cheap case because you only need to
check the bitmap. But you could optimise stuff by expanding the
*_getattr calls and optimising directly. Possible problem with caching:
if you're called by the system qsort, can you assume anything about the
order of the comparisons?

Please note: if inlineApplySortFunction() is actually inlined (it isn't
by default), gcc does get very smart about this and sometimes optimises
out the Datum fetches depending on the isNull flags. So we need to
check we're actually making an improvement over the compiler.

<snip>

> is a subset of the PK (a typical one-many relationship) and groupings
> also. In the majority of cases, these attrs are at the start of a tuple.
> The *_getattr macros are particularly poor at handling NULLs. When
> *_getattr sees *any* NULL is present for a tuple it checks the
> nullability of all attrs up to the current attrnum before returning
> using the cached offsets. The macro could be altered so that if the
> current attrnum < firstNullableAttrnum (which we can set once for the

Maybe easier, in the macro use: bitmap & ((1<<attnum)-1) to quickly
check that no nulls precede the value we're looking for and hence we
can use the fast path anyway. Along the lines of:

#define index_getattr(tup, attnum, tupleDesc, isnull) \
( \       AssertMacro(PointerIsValid(isnull) && (attnum) > 0), \       *(isnull) = false, \
!IndexTupleHasNulls(tup)|| (attnum < 32 && (NullBitmap(tup) & ((1<<attnum)-1)) == 0 ) ? \       ( \
(tupleDesc)->attrs[(attnum)-1]->attcacheoff>= 0 ? \ 

Nice ideas though, a seperate run just for NULL keys is interesting. If
you only have one sort key it becomes a whole tape which doesn't need
to be sorted anymore, just emit it at the beginning or end. Could be
helpful.

Mind you, if you start creating seperate routines for different cases
you can go a long way. Elsewhere on this list I created a special case
for single-key integer index columns and got an 8% speed increase. Not
exactly a viable solution though.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Tuning current tuplesort external sort code for 8.2

From
Simon Riggs
Date:
On Mon, 2005-10-03 at 23:25 +0200, Martijn van Oosterhout wrote:
> Possible problem with caching:
> if you're called by the system qsort, can you assume anything about the
> order of the comparisons?

That applies only to the non-external sort case, which I'm not trying to
improve with these suggestions. (No, you can't assume that, it was
a heapsort only suggestion).

> Please note: if inlineApplySortFunction() is actually inlined (it isn't
> by default)

Can you explain your last post some more. Thats not what I get.

> > is a subset of the PK (a typical one-many relationship) and groupings
> > also. In the majority of cases, these attrs are at the start of a tuple.
> > The *_getattr macros are particularly poor at handling NULLs. When
> > *_getattr sees *any* NULL is present for a tuple it checks the
> > nullability of all attrs up to the current attrnum before returning
> > using the cached offsets. The macro could be altered so that if the
> > current attrnum < firstNullableAttrnum (which we can set once for the
> 
> Maybe easier, in the macro use: bitmap & ((1<<attnum)-1) to quickly
> check that no nulls precede the value we're looking for and hence we
> can use the fast path anyway. Along the lines of:

You may be right, the exact code that brings the right benefit is
somewhat trickier than spotting the opportunity.

> Mind you, if you start creating seperate routines for different cases
> you can go a long way. Elsewhere on this list I created a special case
> for single-key integer index columns and got an 8% speed increase. Not
> exactly a viable solution though.

But an interesting one. Once we've done everything else, that use case
is close to the top of my list, if the performance gain was still as
useful, all other things considered. 

Best Regards, Simon Riggs



Re: Tuning current tuplesort external sort code for 8.2

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> AFAICS the following opportunities exist, without changing any of the
> theoretical algorithms or the flexibility of definable datatypes:

> 1. tuplesort_heap_siftup and tuplesort_heap_insert make no attempt to
> cache the values of keys that have been obtained from *_getattr macros.
> The two routines navigate a tournament sort heap, so that on average 50%
> of comparisons use at least one immediately preceeding tuple and key
> values from that could be cached ready for the next call.

Hmm ... this seems interesting, but you also have to look at the
potential downside: what is the cost of doing the caching?

> All of the remaining ideas relate to NULL handling.

I can't get excited about this.  Most sort scenarios involve few if any
nulls.

One thought that comes to mind is that the current system structure
encourages sorting on keys that are at the end of their tuples.
For instance, "SELECT foo, bar FROM ... ORDER BY baz" will sort by
the third column of the generated tuples, which is certainly the least
efficient to access.  It'd be interesting to look into generating the
working tuples with baz as the first column.  I fear this is nontrivial,
because there are a lot of places that expect resjunk columns to come
last, but we should study it.  (Note: this will do nada for Josh's
original complaint about index build time, since btree index sorts will
by definition use all the tuple's columns, but it seems like a
significant issue for ordinary sorts.)
        regards, tom lane


Re: Tuning current tuplesort external sort code for 8.2

From
Simon Riggs
Date:
On Tue, 2005-10-04 at 00:21 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > AFAICS the following opportunities exist, without changing any of the
> > theoretical algorithms or the flexibility of definable datatypes:
> 
> > 1. tuplesort_heap_siftup and tuplesort_heap_insert make no attempt to
> > cache the values of keys that have been obtained from *_getattr macros.
> > The two routines navigate a tournament sort heap, so that on average 50%
> > of comparisons use at least one immediately preceeding tuple and key
> > values from that could be cached ready for the next call.
> 
> Hmm ... this seems interesting, but you also have to look at the
> potential downside: what is the cost of doing the caching?

Actually, nothing for the cache-1-tuple case. We just re-initialise the
fcinfo based around whether we are caching 0,1 or 2 fields. We keep two
fcinfo structures, so that when we cache=0 fields we do not pollute the
cache for the cache=1 case. 

tuplesort_heap_insert allows us to cache one value each time round the
inner loop, always using fcinfo1. (cache=1). tuplesort_heap_siftup has
two comparisons per iteration; it allows us to cache 0 values on the
first call, so we use fcinfo0. We can always cache one value for the
second call, exactly same as _heap_insert case, so we use fcinfo1. Maybe
we can also reuse the winner from the first call as the second input to
the second call. ehem...It's easier to see when you look at the code.

We need to work a bit harder to get the cache=2 case but its possible,
though I take your point that it may not be worth it. I'll do the
cache=1 case before attempting the cache=2 case, so we can measure the
gain.

We run _heap_insert and _heap_siftup once for each incoming tuple, so
the saving should be good.

> > All of the remaining ideas relate to NULL handling.
> 
> I can't get excited about this. Most sort scenarios involve few if any
> nulls.

The idea would be much less important anyway if you follow up on this
next idea:

> One thought that comes to mind is that the current system structure
> encourages sorting on keys that are at the end of their tuples.
> For instance, "SELECT foo, bar FROM ... ORDER BY baz" will sort by
> the third column of the generated tuples, which is certainly the least
> efficient to access.  

Not sure what goes on in there. Are you saying heap sort cols are always
at the end, or only in the cases you mention? What about where we're
doing a GROUP BY, so all the sort columns are always part of the select
list and frequently (by coding convention) at the front of the row?

If the cols aren't always in resjunk we can come up with some
performance test cases to check whether or not this is a problem, how
big and where it starts to show itself.

> It'd be interesting to look into generating the
> working tuples with baz as the first column.  I fear this is nontrivial,
> because there are a lot of places that expect resjunk columns to come
> last, but we should study it.  

It could be worth it to arrange the main select's attr list so that the
sort keys always come first. I'd settle just for that if the second part
is too much work.

It sounds like hard work to optimise the case where the ORDER BY is on
columns not mentioned in the select. The latter seems worth it for when
we do a sort-merge based on join columns not mentioned in the SELECT,
but not for optimizing sort-aggregate or one-table-select cases.

> (Note: this will do nada for Josh's
> original complaint about index build time, since btree index sorts will
> by definition use all the tuple's columns, but it seems like a
> significant issue for ordinary sorts.)

Understood, but we do care about heap sorts too! Heap and index sorts
have many dissimilarities, but both are important, so perhaps we should
tune for each case separately.

Best Regards, Simon Riggs



Re: Tuning current tuplesort external sort code for 8.2

From
Martijn van Oosterhout
Date:
On Tue, Oct 04, 2005 at 12:37:51AM +0100, Simon Riggs wrote:
> On Mon, 2005-10-03 at 23:25 +0200, Martijn van Oosterhout wrote:
> > Please note: if inlineApplySortFunction() is actually inlined (it isn't
> > by default)
>
> Can you explain your last post some more. Thats not what I get.

The inline keyword is just a flag that you would like the compiler to
inline it. GCC will decide itself. It has a limit on the size of
functions that it will consider for inlining.

Quote the gcc manual:

-finline-limit-N    By default, gcc limits the size of functions that can be inlined.    This flag allows the control
ofthis limit for functions that are    explicitly marked as inline (ie marked with the inline keyword or    defined
withinthe class definition in c++).  N is the size of    functions that can be inlined in number of pseudo instructions
  (not counting parameter handling).   

It goes in to say that the limit is 10000 for gcc 2.95, but if you
examine the manual for gcc 3.3 it has the limit at 600. So it's
entirely possible that at the time the person wrote that code, it *was*
being inlined, but it sure isn't on some versions of some compilers. I
experimented and found that -finline-limit-1500 causes it to start
inlining.

The -Winline flag causes gcc to print a warning if you specified
"inline" but gcc didn't inline it, for whatever reason.

Hopefully this clears that up.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Tuning current tuplesort external sort code for 8.2

From
Martijn van Oosterhout
Date:
On Tue, Oct 04, 2005 at 11:55:58AM +0200, Martijn van Oosterhout wrote:
> It goes in to say that the limit is 10000 for gcc 2.95, but if you
> examine the manual for gcc 3.3 it has the limit at 600. So it's
> entirely possible that at the time the person wrote that code, it *was*
> being inlined, but it sure isn't on some versions of some compilers. I
> experimented and found that -finline-limit-1500 causes it to start
> inlining.

From searching the web, it appears the inline limit was dropped from
10000 to 600 between gcc 3.0.0 and 3.0.1 in response to complaints
about gcc memory usage. Any function that could be inlined needed to be
kept in memory in semicompiled form and in C++ where lots of inlinable
functions call eachother, the memory usage blew up completely.

The difference between -O2 and -O3 is that the latter will consider any
function for inlining, even if you didn't ask. For C programs that
basically means any function declared static. Also, the number is
"pseudo-instructions" and their meaning can change from version to
version.

Since we're pretty much relying on gcc to inline for performance, I
still think we should add -Winline by default so we can tell when it's
not doing what we want.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Tuning current tuplesort external sort code for 8.2

From
Simon Riggs
Date:
On Tue, 2005-10-04 at 12:37 +0200, Martijn van Oosterhout wrote:

> Since we're pretty much relying on gcc to inline for performance, I
> still think we should add -Winline by default so we can tell when it's
> not doing what we want.

Very much agree to this. Can we put that in the build for 8.1 please?

People need to know we expect certain code to be inlined and they need
warnings when this does not occur.

Best Regards, Simon Riggs