Tuning current tuplesort external sort code for 8.2 - Mailing list pgsql-hackers

From Simon Riggs
Subject Tuning current tuplesort external sort code for 8.2
Date
Msg-id 1128371730.8603.117.camel@localhost.localdomain
Whole thread Raw
Responses Re: Tuning current tuplesort external sort code for 8.2
Re: Tuning current tuplesort external sort code for 8.2
List pgsql-hackers
Based upon profiling of the initial stage of external sorting, it seems
that this stage is overall CPU bound, with hotspots in comparetup_*
accounting for around 50% of CPU time; lets just call that too much,
since your exact experience may vary.

Previously, I'd looked through all of the code with respect to the basic
algorithms. These are good, but it seemed very likely that there would
be opportunities to improve on the way things are, so I kept looking.

AFAICS the following opportunities exist, without changing any of the
theoretical algorithms or the flexibility of definable datatypes:

1. tuplesort_heap_siftup and tuplesort_heap_insert make no attempt to
cache the values of keys that have been obtained from *_getattr macros.
The two routines navigate a tournament sort heap, so that on average 50%
of comparisons use at least one immediately preceeding tuple and key
values from that could be cached ready for the next call. Caching would
reduce number of *_getattr calls from 2N to N+1, where N is likely to go
up on average linearly with work_mem. This would reduce the cost of
comparetup_ significantly. The inlined calls to myFunctionCall2 could
also take advantage of that caching to reduce pre-call setup by at least
50% also. (Only the first sort key attr would be cached, since the vast
majority of times only the first sort key will be checked. The sort does
use index number as first sort key at this time, but since this is run-
number, that isn't granular enough to reduce comparisons sufficiently).

All of the remaining ideas relate to NULL handling.

2. In comparetup_ the second attr value is always fetched, even when the
first attr is null. When the first attr is null the value of the second
need never be checked, just whether the second attr is null or not, so
the full cost of the *_getattr need not actually be paid at all. The
relevance of this is not reduced as a result of the caching suggested in
(1).

3. In a great many cases, sorts will be performed on non-nullable attrs,
e.g. PK indexes, many FK indexes, sort-merge joins based upon a FK that
is a subset of the PK (a typical one-many relationship) and groupings
also. In the majority of cases, these attrs are at the start of a tuple.
The *_getattr macros are particularly poor at handling NULLs. When
*_getattr sees *any* NULL is present for a tuple it checks the
nullability of all attrs up to the current attrnum before returning
using the cached offsets. The macro could be altered so that if the
current attrnum < firstNullableAttrnum (which we can set once for the
high level tupleDesc, rather than once per tuple) then we use the cached
offset, whether or not other nulls exist within the tuple. If not, then
we can start testing for nullability from the firstNullableAttrnum.
Currently, if we are *slow* according to nocachegetattr, i.e. there was
a prior NULL value, then we forget which one that was and go and re-
check them all from the start again. When slow, we could start
calculating the offset using the cached value of firstNull and then
working up from there. (All of that relates to the macros in general,
though they aren't actually used anymore apart from in various catalog
fetches and COPY TO)

Also, there is an opportunity to modify the run building with respect to
NULL values. Knuth's algorithm doesn't take into account 3VL at all, so
he might have wanted to do the following, if he could:

4. In an external sort we do a k passes through the total sort file.
During run building, the first comparison will reveal that a value has a
leading NULL sort key attr. NULLs always sort higher/lower than all
other values, yet are equal to each other. Once we know a tuple has a
leading NULL sort key attr, we could divert it to a NULL-holding file to
avoid involving it in many pointless comparisons and read/write I/Os.
Once all tuples have been emitted, the NULL-holding file can itself be
sorted recursively (starting at the second key etc.). At the end, the
NULL-holding file can be either read first or last, according to where
NULLs are placed (hi/lo/asc/desc). That technique might be of use when
we are trying to stay just inside memory, or when we have so many sort
passes that saving time on the NULLs could be a real saving; this would
likely be too much code for too little benefit. (This may be orthogonal
to the idea of using very large numbers of virtual tapes).

Other possibilities exist, most notable of which is the > 6 tape merge
already mentioned by Tom. None of the above conflicts with that change.

Assuming these sound good to all, I'll be starting to write up these
ideas in a couple of weeks, though comments on these specific code
suggestions are welcome now.

It may be possible to improve upon the basic theoretical algorithms, but
I'm not looking to try that right now. We've got enough ideas here to
make some good progress over next few months.

Best Regards, Simon Riggs



pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: effective SELECT from child tables
Next
From: Josh Berkus
Date:
Subject: Re: [PERFORM] A Better External Sort?