Thread: WIP: further sorting speedup

WIP: further sorting speedup

From
Tom Lane
Date:
After applying Simon's recent sort patch, I was doing some profiling and
noticed that sorting spends an unreasonably large fraction of its time
extracting datums from tuples (heap_getattr or index_getattr).  The
attached patch does something about this by pulling out the leading sort
column of a tuple when it is received by the sort code or re-read from a
"tape".  This increases the space needed by 8 or 12 bytes (depending on
sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as
the on-disk representation goes.  The effort needed to extract the datum
at this point is well repaid because the tuple will normally undergo
multiple comparisons while it remains in memory.  In some quick tests
the patch seemed to make for a significant speedup, on the order of 30%,
despite increasing the number of runs emitted because of the smaller
available memory.

The choice to pull out just the leading column, rather than all columns,
is driven by concerns of (a) code complexity and (b) memory space.
Having the extra columns pre-extracted wouldn't buy anything anyway
in the common case where the leading key determines the result of
a comparison.
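
For illustration, the shape of the idea is roughly this (a minimal
sketch; the struct layout and function names are illustrative, not
lifted from the patch):

    /* Sketch only: one entry per in-memory tuple.  The added space
     * is the datum1/isnull1 pair (plus alignment padding). */
    typedef struct SortTuple
    {
        void   *tuple;          /* the HeapTuple or IndexTuple proper */
        Datum   datum1;         /* value of the leading sort column */
        bool    isnull1;        /* is it NULL? */
    } SortTuple;

    /* On receiving a tuple, or re-reading one from a "tape": */
    static void
    store_tuple(Tuplesortstate *state, SortTuple *stup, HeapTuple tup)
    {
        stup->tuple = tup;
        stup->datum1 = heap_getattr(tup,
                                    state->scanKeys[0].sk_attno,
                                    state->tupDesc,
                                    &stup->isnull1);
    }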

This is still WIP because it leaks memory intra-query (I need to fix it
to clean up palloc'd space better).  I thought I'd post it now in case
anyone wants to try some measurements for their own favorite test cases.
In particular it would be interesting to see what happens for a
multi-column sort with lots of duplicated keys in the first column,
which is the case where the least advantage would be gained.

Comments?

            regards, tom lane



Re: WIP: further sorting speedup

From
"Luke Lonergan"
Date:
Cool!

We’ll test this sometime soon and get back to you.  We’re kind of jammed this week; hopefully we’ll get some time.

So you know, we’ve done some more work on the external sort to remove the “tape” abstraction from the code, which makes a significant improvement.  This involved removing both the Knuth tapes, and the logtape.c codepath.  The result is a reasonable improvement in performance (tens of percent), and a dramatic reduction in the amount of code.

Since we’re looking for a 4-fold improvement based on comparisons to “other commercial databases”, we feel we’re not done yet.  Our next step (before we got jammed getting our latest MPP release out) was to implement these:
  • Locate the cause of the excessive time spent in heap_getattr (you just did it)
  • Implement something other than replacement selection for creating runs to optimize cache use

- Luke

Re: WIP: further sorting speedup

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> So you know, we’ve done some more work on the external sort to remove the
> “tape” abstraction from the code, which makes a significant improvement.

Improvement where?  That code's down in the noise so far as I can tell.
I see results like this (with the patched code):

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
147310   31.9110  tuplesort_heap_siftup
68381    14.8130  comparetup_index
34063     7.3789  btint4cmp
22573     4.8899  AllocSetAlloc
19317     4.1845  writetup_index
18953     4.1057  tuplesort_gettuple_common
18100     3.9209  mergepreread
17083     3.7006  GetMemoryChunkSpace
12527     2.7137  LWLockAcquire
11686     2.5315  LWLockRelease
6172      1.3370  tuplesort_heap_insert
5392      1.1680  index_form_tuple
5323      1.1531  PageAddItem
4943      1.0708  LogicalTapeWrite
4525      0.9802  LogicalTapeRead
4487      0.9720  LockBuffer
4217      0.9135  heapgettup
3891      0.8429  IndexBuildHeapScan
3862      0.8366  ltsReleaseBlock

It appears that a lot of the cycles blamed on tuplesort_heap_siftup are
due to cache misses associated with referencing memtuples[] entries
that have fallen out of L2 cache.  Not sure how to improve that though.
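
For reference, the operation has roughly this shape (simplified, with
COMPARE standing in for the real comparator).  The children of slot i
live at 2*i+1 and 2*i+2, so successive probes land roughly a factor of
two apart in the array; once memtuples[] outgrows L2, nearly every
probe is a miss:

    /* Remove the top of the heap and sift the last entry down. */
    static void
    heap_siftup(SortTuple *memtuples, int *memtupcount)
    {
        int         i = 0;
        int         n = --(*memtupcount);
        SortTuple   tuple = memtuples[n];   /* entry to sift down */

        for (;;)
        {
            int     j = 2 * i + 1;

            if (j >= n)
                break;
            if (j + 1 < n && COMPARE(&memtuples[j + 1], &memtuples[j]) < 0)
                j++;                        /* take the smaller child */
            if (COMPARE(&tuple, &memtuples[j]) <= 0)
                break;
            memtuples[i] = memtuples[j];    /* hole moves down a level */
            i = j;
        }
        memtuples[i] = tuple;
    }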

            regards, tom lane

Re: WIP: further sorting speedup

From
"Luke Lonergan"
Date:
The improvement was measured before Simon’s patch, and it came from implementing a single-pass merge instead of a variable number of passes driven by the number of tapes, as in Knuth’s tape algorithm.  Also, the additional tricks in logtape.c were higher in the profile than what I see here.

Simon’s patch had the effect of reducing the number of passes by increasing the number of tapes depending on the memory available, but that’s a long-tail effect, as seen in figure (70?) in Knuth.

Where I’d like this to go is the implementation of a two-pass “create runs, merge”, where the second merge can be avoided unless random access is needed (as discussed previously on list).

In the run creation phase, the idea would be to implement something like quicksort or another L2-cache-friendly algorithm (ideas?)
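
As a toy illustration of the pattern (standard C sorting raw ints, not
PostgreSQL code), run formation then reduces to fill, sort, emit:

    #include <stdio.h>
    #include <stdlib.h>

    /* Fill a work array, sort it in one shot, emit it as a run,
     * repeat.  Each qsort call works over a contiguous array with
     * good locality; the price is runs of length ~memory rather
     * than the ~2*memory that replacement selection achieves. */

    #define WORK_SLOTS 1024     /* stand-in for work_mem */

    static int
    cmp_int(const void *a, const void *b)
    {
        int     x = *(const int *) a;
        int     y = *(const int *) b;

        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        int     work[WORK_SLOTS];
        size_t  n;

        while ((n = fread(work, sizeof(int), WORK_SLOTS, stdin)) > 0)
        {
            qsort(work, n, sizeof(int), cmp_int);
            fwrite(work, sizeof(int), n, stdout);   /* one sorted run */
        }
        return 0;
    }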

- Luke



Re: WIP: further sorting speedup

From
Simon Riggs
Date:
On Sun, 2006-02-19 at 21:40 -0500, Tom Lane wrote:
> After applying Simon's recent sort patch, I was doing some profiling and
> noticed that sorting spends an unreasonably large fraction of its time
> extracting datums from tuples (heap_getattr or index_getattr).  The
> attached patch does something about this by pulling out the leading sort
> column of a tuple when it is received by the sort code or re-read from a
> "tape".  This increases the space needed by 8 or 12 bytes (depending on
> sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as
> the on-disk representation goes.  The effort needed to extract the datum
> at this point is well repaid because the tuple will normally undergo
> multiple comparisons while it remains in memory.  In some quick tests
> the patch seemed to make for a significant speedup, on the order of 30%,
> despite increasing the number of runs emitted because of the smaller
> available memory.

Yeh, this is essentially the cache-the-heapgetattr idea. I'd been trying
to get that to perform by caching the fcinfo values, which was a more
complex approach and relied on full-key extraction. To my chagrin, I had
great difficulty that way, but the win from first-key extraction now
explains why. The 30% speedup matches my original expectation.
Anyway, kudos to you.

[I'd stopped working on that to give Tom some space]

> The choice to pull out just the leading column, rather than all columns,
> is driven by concerns of (a) code complexity and (b) memory space.
> Having the extra columns pre-extracted wouldn't buy anything anyway
> in the common case where the leading key determines the result of
> a comparison.

I think key extraction is a good idea, for more reasons than just the
heapgetattr cost.

For longer heap rows, putting the whole row through the sort is inferior
to extracting all keys plus a pointer to the tuple, according to:

Nyberg et al., "AlphaSort: A Cache-Sensitive Parallel External Sort",
VLDB Journal 4(4): 603-627 (1995).

The above paper makes a good case that full-key extraction is a great
idea above a tuple length of 16 bytes, i.e. we don't need to do it for
most CREATE INDEX situations, but it would be very helpful for heap
sorts.

I agree that as long as we are swamped by the cost of heapgetattr, then
it does seem likely that first-key extraction (and keeping it with the
tuple itself) will be a win in most cases over full-key extraction.

Nyberg et al also touch on a further point, which Luke has just
mentioned briefly on list (but we have discussed at greater length). Now
that we have dropped the restriction of N=6, giving very large numbers
of runs, this also weakens the argument for heapsort as the run-forming
algorithm. The reason for choosing heapsort was that it gave runs
~2*Size(memory), whereas qsort produces runs only ~Size(memory). But if
we can have as many runs as we like, then using qsort is not an issue.
Which is good, because qsort is faster in practice and, much more
importantly, performs better with larger memory: heapsort seems to
suffer a slowdown when you give it *too much* memory.

Which leaves the conclusion: further tuning of the heapsort mechanism is
*probably* not worthwhile in relation to the run-forming stage of
sorting. (The variable nature of the qsort algorithm might seem an
issue, but if we execute it N times for N runs, we'll get much less
variable performance than we saw on those individual tests earlier, so
the predictability of heapsort isn't as important a reason to keep it.)
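
To put rough numbers on that: with 1GB of sort memory and a 20GB input
in random order, replacement selection forms runs averaging about 2GB
(Knuth's 2*memory result), i.e. roughly 10 runs, while a
sort-a-memoryload-at-a-time approach forms about 20 runs of roughly
1GB each.  Once there are enough tapes to merge either count in a
single pass, the difference in run count stops mattering, and the
per-tuple CPU and cache behavior of the run-forming sort dominates.
(Illustrative figures, not measurements.)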

But it seems this patch provides a win that is not dependent upon the
sort algorithm used, so it's a win whatever we do.

(We still need the heapsort for the final merge, so I think we still
need to look at tuning of the final merge stage when we have a very
large work_mem setting, or simply limiting the size of the heap used
above a certain point.)

> This is still WIP because it leaks memory intra-query (I need to fix it
> to clean up palloc'd space better).  I thought I'd post it now in case
> anyone wants to try some measurements for their own favorite test cases.

> In particular it would be interesting to see what happens for a
> multi-column sort with lots of duplicated keys in the first column,
> which is the case where the least advantage would be gained.

Hmmm, well it seems clear that there will be an optimum number of keys
to be pre-extracted for any particular sort, though we will not be able
to tell what that is until we are mid-way through the sort. Using
heapsort we wouldn't really have much opportunity to decide to change
the number of columns pre-extracted...if we did use qsort, we could
learn from the last run how many keys to pre-extract.

Incidentally, I do think heapgetattr can be tuned further. When a tuple
has nulls we never use cached offset values. However, we could work out
the firstnullableattno for a table/resultset and keep cached offset
values for all attnum < firstnullableattno. If the tuple contains nulls,
we would just add the bytes for the null flags onto the offset.
Non-nullable columns tend to be at the front of rows and also tend to be
targets for sort keys.
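
A sketch of what that might look like (hypothetical helper, assuming a
precomputed firstnullableattno; the real fast path would live inside
fastgetattr/nocachegetattr):

    /* Attributes strictly before the first nullable column sit at
     * fixed offsets from t_hoff whether or not the tuple has nulls,
     * because a NULL only shifts the attributes that follow it. */
    static Datum
    getattr_before_nullable(HeapTuple tup, int attnum,
                            TupleDesc tupleDesc, bool *isnull,
                            int firstnullableattno)
    {
        Form_pg_attribute att = tupleDesc->attrs[attnum - 1];

        if (attnum < firstnullableattno && att->attcacheoff >= 0)
        {
            *isnull = false;    /* can't be null, by assumption */
            return fetchatt(att, (char *) tup->t_data +
                            tup->t_data->t_hoff + att->attcacheoff);
        }
        /* otherwise fall back to the existing slow path */
        return heap_getattr(tup, attnum, tupleDesc, isnull);
    }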

Best Regards, Simon Riggs


Re: WIP: further sorting speedup

From
Michael Glaesemann
Date:
On Feb 21, 2006, at 3:45, Simon Riggs wrote:

> On Sun, 2006-02-19 at 21:40 -0500, Tom Lane wrote:
>> After applying Simon's recent sort patch, I was doing some profiling
>> and noticed that sorting spends an unreasonably large fraction of
>> its time extracting datums from tuples (heap_getattr or
>> index_getattr).  The attached patch does something about this by
>> pulling out the leading sort column of a tuple when it is received
>> by the sort code or re-read from a "tape".

<snip />

>> The choice to pull out just the leading column, rather than all
>> columns, is driven by concerns of (a) code complexity and (b) memory
>> space.  Having the extra columns pre-extracted wouldn't buy anything
>> anyway in the common case where the leading key determines the
>> result of a comparison.

<snip />

> I agree that as long as we are swamped by the cost of heapgetattr,
> then it does seem likely that first-key extraction (and keeping it
> with the tuple itself) will be a win in most cases over full-key
> extraction.

Most of this is way above my head, but I'm trying to follow along:
when you say first key and full key, are these related to relation
keys (e.g., primary key) or attributes that are used in sorting
(regardless of whether they're a key or not)? I notice Tom used the
term "leading [sort] column", which I read to mean the first
attribute used to sort the relation (for whichever purpose, e.g.,
mergejoins, order-by clauses). I'll see if I can't find the Nyberg
paper as well to learn a bit more. (I haven't been sleeping well
recently.)

Michael Glaesemann
grzm myrealbox com




Re: WIP: further sorting speedup

From
Tom Lane
Date:
Michael Glaesemann <grzm@myrealbox.com> writes:
> Most of this is way above my head, but I'm trying to follow along:
> when you say first key and full key, are these related to relation
> keys (e.g., primary key) or attributes that are used in sorting
> (regardless of whether they're a key or not)? I notice Tom used the
> term "leading [sort] column", which I read to mean the first
> attribute used to sort the relation (for whichever purpose, e.g.,
> mergejoins, order-by clauses).

Right, it's whatever is the sort key for this particular sort.
You could have "SELECT ... ORDER BY foo,bar,baz", or you could
have construction of a multi-column btree index.  In either case,
the sort module is given a set of tuples and told to sort by
certain specified column(s) of those tuples.  What I saw in
profiling was that a large fraction of the CPU time was going
into heap_getattr (or index_getattr, for the index-tuple case)
calls to extract Datums for the sort columns.  The Datums are
then passed to the data-type-specific comparison functions,
such as btint4cmp.  In the original code we did this every time
we compared two tuples.  But a tuple is normally compared multiple
times during a sort (about logN times, in fact), so it makes sense
to do the Datum extraction just once and save the value to use
in comparisons.  The question at hand here is whether to pre-extract
Datums for each column when there are multiple sort columns, or
just extract the first column (which is often all you need for
a comparison anyway).  I think the one-column approach wins
because it keeps the sort data associated with a tuple fixed-size.
Extracting all columns would require a more complex data structure
... plus it would take more memory, and memory space is at a premium
here.
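
In code terms, the comparison ends up shaped roughly like this
(illustrative sketch; compare_datums stands in for the per-type
comparison machinery):

    static int
    comparetup_sketch(const SortTuple *a, const SortTuple *b,
                      Tuplesortstate *state)
    {
        int     compare;
        int     nkey;

        /* leading key: no tuple deforming needed at all */
        compare = compare_datums(state, 0, a->datum1, a->isnull1,
                                 b->datum1, b->isnull1);
        if (compare != 0)
            return compare;

        /* later keys: extracted lazily, only on a first-key tie */
        for (nkey = 1; nkey < state->nKeys; nkey++)
        {
            bool    isnull1, isnull2;
            Datum   d1 = heap_getattr((HeapTuple) a->tuple,
                                      state->scanKeys[nkey].sk_attno,
                                      state->tupDesc, &isnull1);
            Datum   d2 = heap_getattr((HeapTuple) b->tuple,
                                      state->scanKeys[nkey].sk_attno,
                                      state->tupDesc, &isnull2);

            compare = compare_datums(state, nkey, d1, isnull1,
                                     d2, isnull2);
            if (compare != 0)
                return compare;
        }
        return 0;
    }

So heap_getattr is reached only when the leading keys compare equal,
instead of on every comparison.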

            regards, tom lane

Re: WIP: further sorting speedup

From
Michael Glaesemann
Date:
On Feb 21, 2006, at 14:24 , Tom Lane wrote:

> Michael Glaesemann <grzm@myrealbox.com> writes:
>> Most of this is way above my head, but I'm trying to follow along:

> Right, it's whatever is the sort key for this particular sort.


Thanks, Tom. I think I may actually be starting to understand this a
bit!

Michael Glaesemann
grzm myrealbox com




Re: WIP: further sorting speedup

From
"Jim C. Nasby"
Date:
On Sun, Feb 19, 2006 at 09:40:46PM -0500, Tom Lane wrote:
> "tape".  This increases the space needed by 8 or 12 bytes (depending on
> sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as

Stupid question... how are varlenas handled in Datum?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WIP: further sorting speedup

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Stupid question... how are varlenas handled in Datum?

The Datum is just a pointer into the original tuple in that case.

            regards, tom lane

Re: WIP: further sorting speedup

From
"Jim C. Nasby"
Date:
On Wed, Feb 22, 2006 at 10:18:48PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Stupid question... how are varlenas handled in Datum?
>
> The Datum is just a pointer into the original tuple in that case.

That would still result in a speedup since you don't have to figure out
where the field begins though, right? I'm curious as to how much this
patch would help with sorting text...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: WIP: further sorting speedup

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Wed, Feb 22, 2006 at 10:18:48PM -0500, Tom Lane wrote:
>> The Datum is just a pointer into the original tuple in that case.

> That would still result in a speedup since you don't have to figure out
> where the field begins though, right? I'm curious as to how much this
> patch would help with sorting text...

Right.  It would help just as much as in the integer case as far as
eliminating the time spent in heap_getattr is concerned.  That would be
a smaller percentage of the whole, because bttextcmp is (a lot) slower
than btint4cmp, but that's no fault of the patch.

            regards, tom lane

Re: WIP: further sorting speedup

From
"Jim C. Nasby"
Date:
On Thu, Feb 23, 2006 at 01:15:32PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Wed, Feb 22, 2006 at 10:18:48PM -0500, Tom Lane wrote:
> >> The Datum is just a pointer into the original tuple in that case.
>
> > That would still result in a speedup since you don't have to figure out
> > where the field begins though, right? I'm curious as to how much this
> > patch would help with sorting text...
>
> Right.  It would help just as much as in the integer case as far as
> eliminating the time spent in heap_getattr is concerned.  That would be
> a smaller percentage of the whole, because bttextcmp is (a lot) slower
> than btint4cmp, but that's no fault of the patch.

Certainly, thanks for the explanation. Of course this does raise the
question of whether it would be worthwhile to cache the first X bytes
of a varlena, but that's a different fish to fry.
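
For what it's worth, a sketch of how that might look (entirely
hypothetical, and only valid where byte order agrees with sort order,
which rules out most non-C locales; compare_full_values stands in for
the fallback to bttextcmp on the real tuples):

    #define PREFIX_LEN 8

    typedef struct SortTuplePfx
    {
        void   *tuple;                  /* the tuple proper */
        char    prefix[PREFIX_LEN];     /* first bytes of the varlena */
        int     prefixlen;              /* how many bytes are valid */
    } SortTuplePfx;

    static int
    compare_with_prefix(const SortTuplePfx *a, const SortTuplePfx *b)
    {
        int     n = (a->prefixlen < b->prefixlen) ?
                    a->prefixlen : b->prefixlen;
        int     cmp = memcmp(a->prefix, b->prefix, n);

        if (cmp != 0)
            return cmp;
        if (a->prefixlen < PREFIX_LEN || b->prefixlen < PREFIX_LEN)
            return a->prefixlen - b->prefixlen; /* a value ended early */
        return compare_full_values(a->tuple, b->tuple); /* true tie */
    }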
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461