SP-GiST micro-optimizations

From: Heikki Linnakangas
I did some performance testing of building an SP-GiST index with the
new range type SP-GiST opclass. There's some low-hanging fruit there: I
was able to reduce the index build time on a simple test case by about
20% with a few small changes.

I created a test table with:

create table range_test AS SELECT int4range(i-10, i + 10) as r from
generate_series(1, 100000) i;

Then I measured how long it takes to build an index on that, on my laptop,
repeating the build a few times and taking the lowest value:

\timing
create index i_r on range_test using spgist (r);

On an unpatched checkout of master, the shortest time was 19.2 seconds.

A profile taken with the 'perf' tool looks like this:

      21,43%  postmaster  postgres           [.] spgdoinsert
      17,25%  postmaster  postgres           [.] range_deserialize
      10,28%  postmaster  postgres           [.] FunctionCall2Coll
       9,68%  postmaster  postgres           [.] spgExtractNodeLabels
       7,99%  postmaster  postgres           [.] spg_range_quad_choose
       7,21%  postmaster  postgres           [.] index_getprocinfo
       5,24%  postmaster  postgres           [.] range_cmp_bounds
       4,74%  postmaster  postgres           [.] AllocSetAlloc
       2,49%  postmaster  postgres           [.] btint4cmp
       2,49%  postmaster  postgres           [.] AllocSetFree
       1,98%  postmaster  postgres           [.] SpGistGetTypeSize
       1,63%  postmaster  postgres           [.] range_get_typcache
       1,62%  postmaster  postgres           [.] MemoryContextAlloc
       1,16%  postmaster  postgres           [.] pg_detoast_datum
       0,87%  postmaster  postgres           [.] PageIndexTupleDelete
       0,65%  postmaster  postgres           [.] pfree
       0,49%  postmaster  postgres           [.] XLogInsert

Drilling into the profile, I came up with three little optimizations:

1. Within spgdoinsert, a significant portion of the CPU time is spent on
line 2033 in spgdoinsert.c:

memset(&out, 0, sizeof(out));

That zeroes out a small struct allocated in the stack. Replacing that
with MemSet() makes it faster, reducing the time spent on zeroing that
struct from 10% to 1.5% of the time spent in spgdoinsert(). That's not
very much in the big scheme of things, but it's a trivial change so
seems worth it.

2. When spgdoinsert descends the tree, it calls index_getprocinfo()
every time it calls the user-defined "choose" function. By calling it
only once at the beginning of the function, the time spent in that
function drops from 7.21% to 0.02%.

3. Most of the AllocSetAlloc/AllocSetFree calls in the profile come from
spgExtractNodeLabels(). It first pallocs an array to hold the node labels,
then iterates through all the nodes in the inner tuple, and if it turns
out that there are no node labels, it pfrees the array and returns NULL.
With this opclass there are never any node labels, so
spgExtractNodeLabels() always performs a pointless palloc+pfree. By
changing the function to first check whether there are any node labels,
and only doing the palloc when necessary, we can eliminate the time spent
in AllocSetAlloc and AllocSetFree, about 7% of the CPU time in total.
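
To make those three changes concrete, here are minimal sketches of each.
These paraphrase the descriptions above rather than quote the patch;
declarations, error checks and surrounding code are abbreviated, and the
local variable names are only illustrative:

For 1, in spgdoinsert():

    spgChooseOut out;

    /* was: memset(&out, 0, sizeof(out)); */
    MemSet(&out, 0, sizeof(out));

For 2, also in spgdoinsert():

    /* look up the opclass "choose" support function once, up front */
    FmgrInfo   *procinfo = index_getprocinfo(index, 1, SPGIST_CHOOSE_PROC);

    for (;;)
    {
        /* ... descend one level, fill the "in" struct ... */

        /* previously, index_getprocinfo() was repeated here at every level */
        FunctionCall2Coll(procinfo,
                          index->rd_indcollation[0],
                          PointerGetDatum(&in),
                          PointerGetDatum(&out));

        /* ... */
    }

For 3, in spgExtractNodeLabels():

    /* don't palloc until we know there are labels at all */
    node = SGITNODEPTR(innerTuple);
    if (IndexTupleHasNulls(node))
        return NULL;            /* no labels: no palloc, no pfree */

    nodeLabels = (Datum *) palloc(sizeof(Datum) * innerTuple->nNodes);
    SGITITERATE(innerTuple, i, node)
        nodeLabels[i] = SGITDATUM(node, state);
    return nodeLabels;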

With those three changes, the profile now looks like this:

      22,57%  postmaster  postgres           [.] range_deserialize
      21,54%  postmaster  postgres           [.] spgdoinsert
      13,37%  postmaster  postgres           [.] FunctionCall2Coll
      11,13%  postmaster  postgres           [.] spg_range_quad_choose
       7,11%  postmaster  postgres           [.] range_cmp_bounds
       6,96%  postmaster  postgres           [.] spgExtractNodeLabels
       3,68%  postmaster  postgres           [.] btint4cmp
       3,05%  postmaster  postgres           [.] pg_detoast_datum
       2,53%  postmaster  postgres           [.] SpGistGetTypeSize
       2,47%  postmaster  postgres           [.] range_get_typcache
       1,22%  postmaster  postgres           [.] PageIndexTupleDelete
       0,66%  postmaster  postgres           [.] XLogInsert

Attached is a patch with those changes. Barring objections, will commit.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: SP-GiST micro-optimizations

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Drilling into the profile, I came up with three little optimizations:

> 1. Within spgdoinsert, a significant portion of the CPU time is spent on 
> line 2033 in spgdoinsert.c:

> memset(&out, 0, sizeof(out));

> That zeroes out a small struct allocated in the stack. Replacing that 
> with MemSet() makes it faster, reducing the time spent on zeroing that 
> struct from 10% to 1.5% of the time spent in spgdoinsert(). That's not 
> very much in the big scheme of things, but it's a trivial change so 
> seems worth it.

Fascinating.  I'd been of the opinion that modern compilers would inline
memset() for themselves and MemSet was probably not better than what the
compiler could do these days.  What platform are you testing on?

The other two changes seem reasonable.
        regards, tom lane



Re: SP-GiST micro-optimizations

From: Heikki Linnakangas
On 28.08.2012 20:30, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Drilling into the profile, I came up with three little optimizations:
>
>> 1. Within spgdoinsert, a significant portion of the CPU time is spent on
>> line 2033 in spgdoinsert.c:
>
>> memset(&out, 0, sizeof(out));
>
>> That zeroes out a small struct allocated in the stack. Replacing that
>> with MemSet() makes it faster, reducing the time spent on zeroing that
>> struct from 10% to 1.5% of the time spent in spgdoinsert(). That's not
>> very much in the big scheme of things, but it's a trivial change so
>> seems worth it.
>
> Fascinating.  I'd been of the opinion that modern compilers would inline
> memset() for themselves and MemSet was probably not better than what the
> compiler could do these days.  What platform are you testing on?

x64, gcc 4.7.1, running Debian.

The assembly generated for the MemSet is:
.loc 1 2033 0 discriminator 3
        movq    $0, -432(%rbp)
.LVL166:
        movq    $0, -424(%rbp)
.LVL167:
        movq    $0, -416(%rbp)
.LVL168:
        movq    $0, -408(%rbp)
.LVL169:
        movq    $0, -400(%rbp)
.LVL170:
        movq    $0, -392(%rbp)

while the corresponding memset code is:
.loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
.loc 1 2042 0 discriminator 6
        cmpb    $0, -669(%rbp)
.loc 1 2040 0 discriminator 6
        movq    -584(%rbp), %rdi
        movl    $6, %ecx
        rep stosq

In fact, with -mstringop-strategy=unrolled_loop, I can coerce gcc into
producing code similar to the MemSet version:
        movq    %rax, -440(%rbp)
.loc 1 2040 0 discriminator 6
        xorl    %eax, %eax
.L254:
        movl    %eax, %edx
        addl    $32, %eax
        cmpl    $32, %eax
        movq    $0, -432(%rbp,%rdx)
        movq    $0, -424(%rbp,%rdx)
        movq    $0, -416(%rbp,%rdx)
        movq    $0, -408(%rbp,%rdx)
        jb      .L254
        leaq    -432(%rbp), %r9
        addq    %r9, %rax
.loc 1 2042 0 discriminator 6
        cmpb    $0, -665(%rbp)
.loc 1 2040 0 discriminator 6
        movq    $0, (%rax)
        movq    $0, 8(%rax)

I'm not sure why gcc doesn't choose that by default. Perhaps it's CPU 
specific which variant is faster - I was quite surprised that MemSet was 
such a clear win on my laptop. Or maybe it's a speed-space tradeoff, and 
gcc chooses the more compact version, although using -O3 instead of -O2 
made no difference.
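
For context, MemSet isn't a call into libc at all: it's a macro in c.h
which, when the length is a small compile-time constant, the area is
suitably aligned and the fill value is zero, zeroes the area with
long-sized stores in a loop the compiler can fully unroll - that's the
string of movq's above. Otherwise it falls back to plain memset(). Boiled
down to the core idea, with the guard conditions left out, it behaves like
this simplified sketch (MemSetZeroSketch is just an illustrative name, not
the real macro):

#define MemSetZeroSketch(start, len) \
    do { \
        long   *_p = (long *) (start); \
        long   *_stop = (long *) ((char *) (start) + (len)); \
        while (_p < _stop) \
            *_p++ = 0; \
    } while (0)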

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com



Re: SP-GiST micro-optimizations

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 28.08.2012 20:30, Tom Lane wrote:
>> Fascinating.  I'd been of the opinion that modern compilers would inline
>> memset() for themselves and MemSet was probably not better than what the
>> compiler could do these days.  What platform are you testing on?

> x64, gcc 4.7.1, running Debian.

> The assembly generated for the MemSet is:
> [ pretty darn tight ]
> while the corresponding memset code is:
> [ not so good ]

Seems like that's down to the CPU not doing "rep stosq" particularly
quickly, which might well be chip-specific.

Anyway, IIRC there are similar memsets for all the SPGiST opclass
invocation calls, so I guess you should switch them all, not just these
two.
        regards, tom lane



Re: SP-GiST micro-optimizations

From: Ants Aasma
On Tue, Aug 28, 2012 at 9:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems like that's down to the CPU not doing "rep stosq" particularly
> quickly, which might well be chip-specific.

AMD optimization manual[1] states the following:
   For repeat counts of less than 4k, expand REP string instructions
   into equivalent sequences of simple AMD64 instructions.

Intel optimization manual[2] doesn't provide equivalent guidelines,
but the graph associated with string instructions states about 30
cycles of startup latency. The mov based code on the other hand
executes in 6 cycles and can easily overlap with other non-store
instructions.

[1] http://support.amd.com/us/Processor_TechDocs/25112.PDF
[2] http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: SP-GiST micro-optimizations

From: Heikki Linnakangas
On 28.08.2012 22:50, Ants Aasma wrote:
> On Tue, Aug 28, 2012 at 9:42 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Seems like that's down to the CPU not doing "rep stosq" particularly
>> quickly, which might well be chip-specific.
>
> AMD optimization manual[1] states the following:
>
>      For repeat counts of less than 4k, expand REP string instructions
>      into equivalent sequences of simple AMD64 instructions.
>
> Intel optimization manual[2] doesn't provide equivalent guidelines,
> but the graph associated with string instructions states about 30
> cycles of startup latency. The mov based code on the other hand
> executes in 6 cycles and can easily overlap with other non-store
> instructions.
>
> [1] http://support.amd.com/us/Processor_TechDocs/25112.PDF
> [2] http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf

Hmm, sounds like gcc just isn't doing a very good job here, then. I also
tried replacing the memset with variable initialization: "spgChooseOut
out = { 0 }" (moving the declaration to where the memset() call was). In
that case, gcc produced the same (fast) sequence of movq's that I got
with -mstringop-strategy=unrolled_loop.
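
That is, roughly (a sketch; the inner block is just to show that the
initializer now runs each time through the loop, like the memset() did):

    {
        spgChooseOut out = { 0 };   /* zero-initialized at its declaration */

        /* ... fill "in" and call the choose function as before ... */
    }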

Out of curiosity, I also tried this on clang. It produced this, 
regardless of whether I used MemSet or memset or variable initializer:
        pxor    %xmm0, %xmm0
.loc    1 2040 4                # spgdoinsert.c:2040:4
        movaps  %xmm0, -1280(%rbp)
        movaps  %xmm0, -1296(%rbp)
        movaps  %xmm0, -1312(%rbp)

So it's using movaps to clear the struct in 16-byte chunks. perf annotate
shows that that's comparable in speed to the code gcc produces for MemSet.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com