Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date
Msg-id 22170.1248108347@sss.pgh.pa.us
Whole thread Raw
In response to [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex  (Jeremy Kerr <jk@ozlabs.org>)
Responses Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
List pgsql-hackers
Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop.  And there's a problem: on
x86_64 it is not much of a win.  The code sequence that gcc generates
for __builtin_clz is basically
bsrl    %eax, %eaxxorl    $31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction.  It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32).  It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop.  That is, forget about the builtin and do something like
       while (size >= 8)       {           idx += 4;           size >>= 4;       }       while (size)       {
idx++;          size >>= 1;       }
 

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: hot standby - merged up to CVS HEAD
Next
From: Tom Lane
Date:
Subject: Re: WIP: Deferrable unique constraints