Home > mailing lists

Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex - Mailing list pgsql-hackers

From	Tom Lane
Subject	Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Date	July 20, 2009 13:46:20
Msg-id	22170.1248108347@sss.pgh.pa.us Whole thread Raw
In response to	[PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex (Jeremy Kerr <jk@ozlabs.org>)
Responses	Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
List	pgsql-hackers

Tree view

Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop.  And there's a problem: on
x86_64 it is not much of a win.  The code sequence that gcc generates
for __builtin_clz is basically
bsrl    %eax, %eaxxorl    $31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction.  It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32).  It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop.  That is, forget about the builtin and do something like
       while (size >= 8)       {           idx += 4;           size >>= 4;       }       while (size)       {
idx++;          size >>= 1;       }

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.
        regards, tom lane

pgsql-hackers by date:

From: "Joshua D. Drake"
Date: 20 July 2009, 13:44:05
Subject: Re: hot standby - merged up to CVS HEAD

From: Tom Lane
Date: 20 July 2009, 14:02:38
Subject: Re: WIP: Deferrable unique constraints

Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex - Mailing list pgsql-hackers

Previous

Next