Thread: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

[PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Jeremy Kerr
Date:
Move the shift-and-test logic into a separate fls() function, which
can use __builtin_clz() if it's available.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on PowerPC.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
v3: respin as context diff

---
 configure.in                  |   13 +++++++++++++
 src/backend/utils/mmgr/aset.c |   34 +++++++++++++++++++++++++++-------
 2 files changed, 40 insertions(+), 7 deletions(-)
 

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
          AC_FUNC_FSEEKO;;
  esac
 
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+         [__builtin_clz(0);],
+         [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+                 [Define to 1 if you have __builtin_clz().])
+             AC_MSG_RESULT(yes)],
+         [AC_MSG_RESULT(no)])
+ 
  #
  # Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 255,260 **** static MemoryContextMethods AllocSetMethods = {
--- 255,285 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
 
+ /*
+  * fls: find last set bit.
+  *
+  * Returns the 1-based index of the most-significant bit in x. The MSB
+  * is bit number 32, the LSB is bit number 1. If x is zero, the result is
+  * undefined.
+  */
+ static inline int
+ fls(unsigned int x)
+ {
+ #ifdef HAVE_BUILTIN_CLZ
+     return 32 - __builtin_clz(x);
+ #else
+     int ls = 0;
+ 
+     while (x != 0)
+     {
+         ls++;
+         x >>= 1;
+     }
+ 
+     return ls;
+ #endif
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
      int            idx = 0;
 
!     if (size > 0)
      {
!         size = (size - 1) >> ALLOC_MINBITS;
!         while (size != 0)
!         {
!             idx++;
!             size >>= 1;
!         }
          Assert(idx < ALLOCSET_NUM_FREELISTS);
      }
 
--- 293,301 ----
  {
      int            idx = 0;
 
!     if (size > (1 << ALLOC_MINBITS))
      {
!         idx = fls((size - 1) >> ALLOC_MINBITS);
          Assert(idx < ALLOCSET_NUM_FREELISTS);
      }
 


Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Robert Haas
Date:
On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr<jk@ozlabs.org> wrote:
> Move the shift-and-test logic into a separate fls() function, which
> can use __builtin_clz() if it's available.
>
> This requires a new check for __builtin_clz in the configure script.
>
> Results in a ~2% performance increase on PowerPC.

There are some comments on this patch that have been posted to
commitfest.postgresql.org rather than emailed to -hackers.  Note to
all: please don't do this.  The purpose of commitfest.postgresql.org
is to summarize the mailing list discussion for easy reference, not to
replace the mailing lists.

That having been said, Jeremy, you probably want to take a look at
those comments and I have a few responses to them as well.

Dan Colish, the round-robin reviewer assigned to this patch, added the
following comment:
> Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement
> can be seen.

Question: are we only doing this because of PowerPC?  What is going to
happen on x86 and other architectures?  x86 is not the center of the
universe, but at a very minimum we need to make sure that things don't
go backwards on what is a very common platform.  Has anyone done any
benchmarking of this?

A related question: the original email for this patch says that it
results in a performance increase of about 2% on PPC.  But since it
gives no details on exactly what the test was that improved by 2%,
it's hard to judge the impact of this.  If this means that
AllocSetFreeIndex() is 2% faster, I think we should reject this patch
and move on.  It's not worth introducing architecture-specific code to
get a performance improvement that probably won't even move the needle
on a macro-benchmark.  On the other hand, if the test was something
like a pgbench run, then this is really very impressive.  But we don't
know which it is.

Zdenek Kotala added this comment:
> I have several objections:
>
> - inline is forbidden to use in PostgreSQL - you need exception or do it differently
>
> - type mismatch - Size vs unsigned int vs 32. you should use only Size and sizeof(Size)
>
> And general comment:
>
> Do we want to have this kind of code which is optimized for one compiler and platform. See openSSL or
> MySQL, they have many optimizations which work fine on one platform with specific version of compiler and
> specific version of OS. But if you select different version of compiler or different compiler you can get very
> different performance result (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster and so on).
>
> I think in this case, it makes sense to have optimization here, but we should do it like spinlocks are
> implemented and put this code into /port. It should help to implemented special code for SPARC and SUN
> Studio for example.

I don't have any thoughts on this part beyond what I stated above, but
hopefully Jeremy does.

I am going to mark this "Waiting on author" pending a response to all
of the above, though hopefully Dan Colish will continue with his
reviewing efforts in the meantime.

Thanks,

...Robert


Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Stefan Kaltenbrunner
Date:
Robert Haas wrote:
> On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr<jk@ozlabs.org> wrote:
>> Move the shift-and-test logic into a separate fls() function, which
>> can use __builtin_clz() if it's available.
>>
>> This requires a new check for __builtin_clz in the configure script.
>>
>> Results in a ~2% performance increase on PowerPC.
> 
> There are some comments on this patch that have been posted to
> commitfest.postgresql.org rather than emailed to -hackers.  Note to
> all: please don't do this.  The purpose of commitfest.postgresql.org
> is to summarize the mailing list discussion for easy reference, not to
> replace the mailing lists.
> 
> That having been said, Jeremy, you probably want to take a look at
> those comments and I have a few responses to them as well.
> 
> Dan Colish, the round-robin reviewer assigned to this patch, added the
> following comment:
>> Applied and built cleanly. Regress passes. Trying to hunt down ppc box to see if performance enhancement
>> can be seen.
> 
> Question: are we only doing this because of PowerPC?  What is going to
> happen on x86 and other architectures?  x86 is not the center of the
> universe, but at a very minimum we need to make sure that things don't
> go backwards on what is a very common platform.  Has anyone done any
> benchmarking of this?
> 
> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC.  But since it
> gives no details on exactly what the test was that improved by 2%,
> it's hard to judge the impact of this.  If this means that
> AllocSetFreeIndex() is 2% faster, I think we should reject this patch
> and move on.  It's not worth introducing architecture-specific code to
> get a performance improvement that probably won't even move the needle
> on a macro-benchmark.  On the other hand, if the test was something
> like a pgbench run, then this is really very impressive.  But we don't
> know which it is.

There are numbers in the original thread that led to this patch:
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00159.php

Stefan


Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Jeremy Kerr
Date:
Hi Robert,

> That having been said, Jeremy, you probably want to take a look at
> those comments and I have a few responses to them as well.

OK, thanks for the heads-up.

> following comment:
> > Applied and built cleanly. Regress passes. Trying to hunt down ppc
> > box to see if performance enhancement can be seen.
>
> Question: are we only doing this because of PowerPC?

No, this patch will benefit any architecture that has a gcc 
implementation of __builtin_clz. I know that both x86 and powerpc gcc 
support this.

> What is going to happen on x86 and other architectures?  x86 is not
> the center of the universe, but at a very minimum we need to make sure
> that things don't go backwards on what is a very common platform.  Has
> anyone done any benchmarking of this?

Yes, Atsushi Ogawa did some benchmarking with this patch on x86, his 
numbers are here:
http://archives.postgresql.org/message-id/4A2895C4.9050108@hi-ho.ne.jp

In fact, Atsushi originally submitted a patch using inline asm (using 
"bsr") to do this on x86. Coincidentally, I was working on a powerpc 
implementation (using "cntlz") at the same time, so I submitted a patch 
using the gcc builtin that would work on both (and possibly other) 
architectures.
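
For illustration, the inline-asm approach looks roughly like this (a
hypothetical sketch under my assumptions, not the actual submitted patch;
the helper name is made up):

/*
 * Hypothetical sketch of the x86 inline-asm approach (not the actual
 * submitted patch).  "bsr" yields the 0-based index of the highest set
 * bit; the result is undefined for x == 0, so the caller must check.
 */
static inline int
msb_index_bsr(unsigned int x)
{
    unsigned int r;

    __asm__("bsrl %1, %0" : "=r" (r) : "rm" (x));
    return (int) r;
}

The gcc builtin expresses the same operation portably and lets the
compiler pick the native instruction (bsr on x86, cntlz on powerpc).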

> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC.  But since it
> gives no details on exactly what the test was that improved by 2%,
> it's hard to judge the impact of this.  If this means that
> AllocSetFreeIndex() is 2% faster, I think we should reject this patch
> and move on.  It's not worth introducing architecture-specific code
> to get a performance improvement that probably won't even move the
> needle on a macro-benchmark.  On the other hand, if the test was
> something like a pgbench run, then this is really very impressive. 
> But we don't know which it is.

The 2% improvement I saw is for a sysbench OLTP run. I'm happy to re-do 
the run and report specific numbers if you need them.

> Zdenek Kotala added this comment:
> > I have several objections:
> >
> > - inline is forbidden to use in PostgreSQL - you need exception or
> > do it differently
> >
> > - type mismatch - Size vs unsigned int vs 32. you should use only
> > Size and sizeof(Size)

OK, happy to make these changes. What's the commitfest process here, 
should I redo the patch and re-send to -hackers?

(On inline again: should I just make this a static function and let the 
compiler inline it where possible? Or do you want a macro?)

> > And general comment:
> >
> > Do we want to have this kind of code which is optimized for one
> > compiler and platform.

One compiler, multiple platforms.

> > See openSSL or MySQL, they have many
> > optimizations which work fine on one platform with specific version
> > of compiler and specific version of OS. But if you select different
> > version of compiler or different compiler you can get very
> > different performance result (e.g. MySQL 30% degradation, OpenSSL
> > RSA 3x faster and so on).

I don't think we're going to see a lot of variation here (besides the 
difference where __builtin_clz isn't available). Given that this is 
implemented with a couple of instructions, I'm confident that we won't 
see any degradation for platforms that support __builtin_clz. For the 
other cases, we just use the existing while-loop algorithm so there 
should be no change (unless we mess up the inlining...).

> > I think in this case, it makes sense to have optimization here, but
> > we should do it like spinlocks are implemented and put this code
> > into /port.

Unless I'm missing something, spinlocks are done the same way - we have 
one file, src/include/storage/s_lock.h, which is mostly different 
implementations of the lock primitives for different platforms, 
separated by preprocessor tests.

Essentially, this is just one line of code, protected by 
HAVE_BUILTIN_CLZ (which is a feature-test, rather than a specific 
platform or compiler test), and is only used in one place. I don't think 
that warrants a separate file.

Cheers,


Jeremy


Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Jeremy Kerr <jk@ozlabs.org> writes:
>>> - inline is forbidden to use in PostgreSQL - you need exception or
>>> do it differently

> (inline again: should I just make this a static, the compiler can inline 
> where possible? or do you want a macro?)

I don't know where Zdenek got the idea that we have something against
"inline".

So far as I can see, recent versions of gcc claim to support
__builtin_clz on all supported architectures.  On some it might be no
faster than our existing loop, but it seems unlikely to be slower.

The two comments I have are

* do something other than the hardwired "32" for word size; perhaps
sizeof(int) * BITS_PER_BYTE.

* do not use the separate "fls" function.  On a compiler that fails to
inline it, this patch would be a net performance loss, which we're not
likely to tolerate for a patch that has no other reason to live than
performance.  Just #if the builtin right into the one place where it
will be used.
        regards, tom lane


[PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Jeremy Kerr
Date:
Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on sysbench on PowerPC.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
v4: don't use separate fls() function, remove hardcoded 32

---
 configure.in                  |   13 +++++++++++++
 src/backend/utils/mmgr/aset.c |   14 +++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)
 

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
          AC_FUNC_FSEEKO;;
  esac
 
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+         [__builtin_clz(0);],
+         [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+                 [Define to 1 if you have __builtin_clz().])
+             AC_MSG_RESULT(yes)],
+         [AC_MSG_RESULT(no)])
+ 
  #
  # Pthreads
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
      int            idx = 0;
 
!     if (size > 0)
      {
          size = (size - 1) >> ALLOC_MINBITS;
          while (size != 0)
          {
              idx++;
              size >>= 1;
          }
          Assert(idx < ALLOCSET_NUM_FREELISTS);
      }
 
--- 268,293 ----
  {
      int            idx = 0;
 
!     if (size > (1 << ALLOC_MINBITS))
      {
          size = (size - 1) >> ALLOC_MINBITS;
+ 
+ #ifdef HAVE_BUILTIN_CLZ
+         /* If possible, use __builtin_clz to calculate the chunk index - this
+          * should be O(1) rather than O(n). The builtin takes an unsigned int,
+          * so we need to cast from the possibly 64-bit Size type. This cast
+          * won't overflow, as the limit is at 2^(32 + ALLOC_MINBITS) bytes,
+          * which is larger than ALLOC_CHUNK_LIMIT.
+          */
+         idx = sizeof(unsigned int) * BITS_PER_BYTE -
+                 __builtin_clz((unsigned int)size);
+ #else
          while (size != 0)
          {
              idx++;
              size >>= 1;
          }
+ #endif
          Assert(idx < ALLOCSET_NUM_FREELISTS);
      }
 


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop.  And there's a problem: on
x86_64 it is not much of a win.  The code sequence that gcc generates
for __builtin_clz is basically
    bsrl    %eax, %eax
    xorl    $31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction.  It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook
is that the __builtin_clz code is actually *slower* than the naive loop
for the fewest-iterations case (request size up to 16 bytes) and about
a breakeven at the next size category (up to 32).  It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be
no gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop.  That is, forget about the builtin and do something like
       while (size >= 8)
       {
           idx += 4;
           size >>= 4;
       }
       while (size)
       {
           idx++;
           size >>= 1;
       }
 

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.

I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.
        regards, tom lane


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Jeremy Kerr <jk@ozlabs.org> writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex to
>> use the __builtin_clz() function to calculate the chunk index.
> 
>> This requires a new check for __builtin_clz in the configure script.
> 
>> Results in a ~2% performance increase on sysbench on PowerPC.
> 
> I did some performance testing on this by extracting the
> AllocSetFreeIndex function into a standalone test program that just
> executed it a lot of times in a loop.  And there's a problem: on
> x86_64 it is not much of a win.  The code sequence that gcc generates
> for __builtin_clz is basically
> 
>     bsrl    %eax, %eax
>     xorl    $31, %eax
> 
> and it turns out that Intel hasn't seen fit to put a lot of effort into
> the BSR instruction.  It's constant time, all right, but on most of
> their CPUs that constant time is like 8 or 16 times slower than an ADD;
> cf http://www.intel.com/Assets/PDF/manual/248966.pdf


hmm interesting - I don't have the exact numbers any more but that 
patch (or a previous version of it) definitely showed a noticeable 
improvement when I tested with sysbench on a current generation Intel 
Nehalem...


Stefan


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction.  It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm interesting - I don't have the exact numbers any more but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current generation Intel
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine.  On 32 I got
these totals:

0    5190113
1    5663980
2    3573261
3    4476398
4    4246178
5    1100634
6    386501
7    601654
8    44884
9    52372
10    202801

and on 64 these:

0    2139534
1    5994692
2    5711479
3    3289034
4    4550931
5    2573389
6    487566
7    588470
8    155148
9    52750
10    202597

If you want to do the same in some other workload, feel free.  I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3).  (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime.  It's better on an
old PPC Mac; saves about 50%.  Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.

            regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig    Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c    Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
                                    on_proc_exit_list[on_proc_exit_index].arg);

      on_proc_exit_index = 0;
+
+     AllocSetPrintStats();
  }

  /* ------------------
*** src/backend/utils/mmgr/aset.c.orig    Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c    Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif

+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+
+ void
+ AllocSetPrintStats()
+ {
+     int i;
+
+     for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+         fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
              size >>= 1;
          }
          Assert(idx < ALLOCSET_NUM_FREELISTS);
+         allocsizes[idx]++;
      }

      return idx;
/*
 * usage: time ./testit N
 *    N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS        3    /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS    11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
2139534,
5994692,
5711479,
3289034,
4550931,
2573389,
487566,
588470,
155148,
52750,
202597
#else
5190113,
5663980,
3573261,
4476398,
4246178,
1100634,
386501,
601654,
44884,
52372,
202801
#endif
};


extern int AllocSetFreeIndex(size_t size);


int
main(int argc, char **argv)
{
    int        repeat = atoi(argv[1]);

    while (repeat-- > 0)
    {
        int        k;

        for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
        {
            int        n = numoccurs[k];
            size_t    siz = (1 << (ALLOC_MINBITS + k));

            while (n-- > 0)
            {
                AllocSetFreeIndex(siz);
            }
        }
    }

    return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS        3    /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS    11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *        Depending on the size of an allocation compute which freechunk
 *        list of the alloc set it belongs to.  Caller must have verified
 *        that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
    int            idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;

#ifdef HAVE_BUILTIN_CLZ
        idx = sizeof(unsigned int) * BITS_PER_BYTE -
            __builtin_clz((unsigned int)size);
#else
        do
        {
            idx++;
            size >>= 1;
        }
        while (size != 0);
#endif
    }

    return idx;
}

Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature.  However, some quick testing
> didn't find an unrolling that helps much.

Hmm, actually this seems to work ok:
       idx++;
       size >>= 1;
       if (size != 0)
       {
           idx++;
           size >>= 1;
           if (size != 0)
           {
               idx++;
               size >>= 1;
               if (size != 0)
               {
                   idx++;
                   size >>= 1;
                   while (size != 0)
                   {
                       idx++;
                       size >>= 1;
                   }
               }
           }
       }

(this is with the initial "if (size > (1 << ALLOC_MINBITS))" so that
we know the starting value is nonzero)

This seems to be about a wash or a small gain on x86_64, but on my
PPC Mac laptop it's very nearly identical in speed to the __builtin_clz
code.  I also see a speedup on HPPA, for which my gcc is too old to
know about __builtin_clz.

Anyone want to see if they can beat that?  Some testing on other
architectures would help too.
        regards, tom lane


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Greg Stark
Date:
On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Anyone want to see if they can beat that?  Some testing on other
> architectures would help too.

Hm, I took the three implementations so far (normal, unrolled, and
clz) as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
and got some very strange results:


normal:       1.494s
clz:          2.214s
unrolled:     2.966s
lookup table: 0.001s
float hack:  11.930s


I can't see why the unrolled implementation is slower than the
non-unrolled so I'm suspecting something's wrong with my #ifdefs but I
don't see it.

I do think the code I grabbed from the stanford page might be
off-by-one for our purposes but I haven't looked closely at that.
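
For reference, the lookup-table version on that page computes an integer
log2 along these lines (a sketch; the exact code I benchmarked may differ
slightly):

/*
 * Sketch of the bithacks-style lookup-table log2 (the benchmarked code
 * may differ).  Returns the 0-based index of the highest set bit of v;
 * the result for v == 0 is not meaningful here.
 */
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

static const char LogTable256[256] =
{
    -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
    LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
    LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};

static int
log2_lookup(unsigned int v)
{
    unsigned int t, tt;        /* temporaries */

    if ((tt = v >> 16))
        return (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
    else
        return (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
}

Note it returns a 0-based bit index, whereas AllocSetFreeIndex wants the
fls()-style 1-based count, which is presumably the off-by-one just
mentioned.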

I also wonder if this microbenchmark is actually ok because it's
testing the same value over and over so any branch-prediction will
shine unrealistically well.



--
greg
http://mit.edu/~gsstark/resume.pdf

Attachment

Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Greg Stark
Date:
Doh, I forgot this bit. Will rerun tests now.

On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>  (Note: I found
> out that it's necessary to split the test into two files --- otherwise
> gcc will inline AllocSetFreeIndex and partially const-fold the work,
> leading to skewed results.)




--
greg
http://mit.edu/~gsstark/resume.pdf


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.

Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions.  We should probably revise the test
harness so it mixes the size requests a bit.  I'm not sure of a suitably
simple way to do that, though.
        regards, tom lane


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Greg Stark
Date:
On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit the unrolled loop
> more than the other versions.  We should probably revise the test
> harness so it mixes the size requests a bit.  I'm not sure of a suitably
> simple way to do that, though.

Well it was a bit of a pain but I filled an array with (1/1000 scaled
down) values and then permuted them. I also went ahead and set the
low-order bits to random values since the lookup table based algorithm
might be affected by it.
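
A minimal sketch of that setup (hypothetical, not my exact harness code;
it assumes the numoccurs[] table, ALLOC_MINBITS and
ALLOCSET_NUM_FREELISTS from the earlier harness, plus <stdlib.h>):

/*
 * Hypothetical sketch: build a shuffled request array.  Each category k
 * gets numoccurs[k]/1000 entries with randomized low-order bits, and the
 * whole array is then permuted so consecutive calls hit mixed categories.
 */
static size_t *requests;
static int     nrequests;

static void
build_requests(void)
{
    int     i, k;

    for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
        nrequests += numoccurs[k] / 1000;
    requests = malloc(nrequests * sizeof(size_t));

    i = 0;
    for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
    {
        /* for k >= 1, sizes in (2^(ALLOC_MINBITS+k-1), 2^(ALLOC_MINBITS+k)]
         * map to index k; category 0 covers 1 .. 2^ALLOC_MINBITS */
        size_t  lo = (k == 0) ? 1 : ((size_t) 1 << (ALLOC_MINBITS + k - 1)) + 1;
        size_t  hi = (size_t) 1 << (ALLOC_MINBITS + k);
        int     j;

        for (j = 0; j < numoccurs[k] / 1000; j++)
            requests[i++] = lo + (size_t) rand() % (hi - lo + 1);
    }

    /* Fisher-Yates shuffle, so branch prediction can't learn the pattern */
    for (i = nrequests - 1; i > 0; i--)
    {
        int     j = rand() % (i + 1);
        size_t  tmp = requests[i];

        requests[i] = requests[j];
        requests[j] = tmp;
    }
}

The timing loop then simply walks the requests[] array, calling
AllocSetFreeIndex on each element.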

The results are a bit disappointing on my machine, only the CLZ and
lookup table come out significantly ahead:

$ ./testit 10
   0:      620
   1:     4949
   2:     5378
   3:     3560
   4:     4426
   5:     4218
   6:     1098
   7:      387
   8:      599
   9:       44
  10:       52
                 clz 1.530s
        lookup table 1.720s
          float hack 4.424s
            unrolled 5.280s
              normal 5.369s



--
greg
http://mit.edu/~gsstark/resume.pdf

Attachment

Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.

> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:

>                  clz 1.530s
>         lookup table 1.720s
>           float hack 4.424s
>             unrolled 5.280s
>               normal 5.369s

It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code.  I've added that variant
and get these results on my machines:

x86_64 (Xeon):
                 clz 15.357s
        lookup table 16.582s
  small lookup table 16.705s
          float hack 25.138s
            unrolled 64.630s
              normal 79.025s

PPC:
                 clz 3.842s
        lookup table 7.298s
  small lookup table 8.799s
          float hack 19.418s
            unrolled 7.656s
              normal 8.949s

HPPA:
                 clz (n/a)
        lookup table 11.515s
  small lookup table 10.803s
          float hack 16.502s
            unrolled 17.632s
              normal 19.754s

Not sure why the "small lookup table" variant actually seems slower
than the original on two of these machines; it can hardly be slower in
reality since it's strictly less code.  Maybe some weird code-alignment
issue?

It also seems weird that the x86_64 is now showing a much bigger gap
between clz and "normal" than before.  I don't see how branch prediction
would do much for the "normal" code.
        regards, tom lane


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Jeremy Kerr
Date:
Hi Greg,

Thanks for the benchmark app, thought I'd pitch in with some ppc 
results:

>                  clz 1.530s
>         lookup table 1.720s
>           float hack 4.424s
>             unrolled 5.280s
>               normal 5.369s

POWER5+:
                 clz 2.046s
        lookup table 2.188s
          float hack 7.786s
            unrolled 6.353s
              normal 7.033s

POWER6:
                 clz 1.063s
        lookup table 1.199s
          float hack 3.575s
            unrolled 2.290s
              normal 3.456s

Cheers,


Jeremy



Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Jeremy Kerr <jk@ozlabs.org> writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc 
> results:

It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence.  The "float hack" is right out ;-)

I wonder whether it is worth fooling with alternatives for the data type
of the lookup table entries.  unsigned char might be a tad faster
(avoid useless sign-extension work).  int might be faster yet, but it
would enlarge the table, creating a distributed overhead that the
microbenchmark would completely fail to show.  Or we could buy that
back by reducing the table to cover only 6 bits, knowing that 12 is
more than we need.
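
To make that concrete, here is a minimal sketch of what a lookup-table
AllocSetFreeIndex might look like, assuming (size - 1) >> ALLOC_MINBITS
always fits in 16 bits and using unsigned char entries as discussed
(an illustration only, not the committed patch):

/*
 * Illustrative sketch, not the committed code.  LogTable256[x] holds the
 * 1-based index of the highest set bit of x (fls), with LogTable256[0] == 0.
 */
#define LT16(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

static const unsigned char LogTable256[256] =
{
    0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
    LT16(5), LT16(6), LT16(6), LT16(7), LT16(7), LT16(7), LT16(7),
    LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
};

static inline int
AllocSetFreeIndex(Size size)
{
    int         idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        unsigned int t = (size - 1) >> ALLOC_MINBITS;

        /* t is assumed < 2^16 because size <= ALLOC_CHUNK_LIMIT */
        idx = (t >> 8) ? LogTable256[t >> 8] + 8 : LogTable256[t];
        Assert(idx < ALLOCSET_NUM_FREELISTS);
    }

    return idx;
}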
        regards, tom lane


Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

From
Tom Lane
Date:
Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

Per subsequent discussion and testing, I've committed a patch that
uses a lookup table instead of depending on __builtin_clz().
http://archives.postgresql.org/message-id/20090721195312.207CA75331E@cvs.postgresql.org
        regards, tom lane