Thread: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
Move the shift-and-test logic into a separate fls() function, which can
use __builtin_clz() if it's available.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on PowerPC.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
v3: respin as context diff

---
 configure.in                  | 13 +++++++++++++
 src/backend/utils/mmgr/aset.c | 34 +++++++++++++++++++++++++++-------
 2 files changed, 40 insertions(+), 7 deletions(-)

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
      AC_FUNC_FSEEKO;;
  esac
  
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+   [__builtin_clz(0);],
+   [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+              [Define to 1 if you have __builtin_clz().])
+    AC_MSG_RESULT(yes)],
+   [AC_MSG_RESULT(no)])
+ 
  #
  # Pthreads

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 255,260 **** static MemoryContextMethods AllocSetMethods = {
--- 255,285 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ /*
+  * fls: find last set bit.
+  *
+  * Returns the 1-based index of the most-significant bit in x.  The MSB
+  * is bit number 32, the LSB is bit number 1.  If x is zero, the result
+  * is undefined.
+  */
+ static inline int
+ fls(unsigned int x)
+ {
+ #ifdef HAVE_BUILTIN_CLZ
+ 	return 32 - __builtin_clz(x);
+ #else
+ 	int ls = 0;
+ 
+ 	while (x != 0)
+ 	{
+ 		ls++;
+ 		x >>= 1;
+ 	}
+ 
+ 	return ls;
+ #endif
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
  	int			idx = 0;
  
! 	if (size > 0)
  	{
! 		size = (size - 1) >> ALLOC_MINBITS;
! 		while (size != 0)
! 		{
! 			idx++;
! 			size >>= 1;
! 		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
--- 293,301 ----
  {
  	int			idx = 0;
  
! 	if (size > (1 << ALLOC_MINBITS))
  	{
! 		idx = fls((size - 1) >> ALLOC_MINBITS);
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr <jk@ozlabs.org> wrote:
> Move the shift-and-test logic into a separate fls() function, which
> can use __builtin_clz() if it's available.
>
> This requires a new check for __builtin_clz in the configure script.
>
> Results in a ~2% performance increase on PowerPC.

There are some comments on this patch that have been posted to
commitfest.postgresql.org rather than emailed to -hackers.  Note to all:
please don't do this.  The purpose of commitfest.postgresql.org is to
summarize the mailing list discussion for easy reference, not to replace
the mailing lists.

That having been said, Jeremy, you probably want to take a look at those
comments, and I have a few responses to them as well.

Dan Colish, the round-robin reviewer assigned to this patch, added the
following comment:

> Applied and built cleanly. Regress passes. Trying to hunt down ppc box
> to see if performance enhancement can be seen.

Question: are we only doing this because of PowerPC?  What is going to
happen on x86 and other architectures?  x86 is not the center of the
universe, but at a very minimum we need to make sure that things don't
go backwards on what is a very common platform.  Has anyone done any
benchmarking of this?

A related question: the original email for this patch says that it
results in a performance increase of about 2% on PPC.  But since it
gives no details on exactly what the test was that improved by 2%, it's
hard to judge the impact of this.  If this means that AllocSetFreeIndex()
is 2% faster, I think we should reject this patch and move on.  It's not
worth introducing architecture-specific code to get a performance
improvement that probably won't even move the needle on a
macro-benchmark.  On the other hand, if the test was something like a
pgbench run, then this is really very impressive.  But we don't know
which it is.

Zdenek Kotala added this comment:

> I have several objections:
>
> - inline is forbidden to use in PostgreSQL - you need exception or do
>   it differently
>
> - type mismatch - Size vs unsigned int vs 32. you should use only Size
>   and sizeof(Size)
>
> And general comment:
>
> Do we want to have this kind of code which is optimized for one
> compiler and platform. See openSSL or MySQL, they have many
> optimizations which work fine on one platform with specific version of
> compiler and specific version of OS. But if you select different
> version of compiler or different compiler you can get very different
> performance result (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster
> and so on).
>
> I think in this case, it makes sense to have optimization here, but we
> should do it like spinlocks are implemented and put this code into
> /port. It should help to implement special code for SPARC and Sun
> Studio for example.

I don't have any thoughts on this part beyond what I stated above, but
hopefully Jeremy does.

I am going to mark this "Waiting on author" pending a response to all of
the above, though hopefully Dan Colish will continue with his reviewing
efforts in the meantime.

Thanks,

...Robert
Re: [PATCH v3] Avoid manual shift-and-test logic in AllocSetFreeIndex
From: Stefan Kaltenbrunner
Robert Haas wrote:
> On Tue, Jun 30, 2009 at 3:08 AM, Jeremy Kerr <jk@ozlabs.org> wrote:
>> Move the shift-and-test logic into a separate fls() function, which
>> can use __builtin_clz() if it's available.
>>
>> This requires a new check for __builtin_clz in the configure script.
>>
>> Results in a ~2% performance increase on PowerPC.
>
> [...]
>
> Question: are we only doing this because of PowerPC?  What is going to
> happen on x86 and other architectures?  x86 is not the center of the
> universe, but at a very minimum we need to make sure that things don't
> go backwards on what is a very common platform.  Has anyone done any
> benchmarking of this?
>
> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC.  But since it
> gives no details on exactly what the test was that improved by 2%, it's
> hard to judge the impact of this. [...] On the other hand, if the test
> was something like a pgbench run, then this is really very impressive.
> But we don't know which it is.

There are numbers in the original thread that led to this patch:
http://archives.postgresql.org/pgsql-hackers/2009-06/msg00159.php

Stefan
Hi Robert,

> That having been said, Jeremy, you probably want to take a look at
> those comments and I have a few responses to them as well.

OK, thanks for the heads-up.

> following comment:
> > Applied and built cleanly. Regress passes. Trying to hunt down ppc
> > box to see if performance enhancement can be seen.
>
> Question: are we only doing this because of PowerPC?

No, this patch will benefit any architecture that has a gcc
implementation of __builtin_clz. I know that both x86 and powerpc gcc
support this.

> What is going to happen on x86 and other architectures? x86 is not
> the center of the universe, but at a very minimum we need to make sure
> that things don't go backwards on what is a very common platform. Has
> anyone done any benchmarking of this?

Yes, Atsushi Ogawa did some benchmarking with this patch on x86; his
numbers are here:

http://archives.postgresql.org/message-id/4A2895C4.9050108@hi-ho.ne.jp

In fact, Atsushi originally submitted a patch using inline asm (using
"bsr") to do this on x86. Coincidentally, I was working on a powerpc
implementation (using "cntlz") at the same time, so submitted a patch
using the gcc builtin that would work on both (and possibly other)
architectures.

> A related question: the original email for this patch says that it
> results in a performance increase of about 2% on PPC. But since it
> gives no details on exactly what the test was that improved by 2%,
> it's hard to judge the impact of this. If this means that
> AllocSetFreeIndex() is 2% faster, I think we should reject this patch
> and move on. It's not worth introducing architecture-specific code
> to get a performance improvement that probably won't even move the
> needle on a macro-benchmark. On the other hand, if the test was
> something like a pgbench run, then this is really very impressive.
> But we don't know which it is.

The 2% improvement I saw is for a sysbench OLTP run. I'm happy to re-do
the run and report specific numbers if you need them.

> Zdenek Kotala added this comment:
> > I have several objections:
> >
> > - inline is forbidden to use in PostgreSQL - you need exception or
> >   do it differently
> >
> > - type mismatch - Size vs unsigned int vs 32. you should use only
> >   Size and sizeof(Size)

OK, happy to make these changes. What's the commitfest process here,
should I redo the patch and re-send to -hackers?

(inline again: should I just make this a static, the compiler can inline
where possible? or do you want a macro?)

> > And general comment:
> >
> > Do we want to have this kind of code which is optimized for one
> > compiler and platform.

One compiler, multiple platforms.

> > See openSSL or MySQL, they have many optimizations which work fine
> > on one platform with specific version of compiler and specific
> > version of OS. But if you select different version of compiler or
> > different compiler you can get very different performance result
> > (e.g. MySQL 30% degradation, OpenSSL RSA 3x faster and so on).

I don't think we're going to see a lot of variation here (besides the
difference where __builtin_clz isn't available). Given that this is
implemented with a couple of instructions, I'm confident that we won't
see any degradation for platforms that support __builtin_clz. For the
other cases, we just use the existing while-loop algorithm, so there
should be no change (unless we mess up the inlining...).

> > I think in this case, it makes sense to have optimization here, but
> > we should do it like spinlocks are implemented and put this code
> > into /port.
Unless I'm missing something, spinlocks are done the same way - we have
one file, src/include/storage/s_lock.h, which is mostly different
implementations of the lock primitives for different platforms,
separated by preprocessor tests.

Essentially, this is just one line of code, protected by
HAVE_BUILTIN_CLZ (which is a feature-test, rather than a specific
platform or compiler test), and is only used in one place. I don't think
that warrants a separate file.

Cheers,

Jeremy
Jeremy Kerr <jk@ozlabs.org> writes:
>>> - inline is forbidden to use in PostgreSQL - you need exception or
>>> do it differently

> (inline again: should I just make this a static, the compiler can inline
> where possible? or do you want a macro?)

I don't know where Zdenek got the idea that we have something against
"inline".

So far as I can see, recent versions of gcc claim to support
__builtin_clz on all supported architectures.  On some it might be no
faster than our existing loop, but it seems unlikely to be slower.

The two comments I have are

* do something other than the hardwired "32" for word size; perhaps
  sizeof(int) * BITS_PER_BYTE.

* do not use the separate "fls" function.  On a compiler that fails to
  inline it, this patch would be a net performance loss, which we're not
  likely to tolerate for a patch that has no other reason to live than
  performance.  Just #if the builtin right into the one place where it
  will be used.

			regards, tom lane
Rather than testing single bits in a loop, change AllocSetFreeIndex to
use the __builtin_clz() function to calculate the chunk index.

This requires a new check for __builtin_clz in the configure script.

Results in a ~2% performance increase on sysbench on PowerPC.

Signed-off-by: Jeremy Kerr <jk@ozlabs.org>

---
v4: don't use separate fls() function, remove hardcoded 32

---
 configure.in                  | 13 +++++++++++++
 src/backend/utils/mmgr/aset.c | 14 +++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)

*** a/configure.in
--- b/configure.in
***************
*** 1361,1366 **** case $host_os in
--- 1361,1379 ----
      AC_FUNC_FSEEKO;;
  esac
  
+ # GCC builtins
+ #
+ # We need AC_TRY_LINK here, as the prototype generated by AC_CHECK_FUNC
+ # will cause gcc to try to reference a non-builtin symbol.
+ 
+ AC_MSG_CHECKING([for __builtin_clz])
+ AC_TRY_LINK([],
+   [__builtin_clz(0);],
+   [AC_DEFINE(HAVE_BUILTIN_CLZ, 1,
+              [Define to 1 if you have __builtin_clz().])
+    AC_MSG_RESULT(yes)],
+   [AC_MSG_RESULT(no)])
+ 
  #
  # Pthreads

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 268,281 **** AllocSetFreeIndex(Size size)
  {
  	int			idx = 0;
  
! 	if (size > 0)
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
--- 268,293 ----
  {
  	int			idx = 0;
  
! 	if (size > (1 << ALLOC_MINBITS))
  	{
  		size = (size - 1) >> ALLOC_MINBITS;
+ 
+ #ifdef HAVE_BUILTIN_CLZ
+ 		/* If possible, use __builtin_clz to calculate the chunk index - this
+ 		 * should be O(1) rather than O(n). The builtin takes an unsigned int,
+ 		 * so we need to cast from the possibly 64-bit Size type. This cast
+ 		 * won't overflow, as the limit is at 2^(32 + ALLOC_MINBITS) bytes,
+ 		 * which is larger than ALLOC_CHUNK_LIMIT.
+ 		 */
+ 		idx = sizeof(unsigned int) * BITS_PER_BYTE -
+ 			__builtin_clz((unsigned int)size);
+ #else
  		while (size != 0)
  		{
  			idx++;
  			size >>= 1;
  		}
+ #endif
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
  	}
Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

> This requires a new check for __builtin_clz in the configure script.

> Results in a ~2% performance increase on sysbench on PowerPC.

I did some performance testing on this by extracting the
AllocSetFreeIndex function into a standalone test program that just
executed it a lot of times in a loop.  And there's a problem: on x86_64
it is not much of a win.  The code sequence that gcc generates for
__builtin_clz is basically

	bsrl	%eax, %eax
	xorl	$31, %eax

and it turns out that Intel hasn't seen fit to put a lot of effort into
the BSR instruction.  It's constant time, all right, but on most of
their CPUs that constant time is like 8 or 16 times slower than an ADD;
cf http://www.intel.com/Assets/PDF/manual/248966.pdf

What I found on both a couple-years-old Xeon and a pretty new Macbook is
that the __builtin_clz code is actually *slower* than the naive loop for
the fewest-iterations case (request size up to 16 bytes) and about a
breakeven at the next size category (up to 32).  It doesn't start to
look like a useful win until you get to sizes up to 1KB or so.
Unfortunately PG spends a lot more time allocating little structs than
big ones.

I suppose that it's more of a win on PPC (where probably __builtin_clz
is actually fast), but on the basis of these results it's going to be no
gain and maybe a loss on Intel.

The other problem is that this approach is gcc-specific, which seems
particularly bad for something that's only going to be a win on
non-Intel platforms; they're much more likely to be using non-gcc
compilers.

I wonder whether it would work better to do manual unrolling of the
shift loop.  That is, forget about the builtin and do something like

	while (size >= 8)
	{
		idx += 4;
		size >>= 4;
	}
	while (size)
	{
		idx++;
		size >>= 1;
	}

Obviously there's room for experimental optimization of the particular
shift strides we use here, but remember that the total shift count is
never going to exceed about 10, and will usually be quite a bit less.
I did some quick experimentation along these lines and couldn't really
persuade myself that unrolling was going to win much, but that's an
Intel-specific result.

Anyway, based on what I'm seeing here I can't convince myself that this
is worth introducing a compiler dependency for.

			regards, tom lane
Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
From: Stefan Kaltenbrunner
Tom Lane wrote:
> Jeremy Kerr <jk@ozlabs.org> writes:
>> Rather than testing single bits in a loop, change AllocSetFreeIndex to
>> use the __builtin_clz() function to calculate the chunk index.
>
>> This requires a new check for __builtin_clz in the configure script.
>
>> Results in a ~2% performance increase on sysbench on PowerPC.
>
> I did some performance testing on this by extracting the
> AllocSetFreeIndex function into a standalone test program that just
> executed it a lot of times in a loop.  And there's a problem: on
> x86_64 it is not much of a win.  The code sequence that gcc generates
> for __builtin_clz is basically
>
>	bsrl	%eax, %eax
>	xorl	$31, %eax
>
> and it turns out that Intel hasn't seen fit to put a lot of effort into
> the BSR instruction.  It's constant time, all right, but on most of
> their CPUs that constant time is like 8 or 16 times slower than an ADD;
> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

Hmm, interesting - I don't have the exact numbers any more, but that
patch (or a previous version of it) definitely showed a noticeable
improvement when I tested with sysbench on a current-generation Intel
Nehalem...

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort
>> into the BSR instruction.  It's constant time, all right, but on most
>> of their CPUs that constant time is like 8 or 16 times slower than an
>> ADD; cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm, interesting - I don't have the exact numbers any more, but that
> patch (or a previous version of it) definitely showed a noticeable
> improvement when I tested with sysbench on a current-generation Intel
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine.  On 32 I got these
totals:

	 0	5190113
	 1	5663980
	 2	3573261
	 3	4476398
	 4	4246178
	 5	1100634
	 6	386501
	 7	601654
	 8	44884
	 9	52372
	10	202801

and on 64 these:

	 0	2139534
	 1	5994692
	 2	5711479
	 3	3289034
	 4	4550931
	 5	2573389
	 6	487566
	 7	588470
	 8	155148
	 9	52750
	10	202597

If you want to do the same in some other workload, feel free.  I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2, #3).  (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)  What I'm seeing with this harness on my x86
machines is that __builtin_clz is indeed a bit faster than a naive loop,
but not by very much --- it saves maybe 25% of the runtime.  It's better
on an old PPC Mac; saves about 50%.  Still, these are not impressive
numbers for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.
			regards, tom lane


*** src/backend/storage/ipc/ipc.c.orig	Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c	Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
  						on_proc_exit_list[on_proc_exit_index].arg);
  	on_proc_exit_index = 0;
+ 
+ 	AllocSetPrintStats();
  }
  /* ------------------
*** src/backend/utils/mmgr/aset.c.orig	Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c	Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif
  
+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+ 
+ void
+ AllocSetPrintStats()
+ {
+ 	int		i;
+ 
+ 	for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+ 		fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+ 
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
  			size >>= 1;
  		}
  		Assert(idx < ALLOCSET_NUM_FREELISTS);
+ 		allocsizes[idx]++;
  	}
  	return idx;


/*
 * usage: time ./testit N
 * N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
	2139534, 5994692, 5711479, 3289034, 4550931, 2573389,
	487566, 588470, 155148, 52750, 202597
#else
	5190113, 5663980, 3573261, 4476398, 4246178, 1100634,
	386501, 601654, 44884, 52372, 202801
#endif
};

extern int AllocSetFreeIndex(size_t size);

int
main(int argc, char **argv)
{
	int		repeat = atoi(argv[1]);

	while (repeat-- > 0)
	{
		int		k;

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		{
			int		n = numoccurs[k];
			size_t	siz = (1 << (ALLOC_MINBITS + k));

			while (n-- > 0)
			{
				AllocSetFreeIndex(siz);
			}
		}
	}

	return 0;
}


#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS 3			/* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *		Depending on the size of an allocation compute which freechunk
 *		list of the alloc set it belongs to.  Caller must have verified
 *		that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
	int		idx = 0;

	if (size > (1 << ALLOC_MINBITS))
	{
		size = (size - 1) >> ALLOC_MINBITS;
#ifdef HAVE_BUILTIN_CLZ
		idx = sizeof(unsigned int) * BITS_PER_BYTE -
			__builtin_clz((unsigned int)size);
#else
		do
		{
			idx++;
			size >>= 1;
		} while (size != 0);
#endif
	}

	return idx;
}
I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature.  However, some quick testing
> didn't find an unrolling that helps much.

Hmm, actually this seems to work ok:

	idx++;
	size >>= 1;
	if (size != 0)
	{
		idx++;
		size >>= 1;
		if (size != 0)
		{
			idx++;
			size >>= 1;
			if (size != 0)
			{
				idx++;
				size >>= 1;
				while (size != 0)
				{
					idx++;
					size >>= 1;
				}
			}
		}
	}

(this is with the initial "if (size > (1 << ALLOC_MINBITS))" so that we
know the starting value is nonzero)

This seems to be about a wash or a small gain on x86_64, but on my PPC
Mac laptop it's very nearly identical in speed to the __builtin_clz
code.  I also see a speedup on HPPA, for which my gcc is too old to know
about __builtin_clz.

Anyone want to see if they can beat that?  Some testing on other
architectures would help too.

			regards, tom lane
On Mon, Jul 20, 2009 at 8:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyone want to see if they can beat that?  Some testing on other
> architectures would help too.

Hm, I took the three implementations so far (normal, unrolled, and clz)
as well as the two from
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
and got some very strange results:

	normal:       1.494s
	clz:          2.214s
	unrolled:     2.966s
	lookup table: 0.001s
	float hack:   11.930s

I can't see why the unrolled implementation is slower than the
non-unrolled so I'm suspecting something's wrong with my #ifdefs but I
don't see it.

I do think the code I grabbed from the Stanford page might be off-by-one
for our purposes but I haven't looked closely at that.

I also wonder if this microbenchmark is actually ok because it's testing
the same value over and over so any branch-prediction will shine
unrealistically well.

-- 
greg
http://mit.edu/~gsstark/resume.pdf
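[For reference, the lookup-table technique from the Stanford bit-hacks
page that Greg is benchmarking is essentially the sketch below.  It is
illustrative only -- not the code from his attached harness; the helper
name log2_lookup is ours.  It returns floor(log2(v)) with at most two
table probes; the free-list index that the naive loop computes is this
value plus one, taken after the (size - 1) >> ALLOC_MINBITS adjustment.]

	/* 256-entry table: LogTable256[v] == floor(log2(v)) for 1 <= v <= 255 */
	static const char LogTable256[256] = {
	#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
		-1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
		LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
		LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
	};

	static int
	log2_lookup(unsigned int v)		/* requires v > 0 */
	{
		unsigned int t, tt;

		/* probe the high half first, then fall back to the low half */
		if ((tt = v >> 16))
			return (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
		else
			return (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
	}

[The "float hack" from the same page instead derives floor(log2(v)) by
stuffing the value into an IEEE double and reading back the exponent
field, which explains why its timings track the CPU's float path rather
than its integer path.]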
Doh, I forgot this bit.  Will rerun tests now.

On Mon, Jul 20, 2009 at 8:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (Note: I found out that it's necessary to split the test into two files
> --- otherwise gcc will inline AllocSetFreeIndex and partially
> const-fold the work, leading to skewed results.)

-- 
greg
http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> I also wonder if this microbenchmark is actually ok because it's
> testing the same value over and over so any branch-prediction will
> shine unrealistically well.

Yeah, that is a good point --- and it would benefit the unrolled loop
more than the other versions.  We should probably revise the test
harness so it mixes the size requests a bit.  I'm not sure of a suitably
simple way to do that, though.

			regards, tom lane
On Tue, Jul 21, 2009 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> I also wonder if this microbenchmark is actually ok because it's
>> testing the same value over and over so any branch-prediction will
>> shine unrealistically well.
>
> Yeah, that is a good point --- and it would benefit the unrolled loop
> more than the other versions.  We should probably revise the test
> harness so it mixes the size requests a bit.  I'm not sure of a
> suitably simple way to do that, though.

Well, it was a bit of a pain, but I filled an array with (1/1000 scaled
down) values and then permuted them.  I also went ahead and set the
low-order bits to random values, since the lookup-table-based algorithm
might be affected by them.

The results are a bit disappointing on my machine; only the CLZ and
lookup table come out significantly ahead:

$ ./testit 10
 0: 620
 1: 4949
 2: 5378
 3: 3560
 4: 4426
 5: 4218
 6: 1098
 7: 387
 8: 599
 9: 44
10: 52

	clz          1.530s
	lookup table 1.720s
	float hack   4.424s
	unrolled     5.280s
	normal       5.369s

-- 
greg
http://mit.edu/~gsstark/resume.pdf
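[A minimal sketch of that kind of shuffled workload, for anyone wanting
to reproduce the setup -- illustrative only, not Greg's attached
harness; build_workload and the use of random() are our own choices.  It
generates request sizes matching the per-category counts printed above,
randomizes the low-order bits, and shuffles the array so the branch
predictor sees a realistic mix.  The timing loop then just walks the
array calling AllocSetFreeIndex on each entry, with that function
compiled in a separate file as Tom noted, so gcc cannot const-fold it.]

	#include <stdlib.h>

	#define ALLOC_MINBITS 3
	#define ALLOCSET_NUM_FREELISTS 11

	extern int AllocSetFreeIndex(size_t size);	/* compiled separately */

	/* per-category call counts, as printed by the instrumented run above */
	static const int numoccurs[ALLOCSET_NUM_FREELISTS] =
		{620, 4949, 5378, 3560, 4426, 4218, 1098, 387, 599, 44, 52};

	static size_t *
	build_workload(int *n_out)
	{
		int		total = 0, i, k, n = 0;
		size_t *sizes;

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
			total += numoccurs[k];
		sizes = malloc(total * sizeof(size_t));

		for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
		{
			/* category k covers requests in (base/2, base], except k == 0 */
			size_t	base = (size_t) 1 << (ALLOC_MINBITS + k);

			for (i = 0; i < numoccurs[k]; i++)
				sizes[n++] = (k == 0) ? 1 + random() % base
									  : base / 2 + 1 + random() % (base / 2);
		}

		/* Fisher-Yates shuffle so the categories arrive in random order */
		for (i = total - 1; i > 0; i--)
		{
			int		j = random() % (i + 1);
			size_t	tmp = sizes[i];

			sizes[i] = sizes[j];
			sizes[j] = tmp;
		}

		*n_out = total;
		return sizes;
	}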
Greg Stark <gsstark@mit.edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.

> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:

> clz          1.530s
> lookup table 1.720s
> float hack   4.424s
> unrolled     5.280s
> normal       5.369s

It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code.  I've added that variant
and get these results on my machines:

x86_64 (Xeon):

	clz                 15.357s
	lookup table        16.582s
	small lookup table  16.705s
	float hack          25.138s
	unrolled            64.630s
	normal              79.025s

PPC:

	clz                  3.842s
	lookup table         7.298s
	small lookup table   8.799s
	float hack          19.418s
	unrolled             7.656s
	normal               8.949s

HPPA:

	clz                  (n/a)
	lookup table        11.515s
	small lookup table  10.803s
	float hack          16.502s
	unrolled            17.632s
	normal              19.754s

Not sure why the "small lookup table" variant actually seems slower than
the original on two of these machines; it can hardly be slower in
reality since it's strictly less code.  Maybe some weird code-alignment
issue?

It also seems weird that the x86_64 is now showing a much bigger gap
between clz and "normal" than before.  I don't see how branch prediction
would do much for the "normal" code.

			regards, tom lane
Hi Greg,

Thanks for the benchmark app, thought I'd pitch in with some ppc
results:

> clz          1.530s
> lookup table 1.720s
> float hack   4.424s
> unrolled     5.280s
> normal       5.369s

POWER5+:

	clz          2.046s
	lookup table 2.188s
	float hack   7.786s
	unrolled     6.353s
	normal       7.033s

POWER6:

	clz          1.063s
	lookup table 1.199s
	float hack   3.575s
	unrolled     2.290s
	normal       3.456s

Cheers,

Jeremy
Jeremy Kerr <jk@ozlabs.org> writes:
> Thanks for the benchmark app, thought I'd pitch in with some ppc
> results:

It looks to me like we should go with the lookup table approach, as
being the best tradeoff of speed improvement vs platform and compiler
independence.  The "float hack" is right out ;-)

I wonder whether it is worth fooling with alternatives for the data type
of the lookup table entries.  unsigned char might be a tad faster (avoid
useless sign-extension work).  int might be faster yet, but it would
enlarge the table, creating a distributed overhead that the
microbenchmark would completely fail to show.  Or we could buy that back
by reducing the table to cover only 6 bits, knowing that 12 is more than
we need.

			regards, tom lane
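[To make the 6-bit idea concrete: a 64-entry variant might look like the
sketch below -- our illustration only, not anything posted or committed
in the thread; FlsTable64 and fls12 are hypothetical names.  Two probes
of a 64-entry table cover the roughly 12 significant bits that
(size - 1) >> ALLOC_MINBITS can have.]

	/* FlsTable64[v] is the number of significant bits in v, for 0..63 */
	static const unsigned char FlsTable64[64] =
	{
		0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
		5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
		6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
		6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6
	};

	static int
	fls12(unsigned int t)	/* requires 0 < t < 4096 */
	{
		/* one probe for values below 64, two for everything else */
		return (t >= 64) ? FlsTable64[t >> 6] + 6 : FlsTable64[t];
	}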
Jeremy Kerr <jk@ozlabs.org> writes:
> Rather than testing single bits in a loop, change AllocSetFreeIndex to
> use the __builtin_clz() function to calculate the chunk index.

Per subsequent discussion and testing, I've committed a patch that uses
a lookup table instead of depending on __builtin_clz().

http://archives.postgresql.org/message-id/20090721195312.207CA75331E@cvs.postgresql.org

			regards, tom lane
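[For reference, a lookup-table AllocSetFreeIndex along the lines
discussed above might look like the sketch below.  This is our
reconstruction from the thread, not necessarily the committed code; see
the commit referenced above for the real thing.  The table stores
number-of-significant-bits values for one byte, and because the caller
guarantees size <= ALLOC_CHUNK_LIMIT, at most two probes are needed --
the "small lookup table" variant in the terms used earlier.]

	#define LT16(n)  n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n

	static const unsigned char LogTable256[256] =
	{
		0, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4,
		LT16(5), LT16(6), LT16(6),
		LT16(7), LT16(7), LT16(7), LT16(7),
		LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8), LT16(8)
	};

	static int
	AllocSetFreeIndex(Size size)
	{
		int		idx = 0;

		if (size > (1 << ALLOC_MINBITS))
		{
			unsigned int t = (size - 1) >> ALLOC_MINBITS;

			/*
			 * t fits in well under 16 bits here, so testing the high byte
			 * once is enough; no need for the full 32-bit cascade.
			 */
			idx = (t & 0xff00) ? LogTable256[t >> 8] + 8 : LogTable256[t];
			Assert(idx < ALLOCSET_NUM_FREELISTS);
		}

		return idx;
	}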