I wrote:
> I'm still interested in the idea of doing a manual unroll instead of
> relying on a compiler-specific feature. However, some quick testing
> didn't find an unrolling that helps much.
Hmm, actually this seems to work ok:

    idx++; size >>= 1;
    if (size != 0) {
        idx++; size >>= 1;
        if (size != 0) {
            idx++; size >>= 1;
            if (size != 0) {
                idx++; size >>= 1;
                while (size != 0) {
                    idx++; size >>= 1;
                }
            }
        }
    }

(this is with the initial "if (size > (1 << ALLOC_MINBITS))" so that
we know the starting value is nonzero)
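For anyone who wants to try this, here is a minimal sketch of the unrolled fragment inside a complete function, next to a plain shift loop to check it against. The function names, the value 3 for ALLOC_MINBITS, and the "(size - 1) >> ALLOC_MINBITS" pre-shift are assumptions made for illustration; the message above states only the guard condition and the unrolled body.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration; not stated in the message. */
#define ALLOC_MINBITS 3

/* Reference version: plain shift loop (hypothetical name). */
int
free_index_loop(size_t size)
{
    int idx = 0;

    if (size > ((size_t) 1 << ALLOC_MINBITS)) {
        /* Assumed pre-shift; after the guard the result is nonzero. */
        size = (size - 1) >> ALLOC_MINBITS;
        while (size != 0) {
            idx++;
            size >>= 1;
        }
    }
    return idx;
}

/* Manually unrolled version: four peeled steps, then a fallback loop. */
int
free_index_unrolled(size_t size)
{
    int idx = 0;

    if (size > ((size_t) 1 << ALLOC_MINBITS)) {
        size = (size - 1) >> ALLOC_MINBITS;    /* same assumed pre-shift */
        /* size is known nonzero here, so the first step needs no test */
        idx++; size >>= 1;
        if (size != 0) {
            idx++; size >>= 1;
            if (size != 0) {
                idx++; size >>= 1;
                if (size != 0) {
                    idx++; size >>= 1;
                    while (size != 0) {
                        idx++; size >>= 1;
                    }
                }
            }
        }
    }
    return idx;
}

/* Check that both versions agree for every size from 0 through limit. */
void
check_unrolled_matches_loop(size_t limit)
{
    size_t s;

    for (s = 0; s <= limit; s++)
        assert(free_index_unrolled(s) == free_index_loop(s));
}
```

The first step can skip its test only because of the guard; without it, a zero starting value would still bump idx once.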
This seems to be about a wash or a small gain on x86_64, but on my
PPC Mac laptop it's very nearly identical in speed to the __builtin_clz
code. I also see a speedup on HPPA, for which my gcc is too old to
know about __builtin_clz.
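For comparison, a __builtin_clz-based version might look roughly like the sketch below. This is not necessarily the code that was benchmarked: the function name, the value of ALLOC_MINBITS, the pre-shift, and the compiler version check are all assumptions, and compilers that fail the check fall back to the plain loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed value for illustration; not stated in the message. */
#define ALLOC_MINBITS 3

/* Hypothetical name; counts the bits needed for the shifted size. */
int
free_index_clz(size_t size)
{
    int idx = 0;

    if (size > ((size_t) 1 << ALLOC_MINBITS)) {
        size_t t = (size - 1) >> ALLOC_MINBITS;    /* assumed pre-shift */

#if defined(__GNUC__) && (__GNUC__ >= 4)
        /*
         * __builtin_clz takes an unsigned int and is undefined for 0,
         * so require a nonzero value that fits in unsigned int.
         */
        assert(t != 0 && t <= UINT_MAX);
        idx = (int) (sizeof(unsigned int) * CHAR_BIT)
            - __builtin_clz((unsigned int) t);
#else
        /* Conservative fallback for compilers without the builtin. */
        while (t != 0) {
            idx++;
            t >>= 1;
        }
#endif
    }
    return idx;
}
```

Both paths compute the bit length of the shifted value, so they return the same index as the shift loop for any size that passes the range check.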
Anyone want to see if they can beat that? Some testing on other
architectures would help too.
regards, tom lane