Currently, every platform goes through a function pointer to call popcount on a word-sized input, even though on non-x86 we don't arrange for a fast implementation that would make the indirection worthwhile.
0001 moves some declarations around so that "slow" popcount functions are called directly on non-x86 platforms.
0002 was an idea to simplify and unify the coding for the slow functions.
Also attached is a test module for building microbenchmarks.
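To make the difference in call style concrete, here is a minimal sketch. It is not the code in the patch: the names sketch_popcount32, sketch_popcount32_slow and the SKETCH_USE_RUNTIME_DISPATCH macro are made up for illustration, and the pointer is simply initialized to the slow function where a real build would choose at startup.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
sketch_popcount32_slow(uint32_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;   /* drops the lowest set bit */
        count++;
    }
    return count;
}

#ifdef SKETCH_USE_RUNTIME_DISPATCH
/*
 * Hypothetical x86-style setup: callers go through a pointer so that a
 * faster implementation could be selected at run time. Here it just
 * points at the slow function.
 */
static int (*sketch_popcount32)(uint32_t word) = sketch_popcount32_slow;
#else
/*
 * Everywhere else: no faster alternative exists, so call the slow
 * function directly and let the compiler inline it.
 */
static inline int
sketch_popcount32(uint32_t word)
{
    return sketch_popcount32_slow(word);
}
#endif
```

Callers write sketch_popcount32(x) in both cases; only the builds that can actually pick a faster function pay for the indirect call.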
On a Power8 machine using gcc 4.8, and running
time ./inst/bin/psql -c 'select drive_popcount(100000, 1024)'
I get
master: 647ms
0001: 183ms
0002: 228ms
So 0001 is a clear winner on that platform. 0002 is still good, but slower than 0001 for some reason. It turns out that on master, gcc does emit a popcount instruction (popcntw) from the intrinsic:
0000000000000000 <pg_popcount32_slow>:
0: f4 02 63 7c popcntw r3,r3
4: b4 07 63 7c extsw r3,r3
8: 20 00 80 4e blr
...
The gcc docs mention a flag for this, but I'm not sure why the build doesn't seem to need it:
https://gcc.gnu.org/onlinedocs/gcc/RS_002f6000-and-PowerPC-Options.html#RS_002f6000-and-PowerPC-Options
Maybe that's because the machine I used was ppc64le, but I'm not sure a ppc binary built like this is portable to other hardware. For that reason, maybe 0002 is a good idea.
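For illustration, a slow path that depends on neither a hardware instruction nor a compiler flag could look like the sketch below. This is the well-known bit-parallel (SWAR) popcount, shown only as an example of portable code; it is not necessarily what 0002 does, and the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Bit-parallel popcount in plain C. Each step sums neighbouring fields
 * of the word in place: 1-bit fields into 2-bit sums, then 4-bit, then
 * 8-bit. The final multiply adds the eight byte sums into the top byte.
 */
static int
sketch_popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/* The 32-bit variant reuses the 64-bit routine, so there is one code path. */
static int
sketch_popcount32_swar(uint32_t x)
{
    return sketch_popcount64_swar((uint64_t) x);
}
```

Code like this gives the same result on any hardware the binary runs on, at the cost of not using popcntw even where it is available.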
--