Hi David,
Thanks for looking at it!
Yes, good point. I got carried away with experiments and totally forgot that..
I've re-run tests with -O3 and it seems like the Summing of Adjacent Bits might make sense for uint64:
X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T16:22:43+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB
Load Average: 2.67, 4.72, 2.90
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 5.92 ns 5.90 ns 113871130
Popcnt64_AdjacentBits 5.30 ns 5.27 ns 115084258
Popcnt64_PG 8.34 ns 8.32 ns 80622869
Popcnt32_BranchlessLookup 3.62 ns 3.61 ns 197816110
Popcnt32_AdjacentBits 4.56 ns 4.55 ns 154932383
Popcnt32_PG 5.32 ns 5.31 ns 121845083
```
M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
L1 Data 128 KiB (x10)
L1 Instruction 192 KiB (x10)
L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup 2.13 ns 2.13 ns 342101937
Popcnt64_AdjacentBits 1.89 ns 1.89 ns 370316571
Popcnt64_PG 3.83 ns 3.83 ns 190990088
Popcnt32_BranchlessLookup 1.19 ns 1.19 ns 591797267
Popcnt32_AdjacentBits 1.73 ns 1.73 ns 409381266
Popcnt32_PG 2.01 ns 2.01 ns 344969625
```
---
Cheers,
Andy