> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks. For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE. And for other systems, we still try to use
> __builtin_popcount() and friends in the fallback paths, which IIUC are
> available on both gcc and clang (and maybe elsewhere). IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.
Yes, I saw that the code is on a rather obscure path, but those machines were my only options for quick benchmarks. I reasoned that the code path still exists, and that eliminating branching there would most likely be beneficial anyway. But you are right, we need to test it on the target architectures/compilers. I'll try to do that.
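
For reference, a branch-free fallback of this general kind could look roughly like the sketch below. It is an illustration only, not the patch under discussion: the function names popcount64_swar() and popcount64_portable() are hypothetical, and the SWAR bit trick is a standard technique assumed here for the case where no compiler builtin is available.

```c
#include <stdint.h>

/*
 * Hypothetical example: count set bits in a 64-bit word without any
 * branches or lookup tables (the classic SWAR reduction).
 */
static inline int
popcount64_swar(uint64_t x)
{
    /* Sum adjacent bits into 2-bit fields. */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit fields into 4-bit fields. */
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit fields into per-byte counts. */
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* Add all byte counts together; the total lands in the top byte. */
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical wrapper: prefer the compiler builtin where it exists
 * (gcc and clang both define __GNUC__), otherwise use the SWAR version.
 */
static inline int
popcount64_portable(uint64_t x)
{
#if defined(__GNUC__)
    return __builtin_popcountll(x);
#else
    return popcount64_swar(x);
#endif
}
```

On gcc and clang the wrapper never reaches the SWAR code, which is the point made above: a benchmark only exercises the branch-free path on a compiler without the builtin.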