On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:
> It might be a good idea to do a little new testing, and I see a use
> for a special 8-byte path independent of AVX512: v6 seems to regress a
> little for single-words. But, it turns out that when gcc turns
> __builtin_popcountl into a single instruction, it's inline, but if it
> emits portable bitwise ops, it does so in a function called
> __popcountdi2(). That can be avoided by hand-coding in C for normal
> builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

Oh, interesting. I looked into this a little more [0]. Both gcc and clang
generate cnt instructions for aarch64, so we're good there. However, clang
on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
a call to __popcountdi2() (which I imagine does something similar). It's
not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
start generating popcnt instructions for x86-64, which makes sense. 0005
seems like the correct move to me...
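
For readers following along, a hand-coded portable popcount of the kind
discussed above might look roughly like the sketch below. This is not the
code from 0005: the function name popcount64_portable is made up for
illustration, and the bit-twiddling shown is just the usual textbook
approach, assumed here to be similar in spirit to what the compilers emit.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not taken from 0005): count the set bits in a
 * 64-bit word using only portable bitwise operations.  Because this is
 * plain C in a static inline function, the compiler has no reason to
 * route it through an out-of-line call such as __popcountdi2().
 */
static inline int
popcount64_portable(uint64_t x)
{
    /* sum each pair of adjacent bits into a 2-bit field */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));

    /* sum adjacent 2-bit fields into 4-bit fields */
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));

    /* sum adjacent 4-bit fields into per-byte counts */
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);

    /* the multiply adds all byte counts into the top byte */
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}
```

Whether something like this actually beats the builtin on a given target
would still need to be measured; the point is only that the work stays
inline when no popcnt instruction is available.
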
[0] https://godbolt.org/z/he3WozG3E
--
nathan