On Fri, Feb 20, 2026 at 03:21:05PM +0700, John Naylor wrote:
> On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Yeah, the plain C version might be marginally slower than the built-in
>> version for that test, but it still seems quite a bit faster than HEAD.
>>
>> HEAD  v8  v10
>> 40    25  29
>
> (for the following, numbers are nanoseconds per call from
> drive_bms_num_members())
>
> Seems similar on S390X / gcc 13.3 (last week I only tested a single
> bitmapword and don't feel like repeating):
>
> master (older): 4.1083 (call builtin)
> v8: 2.8889 (inline builtin)
> v10: 2.7961 (inline pure C)

Thanks for testing it.

> On ppc64le / gcc 8.5, without native popcount it suffers:
>
> words  master  v14
> 1      4.5     6.5
> 2      5.8     9.7
> 64     67.9    101
> 128    143     190
>
> So one up, one down among obscure platforms. There seems to be only a
> fairly thin case left for the builtin, although it's not zero.

I spent some time looking at how clang/gcc compiled the plain-C version on
various architectures [0], and I was pleasantly surprised to discover that
at some point in recent history they started automatically converting it to
special popcount instructions. I suspect that you'd see better results on
ppc64le if you upgraded the compiler...
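
For anyone who doesn't want to click through, here is a minimal sketch of
the kind of plain-C popcount under discussion. It is not copied from the
patch or from [0]; the names popcount64_plain() and popcount_words() are
made up for illustration, and I'm assuming a 64-bit bitmapword. Whether a
given compiler version turns this exact form into a popcount instruction is
something to check in the generated assembly rather than take on faith.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical plain-C popcount for one 64-bit word, using the usual
 * bit-parallel reduction.  No compiler builtins are involved.
 */
static inline int
popcount64_plain(uint64_t word)
{
    /* Each 2-bit field now holds the count of bits set in that field. */
    word = word - ((word >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit counts into 4-bit fields. */
    word = (word & UINT64_C(0x3333333333333333)) +
        ((word >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit counts into bytes. */
    word = (word + (word >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    /* The multiply adds all eight byte counts into the top byte. */
    return (int) ((word * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical caller: total bits set across an array of words, roughly
 * the shape of work a multi-word bitmapset count has to do.
 */
static inline int
popcount_words(const uint64_t *words, size_t nwords)
{
    int         result = 0;

    for (size_t i = 0; i < nwords; i++)
        result += popcount64_plain(words[i]);
    return result;
}
```

Since the function is static inline and uses no builtins, the compiler is
free to inline it into the loop and, where it recognizes the idiom, emit
the native instruction on its own.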

[0] https://godbolt.org/z/v9vvx7E89

--
nathan