On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Sure. I'm tempted to suggest that we only use the plain C version here,
> too. The SSE4.2 bms_num_members() test I did yesterday used it and showed
> improvement at one word. If we do that, we can rip out even more code
> since we no longer need the popcount built-ins.
Unlike the 32-bit case, people do run production on 64-bit platforms
that are not Arm/x86, so we would need to test whether the builtins
are worth it for them. That seems like a separate effort. I can help
with that, but let's get the tested stuff in first.
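
For anyone following along, here is a minimal sketch of the two kinds
of 64-bit popcount being compared. The function names are made up for
illustration and are not the ones in the patch, and the bit-parallel
reduction shown is just one common way to write a plain C version, not
necessarily what the patch does.

```c
#include <stdint.h>

/*
 * Hypothetical plain C popcount: sum bits in 2-bit, 4-bit, then 8-bit
 * fields, and add up the bytes with a multiply.
 */
static inline int
popcount64_plain(uint64_t x)
{
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    x = (x & UINT64_C(0x3333333333333333)) +
        ((x >> 2) & UINT64_C(0x3333333333333333));
    x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
    return (int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}

/*
 * Hypothetical wrapper: use the compiler builtin where it exists,
 * otherwise fall back to the plain C version above.
 */
static inline int
popcount64_builtin(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(x);
#else
    return popcount64_plain(x);
#endif
}
```

Whether the builtin beats the plain C version depends on what the
compiler emits for the target, which is why it would need measuring
on each platform.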
> * tests plain C version on an Apple M3 *
>
> Yeah, the plain C version might be marginally slower than the built-in
> version for that test, but it still seems quite a bit faster than HEAD.
>
> HEAD  v8  v10
>   40  25   29
That's good to know, and maybe it'll be true elsewhere.
--
John Naylor
Amazon Web Services