I've been preparing these for commit, and I've attached what I have so far.
A few notes:
* 0001 just renames the TRY_POPCNT_FAST macro to indicate that it's
x86_64-specific. IMO this is worth doing independent of this patch set,
but it's more important with the patch set since we need something
similar for AArch64. I think we should also consider moving the x86_64
stuff to its own file (perhaps combining it with the AVX-512 stuff), but
that can probably wait until later.
* 0002 introduces the Neon implementation, which conveniently doesn't need
configure-time checks or function pointers. I noticed that some
compilers (e.g., Apple clang 16) compile in Neon instructions already,
but our hand-rolled implementation is better about instruction-level
parallelism and seems to still be quite a bit faster.
* 0003 introduces the SVE implementation. You'll notice I've moved all the
function pointer gymnastics into the pg_popcount_aarch64.c file, which is
where the Neon implementations live, too. I also tried to clean up the
configure checks a bit. I imagine it's possible to make them more
compact, but I felt that the enhanced readability was worth it.
* For both Neon and SVE, I do see improvements with looping over 4
registers at a time, so IMHO it's worth doing so even if it performs the
same as 2-register blocks on some hardware. I did add a 2-register block
in the Neon implementation for processing the tail because I was worried
about its performance on smaller buffers, but that part might get removed
if I can't measure any difference.
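
For readers without the patches at hand, the loop shape described above (4
registers per iteration, then a 2-register block for the tail) can be
sketched roughly as follows. This is only an illustration and not the code
from the attached patches: the names sketch_popcount and sketch_count16 are
made up, the single-register step and the scalar byte loop are assumptions
about how the remainder might be handled, and the non-AArch64 path exists
only so that the sketch compiles everywhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

/* Count the set bits in 16 bytes, widened into two 64-bit lanes. */
static inline uint64x2_t
sketch_count16(const uint8_t *p)
{
	return vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vcntq_u8(vld1q_u8(p)))));
}
#endif

/* Hypothetical name; illustrates the loop structure only. */
static uint64_t
sketch_popcount(const char *buf, size_t bytes)
{
	const uint8_t *p = (const uint8_t *) buf;
	uint64_t	popcnt = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
	/* Independent accumulators let the four chains run in parallel. */
	uint64x2_t	acc0 = vdupq_n_u64(0);
	uint64x2_t	acc1 = vdupq_n_u64(0);
	uint64x2_t	acc2 = vdupq_n_u64(0);
	uint64x2_t	acc3 = vdupq_n_u64(0);

	/* Main loop: four 16-byte registers (64 bytes) per iteration. */
	while (bytes >= 64)
	{
		acc0 = vaddq_u64(acc0, sketch_count16(p));
		acc1 = vaddq_u64(acc1, sketch_count16(p + 16));
		acc2 = vaddq_u64(acc2, sketch_count16(p + 32));
		acc3 = vaddq_u64(acc3, sketch_count16(p + 48));
		p += 64;
		bytes -= 64;
	}

	/* Tail: one 2-register block (32 bytes), if enough data remains. */
	if (bytes >= 32)
	{
		acc0 = vaddq_u64(acc0, sketch_count16(p));
		acc1 = vaddq_u64(acc1, sketch_count16(p + 16));
		p += 32;
		bytes -= 32;
	}

	/* Assumed extra step: a single remaining 16-byte register. */
	if (bytes >= 16)
	{
		acc2 = vaddq_u64(acc2, sketch_count16(p));
		p += 16;
		bytes -= 16;
	}

	/* Fold the four accumulators into one scalar total. */
	popcnt += vaddvq_u64(vaddq_u64(vaddq_u64(acc0, acc1),
								   vaddq_u64(acc2, acc3)));
#endif

	/* Remaining bytes (or the whole buffer when Neon is unavailable). */
	for (; bytes > 0; bytes--, p++)
	{
		for (uint8_t b = *p; b != 0; b &= (uint8_t) (b - 1))
			popcnt++;
	}

	return popcnt;
}
```

Keeping four separate accumulators is what exposes the instruction-level
parallelism: each add depends only on its own chain, so the loads and
counts for the next register need not wait on the previous one.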
I'm planning to run several more benchmarks, but everything I've seen thus
far has looked pretty good.
--
nathan