Re: [PATCH] SVE popcount support - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PATCH] SVE popcount support
Date
Msg-id CANWCAZbE6Q7MiKP1iLEcWFoJm3cPEFns1+C76yF4LxzqwQiC4w@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] SVE popcount support  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: [PATCH] SVE popcount support
List pgsql-hackers
On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

> * 0002 introduces the Neon implementation, which conveniently doesn't need
>   configure-time checks or function pointers.  I noticed that some
>   compilers (e.g., Apple clang 16) compile in Neon instructions already,
>   but our hand-rolled implementation is better about instruction-level
>   parallelism and seems to still be quite a bit faster.

+pg_popcount64(uint64 word)
+{
+ return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}

This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?

Some thoughts to consider, some speculative and maybe not worth
putting time into:

> I did add a 2-register block
>   in the Neon implementation for processing the tail because I was worried
>   about its performance on smaller buffers, but that part might get removed
>   if I can't measure any difference.

Even if we can measure a difference on fixed-sized inputs, that might
not carry over when the branch is unpredictable.

+ /*
+ * Process remaining 8-byte blocks.
+ */
+ for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+ {
+ popcnt += pg_popcount64(*((uint64 *) buf));
+ buf += sizeof(uint64);
+ }

This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.

> * 0003 introduces the SVE implementation.  You'll notice I've moved all the
>   function pointer gymnastics into the pg_popcount_aarch64.c file, which is
>   where the Neon implementations live, too.  I also tried to clean up the
>   configure checks a bit.  I imagine it's possible to make them more
>   compact, but I felt that the enhanced readability was worth it.

I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as
well.

> * For both Neon and SVE, I do see improvements with looping over 4
>   registers at a time, so IMHO it's worth doing so even if it performs the
>   same as 2-register blocks on some hardware.

I wonder if alignment matters for these larger blocks.

--
John Naylor
Amazon Web Services



pgsql-hackers by date:

Previous
From: jian he
Date:
Subject: Re: support fast default for domain with constraints
Next
From: John Naylor
Date:
Subject: Re: Improve CRC32C performance on SSE4.2