Thread: [PATCH] SVE popcount support
Attachments: 0001-SVE-popcount-support.patch, SVE-popcount-support-PostgreSQL.png (replaced by Amazon with download links)
Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.
[0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:
>
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.
>
> [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi! To register an entry on the commitfest, you need to send the patch in one of these formats:
https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F
This is useful for reviewers who use cfbot or cputube.

--
Best regards,
Kirill Reshke
On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:
>
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.
>
> [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com
Hi!
I looked inside this patch. It is implemented mostly in the same way as the existing code that selects a popcount implementation at run time, which is good.
=== patch itself
1)
> // for small buffer sizes (<= 128-bytes), execute 1-byte SVE instructions
> // for larger buffer sizes (> 128-bytes), execute 1-byte + 8-byte SVE instructions
> // loop unroll by 2
PostgreSQL uses /* */ comment style.
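That is, something along these lines (just the quoted comments restated in the project's style):

```c
/*
 * For small buffer sizes (<= 128 bytes), execute 1-byte SVE instructions.
 * For larger buffer sizes (> 128 bytes), execute 1-byte + 8-byte SVE
 * instructions, with the loop unrolled by 2.
 */
```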
2)
> + if (bytes <= 128)
> + {
> + prologue_loop_bytes = bytes;
> + }
> + else
> + {
> + aligned_buf = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t);
> + prologue_loop_bytes = aligned_buf - buf;
> + }
For a single-statement branch, PostgreSQL does not use curly braces. Examples: [0] and [1].
[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68
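For illustration, the quoted block would then read like this (a sketch only, identifiers taken from the patch hunk above; the multi-statement else branch keeps its braces):

```c
if (bytes <= 128)
	prologue_loop_bytes = bytes;
else
{
	aligned_buf = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t);
	prologue_loop_bytes = aligned_buf - buf;
}
```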
3) The `if (bytes > 128)` loop in the pg_popcount_sve function should be commented. There is too much code without any explanation of why it works. If the original source is a paper or some other work, we could reference it.
==== By-hand benchmarking (I also used John Naylor's module)
non-patched
```
db1=# \timing
Timing is on.
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)
Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)
db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)
Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```
patched
```
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)
Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)
Time: 200716.879 ms (02:21.717) -- with small variance
```
I'm not sure how to interpret these results. Looks like this does not help much on a large $num?
--
Best regards,
Kirill Reshke
On Wed, Nov 27, 2024 at 03:43:27PM +0000, Malladi, Rama wrote:
> Attachments protected by Amazon: 0001-SVE-popcount-support.patch | SVE-popcount-support-PostgreSQL.png
> Amazon has replaced the attachments in this email with download links.
> Downloads will be available until December 27, 2024, 15:43 (UTC+00:00).
>
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used
> John Naylor's test_popcount module [0] to put together the attached graphs.
> This test didn't show any regressions with a relatively small number of bytes,
> and it showed the expected improvements with many bytes.

You must attach actual attachments for this to be considered. Download
links are unacceptable.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:
> Thank you, Kirill, for the review and the feedback. Please find inline my
> reply and an updated patch.

Thanks for the updated patch. I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

> +# Check for ARMv8 SVE popcount intrinsics
> +#
> +CFLAGS_POPCNT=""
> +PG_POPCNT_OBJS=""
> +PGAC_SVE_POPCNT_INTRINSICS([])
> +if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
> +  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
> +fi
> +if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
> +  PG_POPCNT_OBJS="pg_popcount_sve.o"
> +  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
> +fi
> +AC_SUBST(CFLAGS_POPCNT)
> +AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27). The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code. IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version
of Postgres, which will take some time for significant adoption, and 2)
this is brand new code, so we aren't introducing any regressions for
current users.

> +#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
> +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

	#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
	extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
	#endif

	#ifdef TRY_POPCNT_FAST
	...

> +#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
> +extern uint64 pg_popcount_sve(const char *buf, int bytes);
> +extern int check_sve_support(void);
> +#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

> +static inline uint64
> +pg_popcount_choose(const char *buf, int bytes)
> +{
> +	if (check_sve_support())
> +		pg_popcount_optimized = pg_popcount_sve;
> +	else
> +		pg_popcount_optimized = pg_popcount_slow;
> +	return pg_popcount_optimized(buf, bytes);
> +}
> +
> +#endif	/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?

> +// check if sve supported
> +int check_sve_support(void)
> +{
> +	// Read ID_AA64PFR0_EL1 register
> +	uint64_t pfr0;
> +	__asm__ __volatile__(
> +		"mrs %0, ID_AA64PFR0_EL1"
> +		: "=r" (pfr0));
> +
> +	// SVE bits are 32-35
> +	return (pfr0 >> 32) & 0xf;
> +}

Is this based on some reference code from a manual that we could cite
here? Or better yet, is it possible to do this without inline assembly
(e.g., with another intrinsic function)?

> +/*
> + * pg_popcount_sve
> + *		Returns the number of 1-bits in buf
> + */
> +uint64
> +pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan
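On the inline-assembly question: on Linux the SVE capability can be read from the kernel-provided hwcaps instead, which is the approach a later version of the patch in this thread takes. A minimal sketch:

```c
#include <stdbool.h>
#include <sys/auxv.h>		/* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>		/* HWCAP_SVE */

/*
 * Runtime SVE check without inline assembly.  Linux-specific: it relies on
 * the kernel-reported hwcaps, so other platforms would still need their own
 * check or a fallback.
 */
static bool
check_sve_support(void)
{
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
}
```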
The meson configure check seems to fail on my machine:

	error: too many arguments to function call, expected 0, have 1
	   10 |   svuint64_t popcnt = svcntb(val);
	      |                       ~~~~~~ ^~~
	error: returning '__SVInt64_t' from a function with incompatible result type 'int'
	   12 |   return popcnt == 0;
	      |          ^~~~~~~~~~~

The autoconf version seems to work okay, though.

+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,
right?

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it
(e.g., PGAC_XSAVE_INTRINSICS).

+  prog = '''
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+  __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+  const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+  svuint64_t popcnt = svcntb(val);
+  /* return computed value, to prevent the above being optimized away */
+  return popcnt == 0;
+}
+'''

This test looks quite different than the autoconf one. Why is that? I
would expect them to be the same. And I think ideally the test would
check that all the intrinsics functions we need are available.

+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+	return getauxval(AT_HWCAP) & HWCAP_SVE;
+}

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
more complicated than this. Are we missing something here?

+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, the buffers only when the input size is sufficiently large.
+	 */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
the #include for pg_bitutils.h (but below the one for c.h).

--
nathan
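For what it's worth, the two errors point at svcntb(), which takes no arguments and returns the vector length in bytes; the per-element popcount intrinsic is svcnt_u64_x(), and the result has to be reduced to a scalar before it can be returned. One way the quoted test could be repaired (a sketch only, not the wording that ended up in the tree):

```c
#include <arm_sve.h>

#if defined(__has_attribute) && __has_attribute (target)
__attribute__((target("arch=armv8-a+sve")))
#endif
int
main(void)
{
	const svbool_t pred = svptrue_b64();
	const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
	svuint64_t popcnt = svcnt_u64_x(pred, val);

	/* reduce to a scalar so the computation isn't optimized away */
	return svaddv_u64(pred, popcnt) == 0;
}
```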
On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> +	/*
>> +	 * For smaller inputs, aligning the buffer degrades the performance.
>> +	 * Therefore, the buffers only when the input size is sufficiently large.
>> +	 */
>
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs? I'm also curious what level of performance degradation you
>> were seeing.
>
> Here is a comparison of all three cases. Alignment is marginally better for inputs
> above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
>
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => condition "bytes > 4 * vec_len" is removed.
> Unaligned => the whole if block was removed
>
>   buf   | Always Aligned | Aligned After 128B | Unaligned
> --------+----------------+--------------------+-----------
>     16  |       37.851   |           38.203   |   34.971
>     32  |       37.859   |           38.187   |   34.972
>     64  |       37.611   |           37.405   |   34.121
>    128  |       45.357   |           45.897   |   41.890
>    256  |       62.440   |           63.454   |   58.666
>    512  |      100.120   |          102.767   |   99.861
>   1024  |      159.574   |          158.594   |  164.975
>   2048  |      282.354   |          281.198   |  283.937
>   4096  |      532.038   |          531.068   |  533.699
>   8192  |     1038.973   |         1038.083   | 1039.206
>  16384  |     2028.604   |         2025.843   | 2033.940

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason
not to do that?

+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

--
nathan
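To make the "4 vectors at a time" question concrete, such a loop could look roughly like this. This is a sketch only: the variable naming follows the hunk above, the svand/mask64 step is dropped since a plain popcount doesn't need it, it assumes the file is compiled with SVE enabled (e.g., -march=armv8-a+sve), and it assumes the caller handles any tail shorter than 4 * vec_len.

```c
#include <arm_sve.h>
#include <stdint.h>

/* Sketch: count 1-bits in loop_bytes bytes, four SVE vectors per iteration. */
static uint64_t
popcount_sve_4reg(const char *buf, uint64_t loop_bytes)
{
	const svbool_t	pred = svptrue_b64();
	const uint64_t	vec_len = svcntb();	/* vector length in bytes */
	svuint64_t	accum1 = svdup_u64(0),
				accum2 = svdup_u64(0),
				accum3 = svdup_u64(0),
				accum4 = svdup_u64(0);
	uint64_t	i;

	/* assumes loop_bytes is a multiple of vec_len * 4 */
	for (i = 0; i < loop_bytes; i += vec_len * 4)
	{
		accum1 = svadd_x(pred, accum1,
						 svcnt_x(pred, svld1(pred, (const uint64_t *) (buf + i))));
		accum2 = svadd_x(pred, accum2,
						 svcnt_x(pred, svld1(pred, (const uint64_t *) (buf + i + vec_len))));
		accum3 = svadd_x(pred, accum3,
						 svcnt_x(pred, svld1(pred, (const uint64_t *) (buf + i + vec_len * 2))));
		accum4 = svadd_x(pred, accum4,
						 svcnt_x(pred, svld1(pred, (const uint64_t *) (buf + i + vec_len * 3))));
	}

	/* fold the four per-lane accumulators into a single scalar */
	accum1 = svadd_x(pred, accum1, accum2);
	accum3 = svadd_x(pred, accum3, accum4);
	return svaddv(pred, svadd_x(pred, accum1, accum3));
}
```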
On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> Does this hand-rolled loop unrolling offer any particular advantage? What
>> do the numbers look like if we don't do this or if we process, say, 4
>> vectors at a time?
>
> The unrolled version performs better than the non-unrolled one, but
> processing four vectors provides no additional benefit. The numbers
> and code used are given below.

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

--
nathan
On Thu, Feb 06, 2025 at 10:33:35AM -0600, Nathan Bossart wrote:
> On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>>> Does this hand-rolled loop unrolling offer any particular advantage? What
>>> do the numbers look like if we don't do this or if we process, say, 4
>>> vectors at a time?
>>
>> The unrolled version performs better than the non-unrolled one, but
>> processing four vectors provides no additional benefit. The numbers
>> and code used are given below.
>
> Hm. Any idea why that is? I wonder if the compiler isn't using as many
> SVE registers as it could for this.

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine. After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

	../postgresql/src/port/pg_bitutils.c:230:41: error: invalid output constraint '=q' in asm
	  230 |         __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
	      |                                                 ^
	../postgresql/src/port/pg_bitutils.c:247:41: error: invalid output constraint '=q' in asm
	  247 |         __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
	      |                                                 ^
	2 errors generated.

--
nathan
> Hm. Any idea why that is? I wonder if the compiler isn't using as many
> SVE registers as it could for this.
Not sure; we tried forcing loop unrolling using the line below in the Makefile,
but the results are the same.
pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native
> I've also noticed that the latest patch doesn't compile on my M3 macOS
> machine. After a quick glance, I think the problem is that the
> TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
> versions.
Fixed, we tried using the existing "choose" logic guarded by TRY_POPCNT_FAST.
The latest patch bypasses TRY_POPCNT_FAST by having a separate choose logic
for aarch64.
-Chiranmoy
On Wed, Feb 19, 2025 at 09:31:50AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> Hm. Any idea why that is? I wonder if the compiler isn't using as many
>> SVE registers as it could for this.
>
> Not sure, we tried forcing loop unrolling using the below line in the Makefile
> but the results are the same.
>
> pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native

Interesting. I do see different assembly with the 2 and 4 register
versions, but I didn't get to testing it on a machine with SVE support
today.

Besides some additional benchmarking, I might make some small adjustments
to the patch. But overall, it seems to be in decent shape.

--
nathan
> Interesting. I do see different assembly with the 2 and 4 register
> versions, but I didn't get to testing it on a machine with SVE support
> today.
> Besides some additional benchmarking, I might make some small adjustments
> to the patch. But overall, it seems to be in decent shape.
Sounds good. Let us know your findings.
-Chiranmoy
On Fri, Mar 07, 2025 at 03:20:07AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> Sounds good. Let us know your findings.

Alright, here's what I saw on an R8g for drive_popcount(1000000, N):

  8-byte words   master        v5-no-sve     v5-sve        v5-4reg
  1              2.540 ms      2.170 ms      1.807 ms      2.178 ms
  2              2.534 ms      2.180 ms      1.804 ms      2.167 ms
  4              3.988 ms      3.240 ms      1.590 ms      2.879 ms
  8              5.033 ms      4.672 ms      2.175 ms      2.525 ms
  16             8.252 ms      10.916 ms     3.235 ms      3.588 ms
  32             20.932 ms     22.883 ms     5.134 ms      5.395 ms
  64             40.446 ms     45.668 ms     9.817 ms      9.285 ms
  128            66.087 ms     91.386 ms     20.072 ms     17.175 ms
  256            153.852 ms    182.594 ms    40.447 ms     32.212 ms
  512            246.271 ms    300.941 ms    87.116 ms     60.729 ms
  1024           487.180 ms    607.289 ms    180.574 ms    116.948 ms
  2048           969.335 ms    1223.838 ms   363.595 ms    232.575 ms
  4096           1934.646 ms   2472.154 ms   729.525 ms    459.495 ms

(Note that there should be no need to test anything smaller than 8 bytes
because we use the inline version in pg_bitutils.h in that case.)

v5-no-sve is the result of using a function pointer, but pointing to the
"slow" versions instead of the SVE version. v5-sve is the result of the
latest patch in this thread on a machine with SVE support, and v5-4reg is
the result of the latest patch in this thread modified to process 4
register's worth of data at a time.

The biggest takeaways for me are as follows:

* The 4-register version does show some nice improvements as the data
  grows.

* Machines without SVE will likely incur a rather sizable regression from
  the newly introduced function pointer.

For the latter point, I think we should consider trying to add a separate
Neon implementation that we use as a fallback for machines that don't have
SVE. My understanding is that Neon is virtually universally supported on
64-bit Arm gear, so that will not only help offset the function pointer
overhead but may even improve performance for a much wider set of
machines.

--
nathan
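To make the Neon suggestion concrete, a bare-bones kernel could look like this. This is a sketch only (the function name is made up, and the patch's eventual Neon implementation discussed downthread is more elaborate, e.g., it unrolls the loop); it counts 16 bytes per iteration and finishes byte-wise.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Minimal Neon popcount sketch: vcntq_u8() gives a per-byte bit count and
 * vaddlvq_u8() sums the 16 byte-counts into a scalar.
 */
static uint64_t
popcount_neon(const char *buf, size_t bytes)
{
	uint64_t	popcnt = 0;

	/* 16 bytes per iteration */
	for (; bytes >= sizeof(uint8x16_t); bytes -= sizeof(uint8x16_t))
	{
		uint8x16_t	vec = vld1q_u8((const uint8_t *) buf);

		popcnt += vaddlvq_u8(vcntq_u8(vec));
		buf += sizeof(uint8x16_t);
	}

	/* remaining bytes */
	for (; bytes > 0; bytes--)
		popcnt += __builtin_popcount((unsigned char) *buf++);

	return popcnt;
}
```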
On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandbossart@gmail.com wrote:
> v5-no-sve is the result of using a function pointer, but pointing to the
> "slow" versions instead of the SVE version. v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg is
> the result of the latest patch in this thread modified to process 4
> register's worth of data at a time.
> "slow" versions instead of the SVE version. v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg is
> the result of the latest patch in this thread modified to process 4
> register's worth of data at a time.
Nice, I wonder why I did not observe any performance gain in the 4reg
version. Did you modify the 4reg version code?
One possible explanation is that you used Graviton4 based instances
whereas I used Graviton3 instances.
> For the latter point, I think we should consider trying to add a separate
> Neon implementation that we use as a fallback for machines that don't have
> SVE. My understanding is that Neon is virtually universally supported on
> 64-bit Arm gear, so that will not only help offset the function pointer
> overhead but may even improve performance for a much wider set of machines.
I have added the NEON implementation in the latest patch.
Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
Scalar - 692ms
Neon - 298ms
SVE - 112ms
-Chiranmoy
On Wed, Mar 12, 2025 at 10:34:46AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandbossart@gmail.com wrote:
>
>> v5-no-sve is the result of using a function pointer, but pointing to the
>> "slow" versions instead of the SVE version. v5-sve is the result of the
>> latest patch in this thread on a machine with SVE support, and v5-4reg is
>> the result of the latest patch in this thread modified to process 4
>> register's worth of data at a time.
>
> Nice, I wonder why I did not observe any performance gain in the 4reg
> version. Did you modify the 4reg version code?
>
> One possible explanation is that you used Graviton4 based instances
> whereas I used Graviton3 instances.

Yeah, it looks like the number of vector registers is different [0].

>> For the latter point, I think we should consider trying to add a separate
>> Neon implementation that we use as a fallback for machines that don't have
>> SVE. My understanding is that Neon is virtually universally supported on
>> 64-bit Arm gear, so that will not only help offset the function pointer
>> overhead but may even improve performance for a much wider set of machines.
>
> I have added the NEON implementation in the latest patch.
>
> Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
> Scalar - 692ms
> Neon   - 298ms
> SVE    - 112ms

Those are nice results. I'm a little worried about the Neon implementation
for smaller inputs since it uses a per-byte loop for the remaining bytes,
though. If we can ensure there's no regression there, I think this patch
will be in decent shape.

[0] https://github.com/aws/aws-graviton-getting-started?tab=readme-ov-file#building-for-graviton

--
nathan
On Wed, Mar 13, 2025 at 12:02:07AM +0000, nathandbossart@gmail.com wrote:
> Those are nice results. I'm a little worried about the Neon implementation
> for smaller inputs since it uses a per-byte loop for the remaining bytes,
> though. If we can ensure there's no regression there, I think this patch
> will be in decent shape.
True, the neon implementation in patch v6 did perform worse for smaller inputs.
This is solved in v7, we have added pg_popcount64 to speed up the processing of
smaller inputs/remaining bytes. Also, similar to sve, the neon-2reg version
performed better than neon-1reg but no improvement in neon-4reg.
The below table compares patches v6 and v7 on m7g.4xlarge
Query: SELECT drive_popcount(1000000, 8-byte words);
8-byte words | master | v6-neon-2reg| v7-neon-2reg| v7-sve
--------------+----------+-------------+-------------+--------
1 | 4.051 | 6.239 | 3.431 | 3.343
2 | 4.429 | 10.773 | 3.899 | 3.335
3 | 4.844 | 14.066 | 4.398 | 3.348
4 | 5.324 | 3.342 | 3.663 | 3.365
5 | 5.900 | 7.108 | 4.349 | 4.441
6 | 6.478 | 11.720 | 4.851 | 4.441
7 | 7.192 | 15.686 | 5.551 | 4.447
8 | 8.016 | 4.288 | 4.367 | 4.013
We modified [0] to get the numbers for pg_popcount_masked
8-byte words | master | v7-neon-2reg| v7-sve
--------------+----------+-------------+--------
1 | 4.289 | 4.202 | 3.827
2 | 4.993 | 4.662 | 3.823
3 | 5.981 | 5.459 | 3.834
4 | 6.438 | 4.230 | 3.846
5 | 7.169 | 5.236 | 5.072
6 | 7.949 | 5.922 | 5.106
7 | 9.130 | 6.535 | 5.060
8 | 9.796 | 5.328 | 4.718
512 | 387.543 | 182.801 | 77.077
1024 | 760.644 | 360.660 | 150.519
-Chiranmoy
I've been preparing these for commit, and I've attached what I have so far.
A few notes:

* 0001 just renames the TRY_POPCNT_FAST macro to indicate that it's
  x86_64-specific. IMO this is worth doing independent of this patch set,
  but it's more important with the patch set since we need something
  similar for Aarch64. I think we should also consider moving the x86_64
  stuff to its own file (perhaps combining it with the AVX-512 stuff), but
  that can probably wait until later.

* 0002 introduces the Neon implementation, which conveniently doesn't need
  configure-time checks or function pointers. I noticed that some
  compilers (e.g., Apple clang 16) compile in Neon instructions already,
  but our hand-rolled implementation is better about instruction-level
  parallelism and seems to still be quite a bit faster.

* 0003 introduces the SVE implementation. You'll notice I've moved all the
  function pointer gymnastics into the pg_popcount_aarch64.c file, which is
  where the Neon implementations live, too. I also tried to clean up the
  configure checks a bit. I imagine it's possible to make them more
  compact, but I felt that the enhanced readability was worth it.

* For both Neon and SVE, I do see improvements with looping over 4
  registers at a time, so IMHO it's worth doing so even if it performs the
  same as 2-register blocks on some hardware. I did add a 2-register block
  in the Neon implementation for processing the tail because I was worried
  about its performance on smaller buffers, but that part might get removed
  if I can't measure any difference.

I'm planning to run several more benchmarks, but everything I've seen thus
far has looked pretty good.

--
nathan
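For readers following along, the dispatch being described boils down to a function pointer that is redirected on first call. The following is a sketch of that general shape only, with assumed names (popcount_impl, pg_popcount_neon), not the code in pg_popcount_aarch64.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* kernels and runtime check as discussed upthread (declarations assumed) */
extern uint64_t pg_popcount_sve(const char *buf, int bytes);
extern uint64_t pg_popcount_neon(const char *buf, int bytes);
extern bool pg_popcount_sve_available(void);

static uint64_t popcount_choose(const char *buf, int bytes);

/* starts out pointing at the chooser; redirected on the first call */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

static uint64_t
popcount_choose(const char *buf, int bytes)
{
	if (pg_popcount_sve_available())
		popcount_impl = pg_popcount_sve;
	else
		popcount_impl = pg_popcount_neon;

	return popcount_impl(buf, bytes);
}

uint64_t
pg_popcount(const char *buf, int bytes)
{
	return popcount_impl(buf, bytes);
}
```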
Looks good, the code is more readable now.
> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.
There was no regression on Graviton 3 when using the 4-register version, so we can keep it.
-Chiranmoy
On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> * 0002 introduces the Neon implementation, which conveniently doesn't need
>   configure-time checks or function pointers. I noticed that some
>   compilers (e.g., Apple clang 16) compile in Neon instructions already,
>   but our hand-rolled implementation is better about instruction-level
>   parallelism and seems to still be quite a bit faster.

+pg_popcount64(uint64 word)
+{
+	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}

This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?

Some thoughts to consider, some speculative and maybe not worth putting
time into:

> I did add a 2-register block
> in the Neon implementation for processing the tail because I was worried
> about its performance on smaller buffers, but that part might get removed
> if I can't measure any difference.

Even if we can measure a difference on fixed-sized inputs, that might not
carry over when the branch is unpredictable.

+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf));
+		buf += sizeof(uint64);
+	}

This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.

> * 0003 introduces the SVE implementation. You'll notice I've moved all the
>   function pointer gymnastics into the pg_popcount_aarch64.c file, which is
>   where the Neon implementations live, too. I also tried to clean up the
>   configure checks a bit. I imagine it's possible to make them more
>   compact, but I felt that the enhanced readability was worth it.

I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as well.

> * For both Neon and SVE, I do see improvements with looping over 4
>   registers at a time, so IMHO it's worth doing so even if it performs the
>   same as 2-register blocks on some hardware.

I wonder if alignment matters for these larger blocks.

--
John Naylor
Amazon Web Services
On Mon, Mar 24, 2025 at 06:34:45PM +0700, John Naylor wrote:
> On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> * 0002 introduces the Neon implementation, which conveniently doesn't need
>>   configure-time checks or function pointers. I noticed that some
>>   compilers (e.g., Apple clang 16) compile in Neon instructions already,
>>   but our hand-rolled implementation is better about instruction-level
>>   parallelism and seems to still be quite a bit faster.
>
> +pg_popcount64(uint64 word)
> +{
> +	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
> +}
>
> This confused me until I found that this is what
> __builtin_popcountl(word) would emit anyway. Worth a comment?

Sure thing.

> +	/*
> +	 * Process remaining 8-byte blocks.
> +	 */
> +	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
> +	{
> +		popcnt += pg_popcount64(*((uint64 *) buf));
> +		buf += sizeof(uint64);
> +	}
>
> This uses 16-byte registers, but only loads 8-bytes at a time (with
> accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
> you could instead do a loop over a single local accumulator, which I
> think could have a short accumulation pipeline since 3 iterations
> can't overflow 8-bit lanes. But then the bytewise tail could be up to
> 15 bytes.

Yeah, I wasn't sure how far we wanted to go with this. We could do 4
registers at a time, then 2, then 1, then 8-bytes, then byte-by-byte, but
that's quite a few extra lines of code for the amount of gain, not to
mention the extra overhead. My inclination was to try to keep this as
simple as possible while making sure we didn't regress on small inputs.

>> * 0003 introduces the SVE implementation. You'll notice I've moved all the
>>   function pointer gymnastics into the pg_popcount_aarch64.c file, which is
>>   where the Neon implementations live, too. I also tried to clean up the
>>   configure checks a bit. I imagine it's possible to make them more
>>   compact, but I felt that the enhanced readability was worth it.
>
> I don't know what the configure checks looked like before, but I'm
> confused that the loops are unrolled in the link-test functions as well.

We do need the two separate blocks because they use different intrinsic
functions, but I could probably remove the actual "for" loops themselves to
simplify things a bit.

>> * For both Neon and SVE, I do see improvements with looping over 4
>>   registers at a time, so IMHO it's worth doing so even if it performs the
>>   same as 2-register blocks on some hardware.
>
> I wonder if alignment matters for these larger blocks.

Some earlier benchmarks didn't show anything outside of the noise range
[0].

[0] https://postgr.es/m/OSBPR01MB266403FD4C05DAB58EBBA82897EF2%40OSBPR01MB2664.jpnprd01.prod.outlook.com

--
nathan
I've attached a new set of patches in which I've tried to address John's
feedback.

I ran some new benchmarks with these patches. "M3" is an Apple M3 (my
laptop), "G3" is an r7g.4xlarge, and "G4" is an r8g.4xlarge. "no SVE"
means the patches are applied but the function pointer points to the Neon
implementation. "SVE" and "patched" mean all the patches are applied with
no changes.

 8 byte words | M3 HEAD | M3 patched | G3 HEAD | G3 no SVE | G3 SVE | G4 HEAD | G4 no SVE | G4 SVE
--------------+---------+------------+---------+-----------+--------+---------+-----------+--------
            1 |     3.6 |        3.0 |     3.1 |       2.9 |    3.1 |     2.5 |       2.2 |    1.8
            2 |     6.4 |        4.4 |     3.1 |       3.0 |    3.1 |     2.5 |       2.5 |    2.0
            3 |     7.3 |        6.9 |     3.5 |       3.5 |    3.1 |     3.3 |       3.2 |    2.0
            4 |     8.0 |        3.8 |     4.0 |       2.7 |    4.7 |     3.6 |       2.2 |    2.7
            5 |     9.4 |        5.5 |     4.6 |       2.8 |    4.6 |     3.9 |       2.5 |    2.7
            6 |     7.9 |        5.0 |     5.1 |       3.5 |    4.7 |     4.3 |       3.1 |    3.4
            7 |    10.2 |        7.4 |     5.9 |       4.0 |    4.7 |     4.7 |       3.6 |    3.4
            8 |    12.0 |        5.4 |     6.5 |       4.0 |    5.9 |     5.0 |       3.2 |    2.5
            9 |    11.7 |        6.5 |     7.2 |       4.3 |    5.9 |     5.4 |       3.6 |    2.5
           10 |    12.5 |        5.4 |     8.0 |       4.8 |    5.9 |     6.2 |       3.9 |    3.1
           11 |    14.0 |        8.6 |     8.5 |       5.5 |    5.9 |     6.1 |       5.0 |    3.1
           12 |    13.1 |        5.7 |     9.1 |       5.1 |    7.4 |     6.4 |       3.9 |    3.6
           13 |    12.1 |        6.8 |     9.8 |       5.4 |    7.3 |     6.8 |       4.3 |    3.6
           14 |    16.4 |        7.8 |    10.4 |       5.9 |    7.4 |     7.2 |       4.7 |    4.4
           15 |    17.4 |        8.0 |    11.1 |       6.6 |    7.4 |     7.5 |       5.7 |    4.4
           16 |    15.5 |        5.7 |    11.8 |       5.7 |    4.7 |     7.9 |       5.0 |    3.5
           32 |    26.0 |       16.2 |    22.7 |      10.3 |    6.2 |    16.8 |       8.4 |    5.2
           64 |    38.5 |       20.3 |    42.7 |      20.1 |    9.3 |    31.8 |      15.4 |    8.8
          128 |    75.1 |       35.7 |    86.1 |      35.0 |   15.4 |    80.2 |      28.6 |   16.3
          256 |   117.7 |       51.8 |   179.6 |      68.2 |   27.8 |   154.0 |      55.7 |   30.9
          512 |   198.5 |       93.1 |   329.3 |     134.4 |   52.4 |   246.5 |     110.2 |   59.4
         1024 |   355.0 |      159.2 |   673.6 |     265.8 |  101.7 |   487.0 |     219.0 |  114.7
         2048 |   669.5 |      288.8 |  1294.7 |     529.7 |  200.3 |   969.3 |     438.7 |  228.5
         4096 |  1308.0 |      552.8 |  2784.3 |    1063.0 |  397.4 |  1934.5 |     874.4 |  455.9

IMHO these are acceptable results, at least for the use-cases I see in the
tree. We might be able to minimize the difference between the Neon and SVE
implementations on the low end with some additional code, but I'm really
not sure if it's worth the effort.

Barring feedback or objections, I'm planning to commit these on Friday.

--
nathan
On Wed, Mar 26, 2025 at 04:44:24PM -0500, Nathan Bossart wrote:
> IMHO these are acceptable results, at least for the use-cases I see in the
> tree. We might be able to minimize the difference between the Neon and SVE
> implementations on the low end with some additional code, but I'm really
> not sure if it's worth the effort.

I couldn't resist... I tried a variety of things (e.g., inlining the Neon
implementation to process the tail, jumping to the Neon implementation for
smaller inputs), and the only thing that seemed to be a clear win was to
add a 2-register block in the SVE implementations (like what is already
there for the Neon ones). In particular, that helps bring the Graviton3
SVE numbers closer to the Neon numbers for inputs between 8-16 8-byte
words.

I also noticed a silly mistake in 0003 that would cause us to potentially
skip part of the tail. That should be fixed now.

--
nathan
On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I also noticed a silly mistake in 0003 that would cause us to potentially
> skip part of the tail. That should be fixed now.

I'm not sure whether that meant it could return the wrong answer, or
just make more work for paths further down.
If the former, then our test coverage is not adequate.

Aside from that, I only found one more thing that may be important: I
tried copying the configure/meson checks into godbolt.org, and both
gcc and clang don't like it, so unless there is something weird about
their setup (or my use of it) it's possible some other hosts won't
like it either:

```
<source>:29:10: error: call to 'svwhilelt_b8' is ambiguous
  pred = svwhilelt_b8(0, sizeof(buf));
         ^~~~~~~~~~~~
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15526:10: note: candidate function
svbool_t svwhilelt_b8(uint64_t, uint64_t);
         ^
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15534:10: note: candidate function
svbool_t svwhilelt_b8(int32_t, int32_t);
         ^

<source>: In function 'autoconf_popcount_test':
<source>:29:24: error: call to 'svwhilelt_b8' is ambiguous; argument 1 has type 'int32_t' but argument 2 has type 'uint64_t'
   29 |   pred = svwhilelt_b8(0, sizeof(buf));
      |          ^~~~~~~~~~~~
Compiler returned: 1
```

...Changing it to

    pred = svwhilelt_b8((uint64_t) 0, sizeof(buf));

clears it up.

--
John Naylor
Amazon Web Services
On Thu, Mar 27, 2025 at 03:31:27PM +0700, John Naylor wrote:
> On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I also noticed a silly mistake in 0003 that would cause us to potentially
>> skip part of the tail. That should be fixed now.
>
> I'm not sure whether that meant it could return the wrong answer, or
> just make more work for paths further down.
> If the former, then our test coverage is not adequate.

This one is my bad. I think the issue is that I'm writing this stuff on a
machine that doesn't have SVE, so obviously my tests are happy as long as
the Neon stuff is okay. We do have some tests in bit.sql that should in
theory find this stuff. I'll be sure to verify all of this on a machine
with SVE...

> Aside from that, I only found one more thing that may be important: I
> tried copying the configure/meson checks into godbolt.org, and both
> gcc and clang don't like it, so unless there is something weird about
> their setup (or my use of it) it's possible some other hosts won't
> like it either:
>
> <source>:29:10: error: call to 'svwhilelt_b8' is ambiguous
>   pred = svwhilelt_b8(0, sizeof(buf));
>          ^~~~~~~~~~~~
>
> ...Changing it to
>
>     pred = svwhilelt_b8((uint64_t) 0, sizeof(buf));
>
> clears it up.

Huh, this makes sense, but for some reason Apple clang is fine with it. In
any case, I see that we can optionally specify the expected types of the
arguments by using svwhilelt_b8_u32() (or _u64, etc.). IMHO that is far
clearer. I'm going to add that to all intrinsics that support it in the
next version of the patch set.

--
nathan
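In other words, inside the configure-test program the flagged line could be spelled either way (a sketch of the two options; `pred` and `buf` are the variables from the quoted test):

```c
	/* explicitly typed: both arguments are uint64_t, no overload ambiguity */
	pred = svwhilelt_b8_u64(0, sizeof(buf));

	/* or keep the generic form and disambiguate the literal */
	pred = svwhilelt_b8((uint64_t) 0, sizeof(buf));
```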
Committed.

On Fri, Mar 28, 2025 at 10:25:26AM -0500, Nathan Bossart wrote:
> On Thu, Mar 27, 2025 at 03:31:27PM +0700, John Naylor wrote:
>> On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
>> <nathandbossart@gmail.com> wrote:
>>> I also noticed a silly mistake in 0003 that would cause us to potentially
>>> skip part of the tail. That should be fixed now.
>>
>> I'm not sure whether that meant it could return the wrong answer, or
>> just make more work for paths further down.
>> If the former, then our test coverage is not adequate.
>
> This one is my bad. I think the issue is that I'm writing this stuff on a
> machine that doesn't have SVE, so obviously my tests are happy as long as
> the Neon stuff is okay. We do have some tests in bit.sql that should in
> theory find this stuff. I'll be sure to verify all of this on a machine
> with SVE...

I verified that the tests failed without this fix on a machine with SVE.

--
nathan