Thread: [PATCH] SVE popcount support

[PATCH] SVE popcount support

From
"Malladi, Rama"
Date:

Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.

 

[0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Re: [PATCH] SVE popcount support

From
Kirill Reshke
Date:
On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:
>
> Attachments protected by Amazon: 0001-SVE-popcount-support.patch | SVE-popcount-support-PostgreSQL.png |
> Amazon has replaced the attachments in this email with download links. Downloads will be available until December 27,
2024,15:43 (UTC+00:00). Tell us what you think 
> For more information click here
>
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0]
toput together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and
itshowed the expected improvements with many bytes. 
>
>
>
> [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi! To register entry on commitfest you need to send patch in one of
this format:

https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F

This is useful for reviewers who use cfbot or cputube.

--
Best regards,
Kirill Reshke



Re: [PATCH] SVE popcount support

From
Kirill Reshke
Date:


On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:
>
> Attachments protected by Amazon: 0001-SVE-popcount-support.patch | SVE-popcount-support-PostgreSQL.png |
> Amazon has replaced the attachments in this email with download links. Downloads will be available until December 27, 2024, 15:43 (UTC+00:00). Tell us what you think
> For more information click here
>
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.
>

>
> [0] https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi!
I did look inside this patch. This was implemented mostly in the same way as the current instructure selecting code, which is good.


=== patch itself

1)
> // for small buffer sizes (<= 128-bytes), execute 1-byte SVE instructions
> // for larger buffer sizes (> 128-bytes), execute 1-byte + 8-byte SVE instructions
> // loop unroll by 2
PostgreSQL uses /*  */ comment style.

2)

> + if (bytes <= 128)
> + {
> + prologue_loop_bytes = bytes;
> + }
> + else
> + {
> + aligned_buf   = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t);
> + prologue_loop_bytes   = aligned_buf - buf;
> + }


For a single line stmt PostgreSQL does not use parenthesis. Examples [0] & [1]


[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68


3) `if (bytes > 128)` Loop in pg_popcount_sve function should be commented. There is too much code without any comment why it works. For example, If original source of this is some paper or other work, we can reference it.


==== by-hand benching (I also use  John Naylor's module)

non-patched

```
db1=# \timing
Timing is on.
db1=# select drive_popcount(10000000, 10000);
 drive_popcount
----------------
          64608
(1 row)

Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)

db1=# select drive_popcount64(10000000, 10000);
 drive_popcount64
------------------
            64608
(1 row)

Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```

patched

```
db1=# select drive_popcount(10000000, 10000);
 drive_popcount
----------------
          64608
(1 row)

Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
 drive_popcount64
------------------
            64608
(1 row)

Time: 200716.879 ms (02:21.717) -- with small variance
```

I'm not sure how to interpret these results. Looks like this does not help much on a large $num?

--
Best regards,
Kirill Reshke

Re: [PATCH] SVE popcount support

From
Bruce Momjian
Date:
On Wed, Nov 27, 2024 at 03:43:27PM +0000, Malladi, Rama wrote:
>   • Attachments protected by Amazon:
>   • 0001-SVE-popcount-support.patch |
>   • SVE-popcount-support-PostgreSQL.png |
> 
> Amazon has replaced the attachments in this email with download links.
> Downloads will be available until December 27, 2024, 15:43 (UTC+00:00). Tell us
> what you think
> For more information click here
> 
> Please find attached a patch to PostgreSQL implementing SVE popcount. I used
> John Naylor's test_popcount module [0] to put together the attached graphs.
> This test didn't show any regressions with a relatively small number of bytes,
> and it showed the expected improvements with many bytes.

You must attach actual attachments for this to be considered.  Download
links are unacceptable.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"



Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:
> Thank you, Kirill, for the review and the feedback. Please find inline my
> reply and an updated patch.

Thanks for the updated patch.  I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

> +# Check for ARMv8 SVE popcount intrinsics
> +#
> +CFLAGS_POPCNT=""
> +PG_POPCNT_OBJS=""
> +PGAC_SVE_POPCNT_INTRINSICS([])
> +if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
> +  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
> +fi
> +if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
> +  PG_POPCNT_OBJS="pg_popcount_sve.o"
> +  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime
check.])
> +fi
> +AC_SUBST(CFLAGS_POPCNT)
> +AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27).  The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code.  IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.

> +#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
> +extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line?  I'm thinking
of something like

    #if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
    extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
    #endif

    #ifdef TRY_POPCNT_FAST
    ...

> +#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
> +extern uint64 pg_popcount_sve(const char *buf, int bytes);
> +extern int check_sve_support(void);
> +#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

> +static inline uint64
> +pg_popcount_choose(const char *buf, int bytes)
> +{
> +    if (check_sve_support())
> +        pg_popcount_optimized = pg_popcount_sve;
> +    else
> +        pg_popcount_optimized = pg_popcount_slow;
> +    return pg_popcount_optimized(buf, bytes);
> +}
> +
> +#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?

> +// check if sve supported
> +int check_sve_support(void)
> +{
> +    // Read ID_AA64PFR0_EL1 register
> +    uint64_t pfr0;
> +    __asm__ __volatile__(
> +    "mrs %0, ID_AA64PFR0_EL1"
> +    : "=r" (pfr0));
> +
> +    // SVE bits are 32-35
> +    return (pfr0 >> 32) & 0xf;
> +}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?

> +/*
> + * pg_popcount_sve
> + *              Returns the number of 1-bits in buf
> + */
> +uint64
> +pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

-- 
nathan



Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
The meson configure check seems to fail on my machine:

    error: too many arguments to function call, expected 0, have 1
       10 |     svuint64_t popcnt = svcntb(val);
          |                         ~~~~~~ ^~~

    error: returning '__SVInt64_t' from a function with incompatible result type 'int'
       12 |     return popcnt == 0;
          |            ^~~~~~~~~~~

The autoconf version seems to work okay, though.

+    pgac_save_CFLAGS=$CFLAGS
+    CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,
right?

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it
(e.g., PGAC_XSAVE_INTRINSICS).

+  prog = '''
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    svuint64_t popcnt = svcntb(val);
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''

This test looks quite different than the autoconf one.  Why is that?  I
would expect them to be the same.  And I think ideally the test would check
that all the intrinsics functions we need are available.

+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+    return getauxval(AT_HWCAP) & HWCAP_SVE;
+}

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
more complicated than this.  Are we missing something here?

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs?  I'm also curious what level of performance degradation you
were seeing.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
the #include for pg_bitutils.h (but below the one for c.h).

-- 
nathan



Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> +    /*
>> +     * For smaller inputs, aligning the buffer degrades the performance.
>> +     * Therefore, the buffers only when the input size is sufficiently large.
>> +     */
> 
>> Is the inverse true, i.e., does aligning the buffer improve performance for
>> larger inputs?  I'm also curious what level of performance degradation you
>> were seeing.
> 
> Here is a comparison of all three cases. Alignment is marginally better for inputs
> above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
> Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
> Always Aligned => condition "bytes > 4 * vec_len" is removed.
> Unaligned => the whole if block was removed
> 
>  buf    | Always Aligned | Aligned After 128B | Unaligned
> --------+---------------+--------------------+------------
>    16   |       37.851  |           38.203   |     34.971
>    32   |       37.859  |           38.187   |     34.972
>    64   |       37.611  |           37.405   |     34.121
>   128   |       45.357  |           45.897   |     41.890
>   256   |       62.440  |           63.454   |     58.666
>   512   |      100.120  |          102.767   |     99.861
>  1024   |      159.574  |          158.594   |    164.975
>  2048   |      282.354  |          281.198   |    283.937
>  4096   |      532.038  |          531.068   |    533.699
>  8192   |     1038.973  |         1038.083   |   1039.206
> 16384   |     2028.604  |         2025.843   |   2033.940

Hm.  These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment.  Is there any reason not
to do that?

+    /* Process 2 complete vectors */
+    for (; i < loop_bytes; i += vec_len * 2)
+    {
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+        accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+        vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+        accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+    }

Does this hand-rolled loop unrolling offer any particular advantage?  What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

-- 
nathan



Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> Does this hand-rolled loop unrolling offer any particular advantage?  What
>> do the numbers look like if we don't do this or if we process, say, 4
>> vectors at a time?
> 
> The unrolled version performs better than the non-unrolled one, but
> processing four vectors provides no additional benefit. The numbers
> and code used are given below.

Hm.  Any idea why that is?  I wonder if the compiler isn't using as many
SVE registers as it could for this.

-- 
nathan



Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
On Thu, Feb 06, 2025 at 10:33:35AM -0600, Nathan Bossart wrote:
> On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>>> Does this hand-rolled loop unrolling offer any particular advantage?  What
>>> do the numbers look like if we don't do this or if we process, say, 4
>>> vectors at a time?
>> 
>> The unrolled version performs better than the non-unrolled one, but
>> processing four vectors provides no additional benefit. The numbers
>> and code used are given below.
> 
> Hm.  Any idea why that is?  I wonder if the compiler isn't using as many
> SVE registers as it could for this.

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine.  After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

../postgresql/src/port/pg_bitutils.c:230:41: error: invalid output constraint '=q' in asm
  230 | __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
      |                                         ^
../postgresql/src/port/pg_bitutils.c:247:41: error: invalid output constraint '=q' in asm
  247 | __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
      |                                         ^
2 errors generated.

-- 
nathan



Re: [PATCH] SVE popcount support

From
"Chiranmoy.Bhattacharya@fujitsu.com"
Date:
> Hm.  Any idea why that is?  I wonder if the compiler isn't using as many
> SVE registers as it could for this.

Not sure, we tried forcing loop unrolling using the below line in the MakeFile
but the results are the same.

pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native


> I've also noticed that the latest patch doesn't compile on my M3 macOS
> machine.  After a quick glance, I think the problem is that the
> TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
> versions.

Fixed, we tried using the existing "choose" logic guarded by TRY_POPCNT_FAST.
The latest patch bypasses TRY_POPCNT_FAST by having a separate choose logic
for aarch64. 


-Chiranmoy
Attachment

Re: [PATCH] SVE popcount support

From
Nathan Bossart
Date:
On Wed, Feb 19, 2025 at 09:31:50AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> Hm.  Any idea why that is?  I wonder if the compiler isn't using as many
>> SVE registers as it could for this.
> 
> Not sure, we tried forcing loop unrolling using the below line in the MakeFile
> but the results are the same.
> 
> pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native

Interesting.  I do see different assembly with the 2 and 4 register
versions, but I didn't get to testing it on a machine with SVE support
today.

Besides some additional benchmarking, I might make some small adjustments
to the patch.  But overall, it seems to be in decent shape.

-- 
nathan