Thread: CRC32C Parallel Computation Optimization on ARM

CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

20 October 2023, 07:08:58

Hi all

This patch uses a parallel computation algorithm to improve CRC32C performance on ARM. The algorithm comes from the Intel white paper crc-iscsi-polynomial-crc32-instruction-paper. The input data is divided into three equal-sized blocks. Three parallel blocks (crc0, crc1, crc2) are used per 1024 bytes. One block is 42 (BLK_LENGTH) * 8 (step length: crc32c_u64) bytes.
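A minimal sketch of the three-way split described above, under some loud assumptions: it uses a bitwise software CRC as a stand-in for the hardware crc32c instruction, and it shifts partial CRCs by feeding zero bytes, where the patch presumably uses vmull_p64() with precomputed constants. All function names here are hypothetical. With BLK_LENGTH = 42, one block is 336 bytes and the three blocks together cover 1008 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u          /* reflected Castagnoli polynomial */
#define BLK_LENGTH  42                    /* 8-byte steps per block */
#define BLK_BYTES   (BLK_LENGTH * 8)      /* 336 bytes per block */

/* Advance the raw CRC register by one byte (software stand-in for crc32c). */
static uint32_t
crc32c_step(uint32_t crc, unsigned char b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ ((crc & 1u) ? CRC32C_POLY : 0u);
    return crc;
}

/* Raw CRC over len bytes: no initial or final inversion is applied here. */
static uint32_t
crc32c_raw(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
        crc = crc32c_step(crc, *p++);
    return crc;
}

/* Shift a CRC past len zero bytes (assumed stand-in for the vmull_p64 step). */
static uint32_t
crc32c_shift(uint32_t crc, size_t len)
{
    while (len--)
        crc = crc32c_step(crc, 0);
    return crc;
}

/* CRC of 3 * BLK_BYTES bytes, computed as three independent streams. */
uint32_t
crc32c_three_way(uint32_t crc, const unsigned char *p)
{
    assert(p != NULL);

    uint32_t crc0 = crc32c_raw(crc, p, BLK_BYTES);   /* carries the seed */
    uint32_t crc1 = crc32c_raw(0, p + BLK_BYTES, BLK_BYTES);
    uint32_t crc2 = crc32c_raw(0, p + 2 * BLK_BYTES, BLK_BYTES);

    /* Align crc0 and crc1 with the end of the buffer, then fold them in. */
    return crc32c_shift(crc0, 2 * BLK_BYTES)
         ^ crc32c_shift(crc1, BLK_BYTES)
         ^ crc2;
}
```

The combination works because the raw CRC update is linear over GF(2), so the three streams have no data dependency on each other until the final fold. That independence is what lets a CPU keep several crc32c instructions in flight at once.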

CRC32C unit test: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4

CRC32C benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9

It gets a ~2x speedup compared to using the Arm crc32c instructions linearly.

I'll create a CommitFest entry for this submission.

Any comments or feedback are welcome.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Attachment

0001-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Michael Paquier

Date:

20 October 2023, 08:18:56

On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
> This patch uses a parallel computing optimization algorithm to
> improve crc32c computing performance on ARM. The algorithm comes
> from Intel whitepaper:
> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
> into three equal-sized blocks.Three parallel blocks (crc0, crc1,
> crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length:
> crc32c_u64) bytes
>
> Crc32c unitest: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
> Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
> It gets ~2x speedup compared to linear Arm crc32c instructions.

Interesting.  Could you attach to this thread the test files you
used and the results obtained, please?  If this data gets deleted from
github, then it would not be possible to refer back to what you did or
to the related benchmark results.

Note that your patch is forgetting about meson; it just patches
./configure.
--
Michael

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

24 October 2023, 21:09:54

On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote:
> On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
>> This patch uses a parallel computing optimization algorithm to
>> improve crc32c computing performance on ARM. The algorithm comes
>> from Intel whitepaper:
>> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
>> into three equal-sized blocks.Three parallel blocks (crc0, crc1,
>> crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length:
>> crc32c_u64) bytes 
>> 
>> Crc32c unitest: https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
>> Crc32c benchmark: https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
>> It gets ~2x speedup compared to linear Arm crc32c instructions.
> 
> Interesting.  Could you attached to this thread the test files you
> used and the results obtained please?  If this data gets deleted from
> github, then it would not be possible to refer back to what you did at
> the related benchmark results.
> 
> Note that your patch is forgetting about meson; it just patches
> ./configure.

I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions).  There was almost
no change for the 512-byte case, but there was a ~60% speedup for the
4096-byte case.

However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
benchmark [0].  I haven't had a chance to dig further, unfortunately.
Assuming I'm not doing something wrong, I don't think such a result should
necessarily disqualify this optimization, though.

[0] https://postgr.es/m/ec487192-f6aa-509a-cacb-6642dad14209%40iki.fi

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

24 October 2023, 21:18:06

On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
> I'm able to reproduce the speedup with the provided benchmark on an Apple
> M1 Pro (which appears to have the required instructions).  There was almost
> no change for the 512-byte case, but there was a ~60% speedup for the
> 4096-byte case.
> 
> However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
> benchmark [0].  I haven't had a chance to dig further, unfortunately.
> Assuming I'm not doing something wrong, I don't think such a result should
> necessarily disqualify this optimization, though.

Actually, since the pg_waldump benchmark likely only involves very small
WAL records, it would make sense that there isn't much difference.
*facepalm*

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Heikki Linnakangas

Date:

24 October 2023, 21:37:45

On 25/10/2023 00:18, Nathan Bossart wrote:
> On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
>> I'm able to reproduce the speedup with the provided benchmark on an Apple
>> M1 Pro (which appears to have the required instructions).  There was almost
>> no change for the 512-byte case, but there was a ~60% speedup for the
>> 4096-byte case.
>>
>> However, I couldn't produce any noticeable speedup with Heikki's pg_waldump
>> benchmark [0].  I haven't had a chance to dig further, unfortunately.
>> Assuming I'm not doing something wrong, I don't think such a result should
>> necessarily disqualify this optimization, though.
> 
> Actually, since the pg_waldump benchmark likely only involves very small
> WAL records, it would make sense that there isn't much difference.
> *facepalm*

No need to guess, pg_waldump -z will tell you what the record size is. 
And you can vary it by changing the checkpoint interval and/or pgbench 
scale factor: if you checkpoint frequently or if the database is larger, 
you get more full-page images which makes the records larger on average, 
and vice versa.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Re: CRC32C Parallel Computation Optimization on ARM

From

Michael Paquier

Date:

24 October 2023, 22:17:55

On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote:
> On 25/10/2023 00:18, Nathan Bossart wrote:
>> Actually, since the pg_waldump benchmark likely only involves very small
>> WAL records, it would make sense that there isn't much difference.
>> *facepalm*
>
> No need to guess, pg_waldump -z will tell you what the record size is. And
> you can vary it by changing the checkpoint interval and/or pgbench scale
> factor: if you checkpoint frequently or if the database is larger, you get
> more full-page images which makes the records larger on average, and vice
> versa.

If you are looking at computing the CRC of records with arbitrary
sizes, why not just generate a series with
pg_logical_emit_message() before doing a comparison with pg_waldump or
a custom replay loop to go through the records?  At least it would
make the results more predictable.
--
Michael

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

25 October 2023, 01:45:39

On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote:
> If you are looking at computing the CRC of records with arbitrary
> sizes, why not just generating a series with
> pg_logical_emit_message() before doing a comparison with pg_waldump or
> a custom replay loop to go through the records?  At least it would
> make the results more predictible.

I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
without the patch and around 7.4 seconds with it (an 8% improvement).
pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
patch and around 2.4 seconds with it (a 25% improvement).

Given the performance characteristics and relative simplicity of the patch,
I think this could be worth doing.  I suspect we'll want to do something
similar for x86, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

25 October 2023, 03:38:20

Thanks for your suggestion.  Attached are the modified patch and two test files.

-----Original Message-----
From: Michael Paquier <michael@paquier.xyz>
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao <Xiang.Gao@arm.com>
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM

On Fri, Oct 20, 2023 at 07:08:58AM +0000, Xiang Gao wrote:
> This patch uses a parallel computing optimization algorithm to improve
> crc32c computing performance on ARM. The algorithm comes from Intel
> whitepaper:
> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
> into three equal-sized blocks.Three parallel blocks (crc0, crc1,
> crc2) for 1024 Bytes.One Block: 42(BLK_LENGTH) * 8(step length:
> crc32c_u64) bytes
>
> Crc32c unitest:
> https://gist.github.com/gaoxyt/138fd53ca1eead8102eeb9204067f7e4
> Crc32c benchmark:
> https://gist.github.com/gaoxyt/4506c10fc06b3501445e32c4257113e9
> It gets ~2x speedup compared to linear Arm crc32c instructions.

Interesting.  Could you attach to this thread the test files you used and the results obtained, please?  If this data
gets deleted from github, then it would not be possible to refer back to what you did or to the related benchmark results.

Note that your patch is forgetting about meson; it just patches ./configure.
--
Michael

Attachment

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

25 October 2023, 15:43:25

+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)

It looks like most of this function is duplicated from
pg_comp_crc32c_armv8().  I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static inline helper function with a branch for when vmull_p64() can be
used.  Its callers would then just provide a boolean to indicate which
branch to take.
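A rough sketch of that shape, assuming hypothetical function names and a bitwise software CRC in place of the real intrinsics. The "vmull" branch below is only a placeholder that walks whole 1024-byte chunks; it does not perform the actual parallel computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Advance the raw CRC-32C register by one byte (software stand-in). */
static inline uint32_t
crc32c_byte(uint32_t crc, unsigned char b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    return crc;
}

/*
 * Shared body.  use_vmull is a constant at each call site, so once this is
 * inlined the compiler can discard the branch that is not taken.
 */
static inline uint32_t
crc32c_common(uint32_t crc, const void *data, size_t len, bool use_vmull)
{
    const unsigned char *p = data;

    assert(p != NULL || len == 0);

    if (use_vmull)
    {
        /* Placeholder for the parallel path: consume whole 1024-byte chunks. */
        while (len >= 1024)
        {
            for (size_t i = 0; i < 1024; i++)
                crc = crc32c_byte(crc, p[i]);
            p += 1024;
            len -= 1024;
        }
    }

    /* Tail handling (or the whole input when use_vmull is false). */
    while (len--)
        crc = crc32c_byte(crc, *p++);
    return crc;
}

/* The two entry points differ only in the constant they pass down. */
uint32_t
crc32c_plain(uint32_t crc, const void *data, size_t len)
{
    return crc32c_common(crc, data, len, false);
}

uint32_t
crc32c_with_vmull(uint32_t crc, const void *data, size_t len)
{
    return crc32c_common(crc, data, len, true);
}
```

Keeping the shared loop in one place means a fix to the tail handling applies to both entry points, while each entry point still compiles down to a specialized body.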

+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+    USE_ARMV8_VMULL=1
+  fi
+fi

Hm.  I wonder if we need to switch to a runtime check in some cases.  For
example, what happens if the ARMv8 intrinsics used today are found with the
default compiler flags, but vmull_p64() is only available if
-march=armv8-a+crypto is added?  It looks like the precedent is to use a
runtime check if we need extra CFLAGS to produce code that uses the
intrinsics.

Separately, I wonder if we should just always do runtime checks for the CRC
stuff whenever we can produce code with the intrinsics, regardless of
whether we need extra CFLAGS.  The check doesn't look terribly expensive,
and it might allow us to simplify the code a bit (especially now that we
support a few different architectures).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

26 October 2023, 07:28:35

On Wed,  25 Oct,  2023 at 10:43:25 -0500, Nathan Bossart wrote:
>+pg_crc32c
>+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)

>It looks like most of this function is duplicated from
>pg_comp_crc32c_armv8().  I understand that we probably need a separate
>function because of the runtime check, but perhaps we could create a common
>static inline helper function with a branch for when vmull_p64() can be
>used.  It's callers would then just provide a boolean to indicate which
>branch to take.

I have reworked the patch as suggested.

>+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
>+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
>+    USE_ARMV8_VMULL=1
>+  fi
>+fi

>Hm.  I wonder if we need to switch to a runtime check in some cases.  For
>example, what happens if the ARMv8 intrinsics used today are found with the
>default compiler flags, but vmull_p64() is only available if
>-march=armv8-a+crypto is added?  It looks like the precedent is to use a
>runtime check if we need extra CFLAGS to produce code that uses the
>intrinsics.

We think a runtime check is needed in every scenario.
The configure test here only confirms that the code compiles.
The runtime check is done when choosing which algorithm to use.
You can think of it as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

>Separately, I wonder if we should just always do runtime checks for the CRC
>stuff whenever we can produce code with the intrinics, regardless of
>whether we need extra CFLAGS.  The check doesn't look terribly expensive,
>and it might allow us to simplify the code a bit (especially now that we
>support a few different architectures).

Yes, I think so. USE_ARMV8_CRC32C only means that the code compiled successfully;
it does not guarantee that the code can run correctly on the local machine.
Therefore, a runtime check is required during actual operation.
To keep the changes minimal, we plan to address this in the next patch.
If the community agrees, we will continue to improve it later, for example by merging the x86 and ARM code.

Attachment

0003-crc32c-parallel-computation-optimization-on-arm.patch

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

26 October 2023, 08:53:31

On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
>I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
>without the patch and around 7.4 seconds with it (an 8% improvement).
>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>patch and around 2.4 seconds with it (a 25% improvement).

Could you please provide details on how you generated the ~8kB and ~16kB records? Thanks!

Re: CRC32C Parallel Computation Optimization on ARM

From

Bharath Rupireddy

Date:

26 October 2023, 09:06:44

On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao <Xiang.Gao@arm.com> wrote:
>
> On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
> >I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
> >without the patch and around 7.4 seconds with it (an 8% improvement).
> >pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
> >patch and around 2.4 seconds with it (a 25% improvement).
>
> Could you please provide details on how to generate these 8kB size or 16kB size data? Thanks!

Here's a script that I use to generate WAL records of various sizes,
change it to taste if useful:

# $JUMBO and $PGWORKSPACE must point at the scratch and pgbench directories.
for m in 16 64 256 1024 4096 8192 16384
do
    echo "Start of run with WAL size $m bytes at:"
    date
    echo "SELECT pg_logical_emit_message(true, 'mymessage', repeat('d', $m));" >> $JUMBO/scripts/dumbo$m.sql
    # Repeat the run for each client count.
    for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096
    do
      $PGWORKSPACE/pgbench -n postgres -c$c -j$c -T60 -f $JUMBO/scripts/dumbo$m.sql > $JUMBO/results/dumbo$m:$c.out
    done
    echo "End of run with WAL size $m bytes at:"
    date
    echo "\n"
done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

26 October 2023, 16:37:52

On Thu, Oct 26, 2023 at 07:28:35AM +0000, Xiang Gao wrote:
> On Wed,  25 Oct,  2023 at 10:43:25 -0500, Nathan Bossart wrote:
>>+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
>>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
>>+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
>>+    USE_ARMV8_VMULL=1
>>+  fi
>>+fi
> 
>>Hm.  I wonder if we need to switch to a runtime check in some cases.  For
>>example, what happens if the ARMv8 intrinsics used today are found with the
>>default compiler flags, but vmull_p64() is only available if
>>-march=armv8-a+crypto is added?  It looks like the precedent is to use a
>>runtime check if we need extra CFLAGS to produce code that uses the
>>intrinsics.
> 
> We consider that a runtime check needs to be done in any scenario.
> Here we only confirm that the compilation can be successful.
> A runtime check will be done when choosing which algorithm.
> You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

Oh.  Looking again, I see that we are using a runtime check for ARM in all
cases with this patch.  If so, maybe we should just remove
USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
opportunities to simplify things, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

26 October 2023, 20:18:13

On Thu, Oct 26, 2023 at 08:53:31AM +0000, Xiang Gao wrote:
> On  Tue,  24 Oct,  2023 20:45:39PM -0500, Nathan Bossart wrote:
>>I tried this.  pg_waldump on 2 million ~8kB records took around 8.1 seconds
>>without the patch and around 7.4 seconds with it (an 8% improvement).
>>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>>patch and around 2.4 seconds with it (a 25% improvement).
> 
> Could you please provide details on how to generate these 8kB size or 16kB size data? Thanks!

I did something like

    do $$
    begin
        for i in 1..1000000
        loop
            perform pg_logical_emit_message(false, 'test', repeat('0123456789', 800));
        end loop;
    end;
    $$;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

27 October 2023, 07:01:10

On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>> We consider that a runtime check needs to be done in any scenario.
>> Here we only confirm that the compilation can be successful.
> >A runtime check will be done when choosing which algorithm.
> >You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.

>Oh.  Looking again, I see that we are using a runtime check for ARM in all
>cases with this patch.  If so, maybe we should just remove
>USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>opportunities to simplify things, too.

Yes, I have removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.

Attachment

0004-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

30 October 2023, 16:21:43

On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>>> We consider that a runtime check needs to be done in any scenario.
>>> Here we only confirm that the compilation can be successful.
>> >A runtime check will be done when choosing which algorithm.
>> >You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
> 
>>Oh.  Looking again, I see that we are using a runtime check for ARM in all
>>cases with this patch.  If so, maybe we should just remove
>>USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>>USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>>opportunities to simplify things, too.
> 
> Yes, I have been removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.

Thanks.  I went ahead and split this prerequisite part out to a separate
thread [0] since it's sort-of unrelated to your proposal here.  It's not
really a prerequisite, but I do think it will simplify things a bit.

[0] https://postgr.es/m/20231030161706.GA3011%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

31 October 2023, 20:48:21

On Mon, Oct 30, 2023 at 11:21:43AM -0500, Nathan Bossart wrote:
> On Fri, Oct 27, 2023 at 07:01:10AM +0000, Xiang Gao wrote:
>> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>>>> We consider that a runtime check needs to be done in any scenario.
>>>> Here we only confirm that the compilation can be successful.
>>> >A runtime check will be done when choosing which algorithm.
>>> >You can think of us as merging USE_ARMV8_VMULL and USE_ARMV8_VMULL_WITH_RUNTIME_CHECK into USE_ARMV8_VMULL.
>> 
>>>Oh.  Looking again, I see that we are using a runtime check for ARM in all
>>>cases with this patch.  If so, maybe we should just remove
>>>USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in a prerequisite patch (and have
>>>USE_ARMV8_CRC32C always do the runtime check).  I suspect there are other
>>>opportunities to simplify things, too.
>> 
>> Yes, I have been removed USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK in this patch.
> 
> Thanks.  I went ahead and split this prerequisite part out to a separate
> thread [0] since it's sort-of unrelated to your proposal here.  It's not
> really a prerequisite, but I do think it will simplify things a bit.

Per the other thread [0], we should try to avoid the runtime check when
possible, as it seems to produce a small regression.  This means that if
the ARMv8 CRC instructions are found with the default compiler flags, we
can only use vmull_p64() if it can also be used with the default flags.
Otherwise, we can just do the runtime check.

[0] https://postgr.es/m/2620794.1698783160%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

02 November 2023, 06:17:20

On Tue, 31 Oct 2023 15:48:21PM -0500, Nathan Bossart wrote:
>> Thanks.  I went ahead and split this prerequisite part out to a separate
>> thread [0] since it's sort-of unrelated to your proposal here.  It's not
>> really a prerequisite, but I do think it will simplify things a bit.

>Per the other thread [0], we should try to avoid the runtime check when
>possible, as it seems to produce a small regression.  This means that if
>the ARMv8 CRC instructions are found with the default compiler flags, we
>can only use vmull_p64() if it can also be used with the default flags.
>Otherwise, we can just do the runtime check.

>[0] https://postgr.es/m/2620794.1698783160%40sss.pgh.pa.us

After reading the discussion, I understand that in order to avoid a performance
regression in some cases, we need to try our best to avoid runtime checks.
I am not sure I have understood this correctly.
If so, we need to decide whether to use the ARM CRC32C and VMULL instructions
directly or behind a runtime check. There will be many scenarios to handle and
the code will become more complicated.
Could you please give me some suggestions about how to refine this patch?
Thanks very much!

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

02 November 2023, 14:35:50

On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
> After reading the discussion, I understand that in order to avoid performance
> regression in some instances, we need to try our best to avoid runtime checks.
> I don't know if I understand it correctly.

The idea is that we don't want to start forcing runtime checks on builds
where we aren't already doing runtime checks.  IOW if the compiler can use
the ARMv8 CRC instructions with the default compiler flags, we should only
use vmull_p64() if it can also be used with the default compiler flags.  I
suspect this limitation sounds worse than it actually is in practice.  The
vast majority of the buildfarm uses runtime checks, and at least some of
the platforms that don't, such as the Apple M-series machines, seem to
include +crypto by default.

Of course, if a compiler picks up +crc but not +crypto in its defaults, we
could lose the vmull_p64() optimization on that platform.  But as noted in
the other thread, we can revisit if these kinds of hypothetical situations
become reality.

> Could you please give me some suggestions about how to refine this patch?

Of course.  I think we'll ultimately want to independently check for the
availability of the new instruction like we do for the other sets of
intrinsics:

    PGAC_ARMV8_VMULL_INTRINSICS([])
    if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
        PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
    fi

My current thinking is that we'll want to add USE_ARMV8_VMULL and
USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
compile.  I'll admit I haven't fully thought through every detail yet, but
I'm cautiously optimistic that we can avoid too much complexity in the
autoconf/meson scripts.
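To make the three outcomes concrete, here is a rough sketch of how the two macros might select what gets compiled. It is an assumption-laden illustration, not the patch: crc32c_plain(), crc32c_vmull(), cpu_has_pmull(), CRC_DISPATCH and COMP_CRC32C are hypothetical names, and both "implementations" are the same bitwise software CRC so that the example builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise stand-in shared by both "implementations" in this sketch. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(p != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
    return crc;
}

/* Hypothetical names: the plain CRC-instruction path and the vmull path. */
uint32_t crc32c_plain(uint32_t crc, const void *d, size_t n) { return crc32c_bitwise(crc, d, n); }
uint32_t crc32c_vmull(uint32_t crc, const void *d, size_t n) { return crc32c_bitwise(crc, d, n); }

/* Hypothetical runtime probe; a real one would ask the OS or the CPU. */
int cpu_has_pmull(void) { return 0; }

enum { DISPATCH_PLAIN, DISPATCH_VMULL, DISPATCH_VMULL_RUNTIME };

#if defined(USE_ARMV8_VMULL)
/* vmull_p64() works with the default flags: call the fast path directly. */
#define CRC_DISPATCH DISPATCH_VMULL
#define COMP_CRC32C(crc, d, n) crc32c_vmull((crc), (d), (n))
#elif defined(USE_ARMV8_VMULL_WITH_RUNTIME_CHECK)
/* Extra CFLAGS were needed: decide at run time, at the cost of a branch. */
#define CRC_DISPATCH DISPATCH_VMULL_RUNTIME
#define COMP_CRC32C(crc, d, n) \
    (cpu_has_pmull() ? crc32c_vmull((crc), (d), (n)) : crc32c_plain((crc), (d), (n)))
#else
/* vmull_p64() is unavailable: builds behave exactly as they do today. */
#define CRC_DISPATCH DISPATCH_PLAIN
#define COMP_CRC32C(crc, d, n) crc32c_plain((crc), (d), (n))
#endif
```

Compiling with neither macro defined takes the last branch, which is the property the discussion cares about: builds that do not do a runtime check today do not start paying for one.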

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

03 November 2023, 10:46:57

On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote:

>On Thu, Nov 02, 2023 at 06:17:20AM +0000, Xiang Gao wrote:
>> After reading the discussion, I understand that in order to avoid performance
>> regression in some instances, we need to try our best to avoid runtime checks.
> >I don't know if I understand it correctly.

>The idea is that we don't want to start forcing runtime checks on builds
>where we aren't already doing runtime checks.  IOW if the compiler can use
>the ARMv8 CRC instructions with the default compiler flags, we should only
>use vmull_p64() if it can also be used with the default compiler flags.  I
>suspect this limitation sounds worse than it actually is in practice.  The
>vast majority of the buildfarm uses runtime checks, and at least some of
>the platforms that don't, such as the Apple M-series machines, seem to
>include +crypto by default.

>Of course, if a compiler picks up +crc but not +crypto in its defaults, we
>could lose the vmull_p64() optimization on that platform.  But as noted in
>the other thread, we can revisit if these kinds of hypothetical situations
>become reality.

>> Could you please give me some suggestions about how to refine this patch?

>Of course.  I think we'll ultimately want to independently check for the
>availability of the new instruction like we do for the other sets of
>intrinsics:
>
>       PGAC_ARMV8_VMULL_INTRINSICS([])
>       if test x"$pgac_armv8_vmull_intrinsics" != x"yes"; then
>               PGAC_ARMV8_VMULL_INTRINSICS([-march=armv8-a+crypto])
>       fi
>
>My current thinking is that we'll want to add USE_ARMV8_VMULL and
>USE_ARMV8_VMULL_WITH_RUNTIME_CHECK and use those to decide exactly what to
>compile.  I'll admit I haven't fully thought through every detail yet, but
>I'm cautiously optimistic that we can avoid too much complexity in the
>autoconf/meson scripts.

Thank you so much!
This is the newest patch. I think the code for choosing which CRC algorithm to use is a bit complicated. Maybe we can
define only USE_ARMV8_VMULL and always do a runtime check for the vmull_p64 instruction. This will not affect
existing builds, because this is a new instruction and new logic. In addition, it would reduce the complexity of the
code.
I'm very much looking forward to your suggestions, thank you!

Attachment

0005-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

06 November 2023, 19:16:13

On Fri, Nov 03, 2023 at 10:46:57AM +0000, Xiang Gao wrote:
> On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote:
>> The idea is that we don't want to start forcing runtime checks on builds
>> where we aren't already doing runtime checks.  IOW if the compiler can use
>> the ARMv8 CRC instructions with the default compiler flags, we should only
>> use vmull_p64() if it can also be used with the default compiler flags.
>
> This is the newest patch, I think the code for which crc algorithm to
> choose is a bit complicated. Maybe we can just use USE_ARMV8_VMULL only,
> and do runtime checks on the vmull_p64 instruction at all times. This
> will not affect the existing builds, because this is a new instruction
> and new logic. In addition, it  can also reduce the complexity of the
> code.

I don't think we can.  AFAICT a runtime check necessitates a function
pointer or a branch, both of which incurred an impact on performance in my
tests.  It looks like this latest patch still does the runtime check even
for the USE_ARMV8_CRC32C case.
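For illustration, the function-pointer form of a runtime check can be sketched as below, with hypothetical names and a bitwise software CRC standing in for both code paths. The pointer starts out aimed at a chooser that probes the CPU once and then redirects itself; every later call still goes through the pointer, and that indirection is the overhead in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* Bitwise stand-in for the existing CRC-instruction path. */
static uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(p != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
    return crc;
}

/* Stand-in for the vmull_p64() path; it must return identical results. */
static uint32_t
crc32c_hw(uint32_t crc, const void *data, size_t len)
{
    return crc32c_sw(crc, data, len);
}

/* Hypothetical probe; a real one would ask the OS or the CPU. */
static int
cpu_has_pmull(void)
{
    return 0;
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* Callers always go through this pointer, even after it is resolved. */
crc32c_fn comp_crc32c = crc32c_choose;

/* First call: pick an implementation, remember it, and forward the call. */
static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    comp_crc32c = cpu_has_pmull() ? crc32c_hw : crc32c_sw;
    return comp_crc32c(crc, data, len);
}
```

A build that knows at compile time which implementation to use can call it directly and let the compiler inline it, which is why forcing this pattern onto the USE_ARMV8_CRC32C case would be a regression.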

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

07 November 2023, 08:05:45

On Mon, 6 Nov 2023 13:16:13PM -0600, Nathan Bossart wrote:
>>> The idea is that we don't want to start forcing runtime checks on builds
>>>where we aren't already doing runtime checks.  IOW if the compiler can use
>>>the ARMv8 CRC instructions with the default compiler flags, we should only
>>>use vmull_p64() if it can also be used with the default compiler flags.
>>
>>This is the newest patch, I think the code for which crc algorithm to
>>choose is a bit complicated. Maybe we can just use USE_ARMV8_VMULL only,
>>and do runtime checks on the vmull_p64 instruction at all times. This
>>will not affect the existing builds, because this is a new instruction
>>and new logic. In addition, it  can also reduce the complexity of the
>>code.

>I don't think we can.  AFAICT a runtime check necessitates a function
>pointer or a branch, both of which incurred an impact on performance in my
>tests.  It looks like this latest patch still does the runtime check even
>for the USE_ARMV8_CRC32C case.

I think I understand what you mean, this is the latest patch. Thank you!

Attachment

0006-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

10 November 2023, 16:36:08

On Tue, Nov 07, 2023 at 08:05:45AM +0000, Xiang Gao wrote:
> I think I understand what you mean, this is the latest patch. Thank you!

Thanks for the new patch.

+# PGAC_ARMV8_VMULL_INTRINSICS
+# ----------------------------
+# Check if the compiler supports the vmull_p64
+# intrinsic functions. These instructions
+# were first introduced in ARMv8 crypto Extension.

I wonder if it'd be better to call this PGAC_ARMV8_CRYPTO_INTRINSICS since
this check seems to indicate the presence of +crypto.  Presumably there are
other instructions in this extension that could be used elsewhere, in which
case we could reuse this.

+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too.
+if test x"$USE_ARMV8_VMULL" = x"" && test x"$USE_ARMV8_VMULL_WITH_RUNTIME_CHECK" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1"|| test x"$USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK" = x"1"); then
 
+  if test x"$pgac_armv8_vmull_intrinsics" = x"yes" && test x"$CFLAGS_VMULL" = x""; then
+    USE_ARMV8_VMULL=1
+  else
+    if test x"$pgac_armv8_vmull_intrinsics" = x"yes"; then
+      USE_ARMV8_VMULL_WITH_RUNTIME_CHECK=1
+    fi
+  fi
+fi

I'm not sure I see the need to check USE_ARMV8_CRC32C* when setting these.
Couldn't we set them solely on the results of our
PGAC_ARMV8_VMULL_INTRINSICS check?  It looks like this is what you are
doing in meson.build already.

+extern pg_crc32c pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len);

nitpick: Maybe pg_comp_crc32c_armv8_parallel()?

-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)

Why are these lines deleted?

-  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
+  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],

What is the purpose of this change?

+__attribute__((target("+crc+crypto")))

I'm not sure we can assume that all compilers will understand this, and I'm
not sure we need it.

+    if (use_vmull)
+    {
+/*
+ * Crc32c parallel computation Input data is divided into three
+ * equal-sized blocks. Block length : 42 words(42 * 8 bytes).
+ * CRC0: 0 ~ 41 * 8,
+ * CRC1: 42 * 8 ~ (42 * 2 - 1) * 8,
+ * CRC2: 42 * 2 * 8 ~ (42 * 3 - 1) * 8.
+ */

Shouldn't we surround this with #ifdefs for USE_ARMV8_VMULL*?

     if (pg_crc32c_armv8_available())
+    {
+#if defined(USE_ARMV8_VMULL)
+        pg_comp_crc32c = pg_comp_crc32c_with_vmull_armv8;
+#elif defined(USE_ARMV8_VMULL_WITH_RUNTIME_CHECK)
+        if (pg_vmull_armv8_available())
+        {
+            pg_comp_crc32c = pg_comp_crc32c_with_vmull_armv8;
+        }
+        else
+        {
+            pg_comp_crc32c = pg_comp_crc32c_armv8;
+        }
+#else
         pg_comp_crc32c = pg_comp_crc32c_armv8;
+#endif
+    }

IMO it'd be better to move the #ifdefs into the functions so that we can
simplify this to something like

    if (pg_crc32c_armv8_available())
    {
        if (pg_crc32c_armv8_crypto_available())
            pg_comp_crc32c = pg_comp_crc32c_armv8_parallel;
        else
            pg_comp_crc32c = pg_comp_crc32c_armv8;
    }
    else
        pg_comp_crc32c = pg_comp_crc32c_sb8;

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

22 November 2023, 10:16:44

On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:

>-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
>-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
>Why are these lines deleted?
>
>-  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
>+  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
>
>What is the purpose of this change?

Because I added `__attribute__((target("+crc+crypto")))` before the functions that require the CRC and crypto
extensions, CFLAGS_CRC is no longer needed for them, so these lines are removed here.

>+__attribute__((target("+crc+crypto")))
>
>I'm not sure we can assume that all compilers will understand this, and I'm
>not sure we need it.

CFLAGS_CRC is "-march=armv8-a+crc". Generally, if -march is supported, __attribute__ is also supported.
In addition, I am not sure about the source file pg_crc32c_armv8.c: if CFLAGS_CRC and CFLAGS_CRYPTO are both needed at
the same time, how should that be expressed in the makefile?


Attachment

0007-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

22 November 2023, 21:06:18

On Wed, Nov 22, 2023 at 10:16:44AM +0000, Xiang Gao wrote:
> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>>+__attribute__((target("+crc+crypto")))
>>
>>I'm not sure we can assume that all compilers will understand this, and I'm
>>not sure we need it.
> 
> CFLAGS_CRC is "-march=armv8-a+crc". Generally, if -march is supported,
> __attribute__ is also supported.

IMHO we should stick with CFLAGS_CRC for now.  If we want to switch to
using __attribute__((target("..."))), I think we should do so in a separate
patch.  We are cautious about checking the availability of an attribute
before using it (see c.h), and IIUC we'd need to verify that this works for
all supported compilers that can target ARM before removing CFLAGS_CRC
here.

> In addition, I am not sure about the source file pg_crc32c_armv8.c, if
> CFLAGS_CRC and CFLAGS_CRYPTO are needed at the same time, how should it
> be expressed in the makefile?

pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

23 November 2023, 08:05:26

On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:

>> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>>>+__attribute__((target("+crc+crypto")))
>>>
>>>I'm not sure we can assume that all compilers will understand this, and I'm
>>>not sure we need it.
>>
>> CFLAGS_CRC is "-march=armv8-a+crc". Generally, if -march is supported,
>> __attribute__ is also supported.

>IMHO we should stick with CFLAGS_CRC for now.  If we want to switch to
>using __attribute__((target("..."))), I think we should do so in a separate
>patch.  We are cautious about checking the availability of an attribute
>before using it (see c.h), and IIUC we'd need to verify that this works for
>all supported compilers that can target ARM before removing CFLAGS_CRC
>here.

I agree.

>> In addition, I am not sure about the source file pg_crc32c_armv8.c, if
>> CFLAGS_CRC and CFLAGS_CRYPTO are needed at the same time, how should it
>> be expressed in the makefile?
>
>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}

It does not work correctly. CFLAGS becomes '-march=armv8-a+crc -march=armv8-a+crypto', and only the last option,
'-march=armv8-a+crypto', actually takes effect.

We set a new variable, CFLAGS_CRC_CRYPTO, in configure.ac:

if test x"$CFLAGS_CRC" != x"" || test x"$CFLAGS_CRYPTO" != x""; then
  CFLAGS_CRC_CRYPTO='-march=armv8-a+crc+crypto'
fi

and then in the makefile:
pg_crc32c_armv8.o: CFLAGS += $(CFLAGS_CRC_CRYPTO)

And same thing in meson.build.  In src/port/meson.build,

replace_funcs_pos = [
  # arm / aarch64
  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C', 'crc_crypto'],
  ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc_crypto'],
  ['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc_crypto'],
  ['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
]
'pg_crc32c_armv8' also needs 'crc_crypto' when 'USE_ARMV8_CRC32C'.

Looking forward to your feedback, thanks!


Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

30 November 2023, 20:54:26

On Thu, Nov 23, 2023 at 08:05:26AM +0000, Xiang Gao wrote:
> On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
> 
> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.
> 
> We set a new variable, CFLAGS_CRC_CRYPTO, in configure.ac:
> 
> if test x"$CFLAGS_CRC" != x"" || test x"$CFLAGS_CRYPTO" != x""; then
>   CFLAGS_CRC_CRYPTO='-march=armv8-a+crc+crypto'
> fi
> 
> and then in the makefile:
> pg_crc32c_armv8.o: CFLAGS += $(CFLAGS_CRC_CRYPTO)

Ah, I see.  We need to append +crc and/or +crypto based on what the
compiler understands.  That seems fine to me...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

From

Xiang Gao

Date:

04 December 2023, 07:27:01

On Date: Thu, 30 Nov 2023 14:54:26PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
>>
>> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
>> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.
>>
>> We set a new variable, CFLAGS_CRC_CRYPTO, in configure.ac:
>>
>> if test x"$CFLAGS_CRC" != x"" || test x"$CFLAGS_CRYPTO" != x""; then
>>   CFLAGS_CRC_CRYPTO='-march=armv8-a+crc+crypto'
>> fi
>>
>> and then in the makefile:
>> pg_crc32c_armv8.o: CFLAGS += $(CFLAGS_CRC_CRYPTO)
>
>Ah, I see.  We need to append +crc and/or +crypto based on what the
>compiler understands.  That seems fine to me...

This is the latest patch. Looking forward to your feedback, thanks!

Attachment

0008-crc32c-parallel-computation-optimization-on-arm.patch

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

05 December 2023, 04:18:09

On Mon, Dec 04, 2023 at 07:27:01AM +0000, Xiang Gao wrote:
> This is the latest patch. Looking forward to your feedback, thanks!

Thanks for the new patch.  I am hoping to spend much more time on this in
the near future...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: CRC32C Parallel Computation Optimization on ARM

From

Dmitry Dolgov

Date:

01 December 2024, 22:00:14

> On Mon, Dec 04, 2023 at 10:18:09PM -0600, Nathan Bossart wrote:
>
> Thanks for the new patch.  I am hoping to spend much more time on this in
> the near future...

Hi,

The patch looks interesting, having around 8% improvement on that sounds
attractive. Nathan, do you plan to come back to it and finish the
review?

One side note, I think it would be great to properly cite the white
paper the patch is referring to. Besides paying some respect to the
authors, it will also make it easier to actually find it. After a quick
search I found only some references to [1], but this link doesn't seem
to be available anymore.

[1]:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

04 December 2024, 03:15:19

On Mon, Dec 4, 2023 at 2:27 PM Xiang Gao <Xiang.Gao@arm.com> wrote:
>
> [v8 patch]

I have a couple quick thoughts on this:

1. I looked at a couple implementations of this idea, and found that
the constants used in the carryless multiply are tied to the length of
the blocks. With a lookup table we can do the 3-way algorithm on any
portion of a full block length, rather than immediately fall to doing
CRC serially. That would be faster on average. See for example
https://github.com/komrad36/CRC/tree/master , but I don't think we
actually have to fully unroll the loop like they do there.

2. With the above, we can use a larger full block size, and so on
average less time would be spent in the carryless multiply. With that,
we could possibly get away with an open coded loop in normal C rather
than a new intrinsic (also found in the above repo). That would be
more portable.
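
For reference, an open-coded carryless multiply in plain C can be as small as the sketch below. This is only an illustration of the idea, not the routine from the linked repository or from any patch in this thread; the name `demo_clmul32` is made up for the example. It multiplies two 32-bit polynomials over GF(2), so partial products are combined with XOR instead of addition and no carries propagate.

```c
#include <stdint.h>

/*
 * Carryless (polynomial) multiplication of two 32-bit values.
 * Each set bit i of b contributes a copy of a shifted left by i,
 * and the copies are combined with XOR rather than addition.
 * The product of two 32-bit polynomials fits in 63 bits.
 */
static uint64_t
demo_clmul32(uint32_t a, uint32_t b)
{
    uint64_t result = 0;

    for (int i = 0; i < 32; i++)
    {
        if ((b >> i) & 1)
            result ^= (uint64_t) a << i;
    }
    return result;
}
```

A loop like this costs up to 32 iterations per multiply, which is why it only pays off when the blocks are large enough that the multiply is a small share of the total work.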

--
John Naylor
Amazon Web Services.

Re: CRC32C Parallel Computation Optimization on ARM

From

Nathan Bossart

Date:

11 December 2024, 19:54:27

On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> I added a port to x86 and poked at it, with the intent to have an easy
> on-ramp to that at least accelerates computation of CRCs on FPIs.
> 
> The 0008 patch only worked on chunks of 1024 at a time. At that size,
> the presence of hardware carryless multiplication is not that
> important. I removed the hard-coded constants in favor of a lookup
> table, so now it can handle anything up to 8400 bytes in a single
> pass.
> 
> There are still some "taste" issues, but I like the overall shape here
> and how light it was. With more hardware support, we can go much lower
> than 1024 bytes, but that can be left for future work.

Nice.  I'm curious how this compares to both the existing implementations
and the proposed ones that require new intrinsics.  I like the idea of
avoiding new runtime and config checks, especially if the performance is
somewhat comparable for the most popular cases (i.e., dozens of bytes to a
few thousand bytes).

If we still want to add new intrinsics, would it be easy enough to add them
on top of this patch?  Or would it require further restructuring?

-- 
nathan

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

12 December 2024, 05:30:59

On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:

> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice.  I'm curious how this compares to both the existing implementations
> and the proposed ones that require new intrinsics.  I like the idea of
> avoiding new runtime and config checks, especially if the performance is
> somewhat comparable for the most popular cases (i.e., dozens of bytes to a
> few thousand bytes).

With 8k inputs on x86 it's fairly close to 3x faster than master.

I wasn't very clear, but v9 still has a cutoff of 1008 bytes just to
copy from 0008, but on a slightly old machine the crossover point is
about 400-600 bytes. Doing microbenchmarks that hammer on single
instructions is very finicky, so I don't trust these numbers much.

With hardware CLMUL, I'm guessing cutoff would be between 120 and 192
bytes (must be a multiple of 24 -- 3 words), and would depend on
architecture. Arm has an advantage that vmull_p64() operates on
scalars, but on x86 the corresponding operation is
_mm_clmulepi64_si128() , and there's a bit of shuffling in and out of
vector registers.
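
To show why the input is split into three word-aligned streams (hence the multiple of 24 bytes), here is a minimal software-only sketch of the split-and-combine step. It is not the patch's code: all `demo_*` names are hypothetical, and the "shift" that advances a partial CRC past a block is done by feeding zero bytes, which is the slow but obviously correct stand-in for the carryless multiply by a precomputed constant. It relies on the CRC register update being linear over GF(2).

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_POLY 0x82F63B78u   /* CRC-32C, reflected form */

/* Serial bitwise CRC-32C update over len bytes. */
static uint32_t
demo_crc_update(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (DEMO_POLY & (0u - (crc & 1)));
    }
    return crc;
}

/*
 * Advance crc as if len zero bytes had been appended.  A real
 * implementation would replace this loop with a carryless multiply
 * by a constant that depends only on len.
 */
static uint32_t
demo_crc_shift(uint32_t crc, size_t len)
{
    while (len--)
    {
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (DEMO_POLY & (0u - (crc & 1)));
    }
    return crc;
}

/* Three independent streams of whole 8-byte words, then a serial tail. */
static uint32_t
demo_crc_3way(uint32_t crc, const unsigned char *p, size_t len)
{
    size_t   block = (len / 24) * 8;    /* bytes per stream */
    uint32_t crc0 = demo_crc_update(crc, p, block);
    uint32_t crc1 = demo_crc_update(0, p + block, block);
    uint32_t crc2 = demo_crc_update(0, p + 2 * block, block);

    /* Fold the streams together: shift past the next block, then XOR. */
    crc0 = demo_crc_shift(crc0, block) ^ crc1;
    crc0 = demo_crc_shift(crc0, block) ^ crc2;

    return demo_crc_update(crc0, p + 3 * block, len - 3 * block);
}
```

In the sketch the three `demo_crc_update` calls run one after another; the speedup in a real implementation comes from interleaving the three streams so that the CPU can overlap the latency of the CRC instructions.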

> If we still want to add new intrinsics, would it be easy enough to add them
> on top of this patch?  Or would it require further restructuring?

I'm still trying to wrap my head around how function selection works
after commit 4b03a27fafc , but it could be something like this on x86:

#if defined(__has_attribute) && __has_attribute (target)

pg_attribute_target("sse4.2,pclmul")
pg_comp_crc32c_sse42
{
  <big loop with special case for end>
  <hardware carryless multiply>
  <tail>
}

#endif

pg_attribute_target("sse4.2")
pg_comp_crc32c_sse42
{
  <big loop>
  <software carryless multiply>
  <tail>
}

...where we have the tail part in a separate function for readability.

On Arm it might have to be as complex as in 0008, since as you've
mentioned, compiler support for the needed attributes is still pretty
new.

--
John Naylor
Amazon Web Services

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

18 December 2024, 09:59:32

On Mon, Dec 2, 2024 at 2:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> One side note, I think it would be great to properly cite the white
> paper the patch is referring to. Besides paying some respect to the
> authors, it will also make it easier to actually find it. After a quick
> search I found only some references to [1], but this link doesn't seem
> to be available anymore.

I found an archive:

https://web.archive.org/web/20220802143127/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

One thing I noticed is this part:

"The basic concepts in this paper are derived from and explained in detail in
the patents and pending applications [4][5][6]."
...
[4] Determining a Message Residue, Gopal et al. United States Patent 7,886,214
[5] Determining a Message Residue Gueron et al. United States Patent Application
20090019342
[6] Determining a Message Residue Gopal et al. United States Patent Application
20090158132

Searching for the first one gives

https://patents.google.com/patent/US20090019342

which says
"Status Expired - Fee Related
2029-09-03 Adjusted expiration"

On the other hand, looking at Linux kernel sources, it seems a patch
using this technique was contributed by Intel over a decade ago:

https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S

So one more thing to ask our friends at Intel.

--
John Naylor
Amazon Web Services

RE: CRC32C Parallel Computation Optimization on ARM

From

"Devulapalli, Raghuveer"

Date:

10 March 2025, 23:36:04

Hi John, 

> On the other hand, looking at Linux kernel sources, it seems a patch using this
> technique was contributed by Intel over a decade ago:
> 
> https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> 
> So one more thing to ask our friends at Intel.

Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on similar techniques to postgres. 

[1]
https://www.postgresql.org/message-id/PH8PR11MB8286F844321BA1DEEC518348FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com
[2]
https://www.postgresql.org/message-id/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com

Raghuveer

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

11 March 2025, 04:30:30

On Tue, Mar 11, 2025 at 3:36 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> Hi John,
>
> > On the other hand, looking at Linux kernel sources, it seems a patch using this
> > technique was contributed by Intel over a decade ago:
> >
> > https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
> >
> > So one more thing to ask our friends at Intel.
>
> Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on similar techniques to postgres.
>
> [1]
https://www.postgresql.org/message-id/PH8PR11MB8286F844321BA1DEEC518348FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com
> [2]
https://www.postgresql.org/message-id/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com

No, these are not similar at all. I gave you the paper name and the
patents cited therein here:

https://www.postgresql.org/message-id/CANWCAZbkt89_fVAaCAGBMznwA_xh%3D2Ci5q4GZytZHKjZAEjCRQ%40mail.gmail.com

--
John Naylor
Amazon Web Services

RE: CRC32C Parallel Computation Optimization on ARM

From

"Devulapalli, Raghuveer"

Date:

11 March 2025, 20:46:04

Hi John, 

I am happy to submit a patch with a C fallback version that leverages the specific algorithm/technique mentioned in the
whitepaper to make it clear that Intel has contributed this specific technique to Postgres under Postgres license
terms. That should hopefully address any lingering concerns anyone may have w.r.t using this technique for the benefit
of Postgres.
 

Raghuveer


Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

12 March 2025, 09:51:08

On Wed, Mar 12, 2025 at 12:46 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> I am happy to submit a patch with a C fallback version that leverages the specific algorithm/technique mentioned in
> the white paper to make it clear that Intel has contributed this specific technique to Postgres under Postgres license
> terms. That should hopefully address any lingering concerns anyone may have w.r.t using this technique for the benefit
> of Postgres.

Thanks for offering, but I'm unclear if that's actually necessary. I'm
still confused as to what the status of the patents are. From your
last response:

> Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on similar techniques to postgres.

...this is a restatement of facts we already know. I'm guessing the
intended takeaway is "since Intel submitted an implementation to us
based on paper A, then we are free to separately also use a technique
from paper B (which cites patents)". I'd be delighted to hear that, if
that's what you found from talking to a legal team, but it's not clear
to me.

The original proposal that started this thread is below, and I'd like
to give that author credit for initiating that work, as long as there
is no legal issue with that:

https://www.postgresql.org/message-id/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com

--
John Naylor
Amazon Web Services

RE: CRC32C Parallel Computation Optimization on ARM

From

"Devulapalli, Raghuveer"

Date:

12 March 2025, 20:49:50

> > Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on
> similar techniques to postgres.
> 
> ...this is a restatement of facts we already know. I'm guessing the intended
> takeaway is "since Intel submitted an implementation to us based on paper A,
> then we are free to separately also use a technique from paper B (which cites
> patents)". 

Yes. 

> The original proposal that started this thread is below, and I'd like to give that
> author credit for initiating that work

Yup, that should be fine. 

Raghuveer

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

18 March 2025, 14:50:59

On Thu, Mar 13, 2025 at 12:50 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > > Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on
> > similar techniques to postgres.
> >
> > ...this is a restatement of facts we already know. I'm guessing the intended
> > takeaway is "since Intel submitted an implementation to us based on paper A,
> > then we are free to separately also use a technique from paper B (which cites
> > patents)".
>
> Yes.
>
> > The original proposal that started this thread is below, and I'd like to give that
> > author credit for initiating that work
>
> Yup, that should be fine.

Thank you for confirming. I've attached v10, which has mostly
polishing and comment writing, and a draft commit message. The lookup
table and software carryless multiplication routine are still in
pg_crc32c_sb8.c , which is now built unconditionally. That's good
foreshadowing of future pclmul/pmull support, as I've found building
that file everywhere makes some things simpler anyway. That file has
become a bit of a misnomer, and I've thought of renaming it to
*_common.c or perhaps *_fallback.c , since the addition from this
patch is still kind of a fallback where we won't have the hardware
needed for faster algorithms, as discussed elsewhere.

0002-3 puts the relevant parts into a header so that the hardware
details can be abstracted away. These would be squashed, but I've kept
them separate here for comparison.

--
John Naylor
Amazon Web Services

Attachment

Re: CRC32C Parallel Computation Optimization on ARM

From

Bruce Momjian

Date:

18 March 2025, 20:54:52

On Wed, Mar 12, 2025 at 01:51:08PM +0700, John Naylor wrote:
> On Wed, Mar 12, 2025 at 12:46 AM Devulapalli, Raghuveer
> <raghuveer.devulapalli@intel.com> wrote:
> >
> > I am happy to submit a patch with a C fallback version that leverages the specific algorithm/technique mentioned in
> > the white paper to make it clear that Intel has contributed this specific technique to Postgres under Postgres license
> > terms. That should hopefully address any lingering concerns anyone may have w.r.t using this technique for the benefit
> > of Postgres.
 
> 
> Thanks for offering, but I'm unclear if that's actually necessary. I'm
> still confused as to what the status of the patents are. From your
> last response:
> 
> > Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on similar techniques to postgres.
> 
> ...this is a restatement of facts we already know. I'm guessing the
> intended takeaway is "since Intel submitted an implementation to us
> based on paper A, then we are free to separately also use a technique
> from paper B (which cites patents)". I'd be delighted to hear that, if
> that's what you found from talking to a legal team, but it's not clear
> to me.

Contributing code is copyright, which is unrelated to patents.  I don't
think the Postgres community even has a method of accepting patent usage
grants.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.

Re: CRC32C Parallel Computation Optimization on ARM

From

John Naylor

Date:

02 April 2025, 09:45:02

On Wed, Mar 19, 2025 at 12:54 AM Bruce Momjian <bruce@momjian.us> wrote:
> Contributing code is copyright, which is unrelated to patents.  I don't
> think the Postgres community even has a method of accepting patent usage
> grants.

Since the legal status is still unclear, I've marked the CF entry
Returned with Feedback.

--
John Naylor
Amazon Web Services