Thread: Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

From
Aleksander Alekseev
Date:
Hi Eduard,

> Here I send a patch that adds a vectorized version of CRC32C for the
> IBM S390X hardware. I kindly ask for a review of the code and to pick
> it up in upstream postgresql.
> [...]
> Cheers and thanks to all for their work,

Thanks for submitting this patch.

Please register it on the nearest open commitfest [1] so that it
wouldn't be lost.

I didn't review the patch but wanted to point out that when it comes
to performance improvements it's typically useful to provide some
benchmarks.

[1]: https://commitfest.postgresql.org/53/

-- 
Best regards,
Aleksander Alekseev



On Wed, May 7, 2025 at 8:15 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> I didn't review the patch but wanted to point out that when it comes
> to performance improvements it's typically useful to provide some
> benchmarks.

+1 -- It's good to have concrete numbers for the commit message, and
also to verify improvement on short inputs. There is a test harness in
the  v7-0002 patch from here:

https://www.postgresql.org/message-id/CANWCAZaD5niydBF6q3V_cjApNV05cw-LpxxFtMbwDPLsz-PjbQ@mail.gmail.com

After building, run the "test-crc.sh" script here after executing
"CREATE EXTENSION test_crc32c;":

https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com



--
John Naylor
Amazon Web Services



Hi,

So I worked on the algorithm to also work on buffers between 16-64
bytes. Then I ran the performance measurement on two
dataset[^raw_data_1] [^raw_data_2]. And created two diagrams
[^attachment].

my findings so far:

- the optimized crc32cvx is faster
- the sb8 performance is heavily depending on alignment (see the
ripples every 8 bytes)
- the 8 byte ripple is also visible in the vx implementation. As it can
only perform on 16 or 64 byte chunks, it will still use sb8 for the
remaining bytes.
- there is no obvious speed regression in the vx algorithm. Except
raw_data_2-28 which I assume is a fluke. I am sharing the system with a
bunch of other devs.


I hope this this is acceptable as performance measurement. However we
will setup a dedicated performance test and try to get precise numbers
without side-effects. But it may take some time until we get to that.

I'll post the update on the Code together with the other requested
updates.

cheers, Eddy



[^raw_data_1]
bytes    crc32c_sb8    crc32c_vx
4    6.54 ms        6.548 ms
8    4.476 ms    4.47 ms
10    7.346 ms    7.348 ms
12    10.955 ms    10.958 ms
14    14.548 ms    14.546 ms
16    6.837 ms    6.193 ms
32    12.23 ms    6.741 ms
64    22.826 ms    7.6 ms
80    28.536 ms    8.307 ms
96    34.426 ms    9.09 ms
112    40.295 ms    9.844 ms
128    46.053 ms    10.825 ms
144    51.868 ms    11.712 ms
160    65.91 ms    12.122 ms
176    71.649 ms    13.055 ms
192    77.465 ms    11.716 ms
208    83.286 ms    13.532 ms
224    88.991 ms    13.165 ms
240    94.875 ms    13.881 ms
256    100.653 ms    13.147 ms
8192    2967.477 ms    182.911 ms

[^raw_data_2]
bytes    crc32c_sb8    crc32c_vx
4    6.543 ms    6.536 ms
8    4.476 ms    4.47 ms
10    7.35 ms        7.345 ms
12    10.96 ms    10.954 ms
14    14.552 ms    14.588 ms
16    6.843 ms    6.189 ms
18    10.253 ms    9.814 ms
24    9.645 ms    9.924 ms
28    15.957 ms    17.211 ms
32    12.226 ms    6.726 ms
36    18.823 ms    14.484 ms
42    17.855 ms    14.271 ms
48    17.342 ms    7.344 ms
52    24.208 ms    15.306 ms
58    23.525 ms    14.695 ms
64    22.818 ms    7.593 ms



On Thu, 2025-05-08 at 05:32 +0700, John Naylor wrote:
> On Wed, May 7, 2025 at 8:15 PM Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > I didn't review the patch but wanted to point out that when it
> > comes
> > to performance improvements it's typically useful to provide some
> > benchmarks.
>
> +1 -- It's good to have concrete numbers for the commit message, and
> also to verify improvement on short inputs. There is a test harness
> in
> the  v7-0002 patch from here:
>
> https://www.postgresql.org/message-id/CANWCAZaD5niydBF6q3V_cjApNV05cw-LpxxFtMbwDPLsz-PjbQ@mail.gmail.com
>  
>
> After building, run the "test-crc.sh" script here after executing
> "CREATE EXTENSION test_crc32c;":
>
> https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com
>  
>
>
>
> --
> John Naylor
> Amazon Web Services

Attachment
On Tue, May 27, 2025 at 3:24 AM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:
> So I worked on the algorithm to also work on buffers between 16-64
> bytes. Then I ran the performance measurement on two
> dataset[^raw_data_1] [^raw_data_2]. And created two diagrams
> [^attachment].
>
> my findings so far:
>
> - the optimized crc32cvx is faster
> - the sb8 performance is heavily depending on alignment (see the
> ripples every 8 bytes)

To be precise, these all seem 8-byte aligned at a glance, and the
ripple is due to input length.

> - the 8 byte ripple is also visible in the vx implementation. As it can
> only perform on 16 or 64 byte chunks, it will still use sb8 for the
> remaining bytes.
> - there is no obvious speed regression in the vx algorithm. Except
> raw_data_2-28 which I assume is a fluke. I am sharing the system with a
> bunch of other devs.
>
>
> I hope this this is acceptable as performance measurement. However we
> will setup a dedicated performance test and try to get precise numbers
> without side-effects. But it may take some time until we get to that.

This already looks like a solid improvement at 32 bytes and above -- I
don't think we need less noisy numbers. Also for future reference,
please reply in-line. Thanks!

--
John Naylor
Amazon Web Services