Re: vectorized CRC on ARM64 - Mailing list pgsql-hackers

From Haibo Yan
Subject Re: vectorized CRC on ARM64
Date
Msg-id C3ADF28D-E6D4-41D2-ADE2-C7DD53EA8A5C@gmail.com
In response to Re: vectorized CRC on ARM64  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: vectorized CRC on ARM64
List pgsql-hackers
Hi John

Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.

On Jan 12, 2026, at 1:27 AM, John Naylor <johncnaylorls@gmail.com> wrote:

On Wed, May 14, 2025 I wrote:

We did something similar for x86 for v18, and here is some progress
towards Arm support.

Coming back to this, since there's been recent interest in Arm support.

v2 is a rebase, with a few changes.

- I simplified it by leaving out the inlining for "assume CRC" builds,
since I wanted to avoid alignment considerations if I can. I think
always indirecting through a pointer will have less risk of
regressions in a realistic setting than for x86 since Arm chips
typically have low latency for carryless multiplication instructions.
With just a bit of code we can still use the direct call for small
constant inputs, so I did that to avoid regressions under WAL insert
lock.

- One coding idiom for a vector literal in the generated code was
giving pgindent indigestion, so I rewrote it using Neon intrinsics and
verified it in Godbolt.
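
For readers following along, the "indirect through a pointer, but call directly for small constant inputs" idea from the first item above can be sketched roughly as below. This is only an illustration and is not taken from the patch: the names crc32c_scalar, crc32c_vector_standin, crc32c_choose, CRC32C_SMALL_LEN and COMP_CRC32C_SKETCH are hypothetical, the 32-byte threshold is arbitrary, the "vector" path is a stand-in that just calls the bitwise loop, and __builtin_constant_p is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc32c_fn)(uint32_t crc, const void *data, size_t len);

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78); slow but portable. */
static uint32_t
crc32c_scalar(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(data != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical stand-in for a PMULL-based routine; same result, no SIMD. */
static uint32_t
crc32c_vector_standin(uint32_t crc, const void *data, size_t len)
{
    return crc32c_scalar(crc, data, len);
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* All non-trivial calls go through this pointer. */
static crc32c_fn crc32c_impl = crc32c_choose;

/* First call picks an implementation; a real version would test CPU features. */
static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    crc32c_impl = crc32c_vector_standin;
    return crc32c_impl(crc, data, len);
}

#define CRC32C_SMALL_LEN 32     /* arbitrary threshold for this sketch */

/* Small compile-time-constant lengths skip the pointer; note len is evaluated twice. */
#define COMP_CRC32C_SKETCH(crc, data, len) \
    ((__builtin_constant_p(len) && (len) < CRC32C_SMALL_LEN) \
     ? crc32c_scalar((crc), (data), (len)) \
     : crc32c_impl((crc), (data), (len)))
```

With the usual 0xFFFFFFFF initial value and final inversion, both paths should agree on the standard check string "123456789".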

0002: Like 3c6e8c12389 and in fact uses the same program to generate
the code, by specifying Neon instructions with the Arm "crypto"
extension instead. There are some interesting differences from x86
here as well:
- The upstream implementation chose to use inline assembly instead of
intrinsics for some reason. I initially thought that was a way to get
broader compiler support, but it turns out you still need to pass the
relevant flags to get the assembly to link.


Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.

Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
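
To make the question concrete, here is a rough sketch of the two styles I have in mind. It is not copied from the patch or from the upstream generator: the helper names (clmul_xor_ref, clmul_lo_xor_intrin, clmul_lo_xor_asm) are made up for illustration, and the feature-macro guard is my assumption about how a compiler advertises PMULL support. The portable reference function is only there so the sketch also builds on non-Arm machines.

```c
#include <stdint.h>

typedef struct
{
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Portable reference: carryless 64x64 -> 128 multiply, XORed into acc. */
static u128_parts
clmul_xor_ref(uint64_t a, uint64_t b, u128_parts acc)
{
    for (int i = 0; i < 64; i++)
    {
        if ((b >> i) & 1)
        {
            acc.lo ^= a << i;
            if (i > 0)
                acc.hi ^= a >> (64 - i);
        }
    }
    return acc;
}

#if defined(__aarch64__) && \
    (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>

/* Intrinsics: PMULL on the low lanes, then EOR with c. */
static inline uint64x2_t
clmul_lo_xor_intrin(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
    poly128_t prod = vmull_p64((poly64_t) vgetq_lane_u64(a, 0),
                               (poly64_t) vgetq_lane_u64(b, 0));

    return veorq_u64(vreinterpretq_u64_p128(prod), c);
}

/* Inline asm: keeps PMULL and EOR adjacent in the instruction stream. */
static inline uint64x2_t
clmul_lo_xor_asm(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
    uint64x2_t r;

    /* "=&w": r is written before c is read, so it must not share c's register */
    __asm__("pmull %0.1q, %1.1d, %2.1d\n\t"
            "eor   %0.16b, %0.16b, %3.16b"
            : "=&w"(r)
            : "w"(a), "w"(b), "w"(c));
    return r;
}
#endif
```

Both Arm variants should compute the same value as clmul_xor_ref on the low lanes; the open question is only whether the compiler reliably keeps the pmull and eor together in the intrinsic form.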



To follow up for curiosity's sake, [1] says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.

- I only have Meson support for now, since I used MacOS on CI to test.
That OS and compiler combination apparently targets the CRC extension,
but the PMULL instruction runtime check uses Linux-only headers, I
believe, so previously I hacked the choose function to return true for
testing. The choose function in 0002 is untested in this form.

This is still true, but now the CI hack lives in a separate
not-for-commit patch for clarity.
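
As an aside on the Linux-only runtime check mentioned above, a minimal sketch of what such a check might look like is below. This is an assumption about the general approach, not the choose function from the patch: cpu_has_pmull is a hypothetical name, and on anything other than Linux/AArch64 it simply reports false, which is why a CI override would be needed on macOS.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>           /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>          /* HWCAP_PMULL */
#endif

/* Returns true only when the kernel reports PMULL support via the aux vector. */
static bool
cpu_has_pmull(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
    /* No portable way to ask here; be conservative. */
    return false;
#endif
}
```

A choose function could call this once and then store the selected implementation in a function pointer, so the check is not repeated on every CRC computation.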

autoconf support is a WIP, and I will share that after I do some
testing on an Arm Linux instance.

[1] https://dougallj.github.io/applecpu/firestorm.html

--
John Naylor
Amazon Web Services
<v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch><v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch>

Regards
Haibo
