Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.
On Jan 12, 2026, at 1:27 AM, John Naylor <johncnaylorls@gmail.com> wrote:
On Wed, May 14, 2025 I wrote:
We did something similar for x86 for v18, and here is some progress towards Arm support.
Coming back to this, since there's been recent interest in Arm support.
v2 is a rebase, with a few changes.
- I simplified it by leaving out the inlining for "assume CRC" builds, since I wanted to avoid alignment considerations if I can. I think always indirecting through a pointer will have less risk of regressions in a realistic setting than for x86 since Arm chips typically have low latency for carryless multiplication instructions. With just a bit of code we can still use the direct call for small constant inputs, so I did that to avoid regressions under WAL insert lock.
- One coding idiom for a vector literal in the generated code was giving pgindent indigestion, so I rewrote it using Neon intrinsics and verified it in Godbolt.
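To make the "indirect through a pointer, but call directly for small constant inputs" idea concrete, here is a minimal sketch. It is not the patch's code: the names `crc32c_direct`, `crc32c_ptr`, `CRC32C_COMP` and the threshold `SMALL_CONST_LEN` are hypothetical, and a bitwise loop stands in for the CRC-instruction path so that the example compiles anywhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff for "small" inputs; the real value is a tuning choice. */
#define SMALL_CONST_LEN 32

/*
 * Stand-in for the plain CRC-instruction path: bitwise CRC-32C using the
 * reflected Castagnoli polynomial 0x82F63B78.
 */
static uint32_t
crc32c_direct(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/*
 * Assumption: a real build would point this at a "choose" function that
 * picks the PMULL or plain CRC implementation on first use. Here it simply
 * points at the stand-in.
 */
static uint32_t (*crc32c_ptr) (uint32_t, const void *, size_t) = crc32c_direct;

#if defined(__GNUC__)
/* Small compile-time-constant lengths skip the pointer indirection. */
#define CRC32C_COMP(crc, data, len) \
	((__builtin_constant_p(len) && (len) < SMALL_CONST_LEN) ? \
	 crc32c_direct((crc), (data), (len)) : \
	 crc32c_ptr((crc), (data), (len)))
#else
#define CRC32C_COMP(crc, data, len) crc32c_ptr((crc), (data), (len))
#endif
```

With this shape, a caller passing a small literal length gets a direct call that the compiler can inline, while variable-length callers pay one indirect call and can still reach the vectorized path.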
0002: Like 3c6e8c12389 and in fact uses the same program to generate the code, by specifying Neon instructions with the Arm "crypto" extension instead. There are some interesting differences from x86 here as well:
- The upstream implementation chose to use inline assembly instead of intrinsics for some reason. I initially thought that was a way to get broader compiler support, but it turns out you still need to pass the relevant flags to get the assembly to link.
Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
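To make the question concrete, here is a rough sketch of the two forms I have in mind. The helper names (`clmul_lo_ref`, `clmul_lo`, `clmul_lo_eor_intrin`, `clmul_lo_eor_asm`) are mine, not taken from the patch, and the asm variant is only modeled on the pmull + eor pairing discussed in this thread. The Arm-specific part is guarded so the file still compiles without the crypto flags, falling back to a bitwise reference.

```c
#include <stdint.h>

/* Portable reference: low 64 bits of the carryless product of a and b. */
static inline uint64_t
clmul_lo_ref(uint64_t a, uint64_t b)
{
	uint64_t	r = 0;

	for (int i = 0; i < 64; i++)
		if ((b >> i) & 1)
			r ^= a << i;
	return r;
}

#if defined(__aarch64__) && \
	(defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>
#define HAVE_PMULL_SKETCH 1

/* Intrinsic form: PMULL on the low lanes, then EOR with c. */
static inline uint64x2_t
clmul_lo_eor_intrin(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
	poly128_t	p = vmull_p64((poly64_t) vgetq_lane_u64(a, 0),
							  (poly64_t) vgetq_lane_u64(b, 0));

	return veorq_u64(vreinterpretq_u64_p128(p), c);
}

/*
 * Inline-asm form: same operation, but the two instructions are guaranteed
 * to stay adjacent in the instruction stream.
 */
static inline uint64x2_t
clmul_lo_eor_asm(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
	uint64x2_t	r;

	__asm__("pmull %0.1q, %2.1d, %3.1d\n\t"
			"eor %0.16b, %0.16b, %1.16b"
			: "=w"(r), "+w"(c)
			: "w"(a), "w"(b));
	return r;
}
#endif

/* Scalar wrapper: uses the intrinsic when available, else the reference. */
static inline uint64_t
clmul_lo(uint64_t a, uint64_t b)
{
#ifdef HAVE_PMULL_SKETCH
	poly128_t	p = vmull_p64((poly64_t) a, (poly64_t) b);

	return vgetq_lane_u64(vreinterpretq_u64_p128(p), 0);
#else
	return clmul_lo_ref(a, b);
#endif
}
```

In the intrinsic form the compiler is free to schedule other instructions between the multiply and the XOR, which is the only behavioral difference I can see between the two.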
To follow up, for curiosity's sake: [1] says that Apple chips can issue PMULL + EOR as a single uop if they are next to each other in the instruction stream.
- I only have Meson support for now, since I used MacOS on CI to test. That OS and compiler combination apparently targets the CRC extension, but the PMULL instruction runtime check uses Linux-only headers, I believe, so previously I hacked the choose function to return true for testing. The choose function in 0002 is untested in this form.
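For context on the Linux-only headers, a runtime check along these lines is the usual approach on Arm Linux. This is a sketch under assumptions, not the choose function from 0002: `pmull_available` is a hypothetical name, and on any platform other than AArch64 Linux it conservatively reports false, which is the situation that makes a testing hack necessary on macOS.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>			/* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>			/* HWCAP_PMULL */
#endif

/*
 * Report whether the PMULL instruction can be used at runtime.
 * Hypothetical helper; only the AArch64 Linux branch performs a real check.
 */
static bool
pmull_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
	return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
	/* No portable way to ask here, so assume the instruction is absent. */
	return false;
#endif
}
```

A choose function would call this once and store the selected implementation in the function pointer used for later calls.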
This is still true, but now the CI hack lives in a separate not-for-commit patch for clarity.
autoconf support is a WIP, and I will share that after I do some testing on an Arm Linux instance.
--
John Naylor
Amazon Web Services
<v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch>
<v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch>