On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <tristan.yim@gmail.com> wrote:
>
> Hi John
>
> Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.
> Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
>
> Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
I answered that in the email you replied to, re-quoted here:
> To follow up for curiosity's sake, [1] says that Apple chips can issue
> PMULL + EOR as a single uop if they are next to each other in the
> instruction stream.
>
> [1] https://dougallj.github.io/applecpu/firestorm.html
I don't know if that's relevant for current server hardware, so it could be pointless. I'm personally not a fan of inline assembly, but I also didn't yet want to put in the effort to alter generated code. I don't think it would be very hard to do, however.
Thanks, that makes sense as an explanation for why the inline asm is there today. But it also sounds like this is more of a temporary implementation choice than a conclusion that intrinsics are unsuitable. If so, I wonder whether it would be better to treat an intrinsics-based version as the preferred end state unless benchmarks show a clear regression.
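To make the comparison concrete, here is a rough sketch of what intrinsics-based pmull / pmull2 + eor helpers might look like. It is not taken from the patch: the names clmul_lo_xor, clmul_hi_xor, clmul64_ref, u128 and HAVE_PMULL are hypothetical, and the portable fallback is there only so the sketch builds and can be checked on machines without the PMULL extension. With intrinsics the compiler is free to schedule the multiply and the XOR apart, so the adjacency described in [1] is not guaranteed; that would have to be checked in the generated code.

```c
#include <stdint.h>

/* Assumption: PMULL availability is detected via the ACLE feature macros. */
#if defined(__aarch64__) && \
	(defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>
#define HAVE_PMULL 1
#endif

/* Hypothetical 128-bit result type: lo = bits 0..63, hi = bits 64..127. */
typedef struct
{
	uint64_t	lo;
	uint64_t	hi;
} u128;

/* Portable bit-by-bit carry-less multiply, used as fallback and reference. */
static u128
clmul64_ref(uint64_t a, uint64_t b)
{
	u128		r = {0, 0};

	for (int i = 0; i < 64; i++)
	{
		if ((b >> i) & 1)
		{
			r.lo ^= a << i;
			if (i > 0)			/* avoid an undefined shift by 64 */
				r.hi ^= a >> (64 - i);
		}
	}
	return r;
}

/* Low halves: clmul(a[0], b[0]) XOR acc, i.e. pmull followed by eor. */
static u128
clmul_lo_xor(const uint64_t a[2], const uint64_t b[2], u128 acc)
{
	u128		r;
#ifdef HAVE_PMULL
	uint64_t	accv[2] = {acc.lo, acc.hi};
	uint64_t	out[2];
	uint64x2_t	va = vld1q_u64(a);
	uint64x2_t	vb = vld1q_u64(b);
	poly128_t	p = vmull_p64((poly64_t) vgetq_lane_u64(va, 0),
							  (poly64_t) vgetq_lane_u64(vb, 0));

	vst1q_u64(out, veorq_u64(vreinterpretq_u64_p128(p), vld1q_u64(accv)));
	r.lo = out[0];
	r.hi = out[1];
#else
	r = clmul64_ref(a[0], b[0]);
	r.lo ^= acc.lo;
	r.hi ^= acc.hi;
#endif
	return r;
}

/* High halves: clmul(a[1], b[1]) XOR acc, i.e. pmull2 followed by eor. */
static u128
clmul_hi_xor(const uint64_t a[2], const uint64_t b[2], u128 acc)
{
	u128		r;
#ifdef HAVE_PMULL
	uint64_t	accv[2] = {acc.lo, acc.hi};
	uint64_t	out[2];
	poly128_t	p = vmull_high_p64(vreinterpretq_p64_u64(vld1q_u64(a)),
								   vreinterpretq_p64_u64(vld1q_u64(b)));

	vst1q_u64(out, veorq_u64(vreinterpretq_u64_p128(p), vld1q_u64(accv)));
	r.lo = out[0];
	r.hi = out[1];
#else
	r = clmul64_ref(a[1], b[1]);
	r.lo ^= acc.lo;
	r.hi ^= acc.hi;
#endif
	return r;
}
```

Benchmarking something like this against the inline asm helpers, and inspecting the disassembly to see whether pmull and eor end up adjacent, would show whether intrinsics cost anything in practice.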