Looks good, the code is more readable now.
> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.
There was no regression on Graviton 3 when using the 4-register version, so we can keep it.
-Chiranmoy