On Wed, Mar 26, 2025 at 04:44:24PM -0500, Nathan Bossart wrote:
> IMHO these are acceptable results, at least for the use-cases I see in the
> tree. We might be able to minimize the difference between the Neon and SVE
> implementations on the low end with some additional code, but I'm really
> not sure if it's worth the effort.

I couldn't resist... I tried a variety of things (e.g., inlining the Neon
implementation to process the tail, jumping to the Neon implementation for
smaller inputs), and the only thing that seemed to be a clear win was to
add a 2-register block in the SVE implementations (like what is already
there for the Neon ones). In particular, that helps bring the Graviton3
SVE numbers closer to the Neon numbers for inputs of 8 to 16 8-byte
words.
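
For anyone following along, the loop shape being described is roughly the
following. This is a portable sketch, not the code in the patches: plain
scalar C stands in for the vector intrinsics, the 4-register main loop is an
assumption, and WORDS_PER_REG, popcount_word, popcount_reg, and
popcount_blocked are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 128-bit vector register: two 8-byte words. */
#define WORDS_PER_REG 2

/* Portable popcount of one 8-byte word (clears the lowest set bit each pass). */
static uint64_t
popcount_word(uint64_t w)
{
    uint64_t    n = 0;

    while (w != 0)
    {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Count the bits in one "register" worth of words. */
static uint64_t
popcount_reg(const uint64_t *p)
{
    uint64_t    n = 0;

    for (int i = 0; i < WORDS_PER_REG; i++)
        n += popcount_word(p[i]);
    return n;
}

/* Count the set bits in nwords 8-byte words, largest blocks first. */
static uint64_t
popcount_blocked(const uint64_t *buf, size_t nwords)
{
    uint64_t    accum0 = 0, accum1 = 0, accum2 = 0, accum3 = 0;
    size_t      i = 0;

    assert(buf != NULL || nwords == 0);

    /* Main block: four registers per iteration, one accumulator each. */
    for (; nwords - i >= 4 * WORDS_PER_REG; i += 4 * WORDS_PER_REG)
    {
        accum0 += popcount_reg(buf + i);
        accum1 += popcount_reg(buf + i + WORDS_PER_REG);
        accum2 += popcount_reg(buf + i + 2 * WORDS_PER_REG);
        accum3 += popcount_reg(buf + i + 3 * WORDS_PER_REG);
    }

    /* 2-register block: runs at most once, for mid-sized leftovers. */
    if (nwords - i >= 2 * WORDS_PER_REG)
    {
        accum0 += popcount_reg(buf + i);
        accum1 += popcount_reg(buf + i + WORDS_PER_REG);
        i += 2 * WORDS_PER_REG;
    }

    /* Tail: every remaining word up to nwords, so nothing is skipped. */
    for (; i < nwords; i++)
        accum0 += popcount_word(buf[i]);

    return accum0 + accum1 + accum2 + accum3;
}
```

With these stand-in sizes, a 13-word input takes one pass through the main
block (8 words), one pass through the 2-register block (4 words), and one
word in the tail. Without the middle block, those 5 leftover words would all
go through the word-at-a-time tail.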

I also noticed a silly mistake in 0003 that could cause us to skip part of
the tail. That should be fixed now.
--
nathan