I wrote:
> In the end, I want to add a length check so
> that inputs smaller than 80 bytes go straight to the scalar path.
> Above 80, after alignment adjustments in the preamble, that still
> guarantees at least one loop iteration in the vector path.
Attached is how that would look. The idea is that small inputs will
encounter fewer branches. It would be hard to demonstrate a difference
with a benchmark, so I see this mainly as a risk-avoidance maneuver: it
makes the small-input path more similar to PG 18.
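
For readers without the attachment, here is a minimal sketch of the
dispatch described above. It is not the attached patch. The 80-byte
threshold is from the message; the 16-byte alignment, the 64-byte block
size, the byte-sum workload, and all function names are assumptions
chosen only for illustration. Under those assumptions the preamble
consumes at most 15 bytes, so an 80-byte input still has at least 65
bytes left, enough for one 64-byte iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_MIN_LEN 80		/* threshold from the message */
#define VEC_ALIGN   16		/* assumed alignment for the vector loop */
#define VEC_BLOCK   64		/* assumed bytes consumed per loop iteration */

/* Instrumentation only: counts main-loop iterations in the vector path. */
static int	vec_iterations;

/* Scalar path: a plain byte sum stands in for the real computation. */
static uint32_t
sum_scalar(uint32_t acc, const unsigned char *p, size_t len)
{
	while (len-- > 0)
		acc += *p++;
	return acc;
}

/* "Vector" path: preamble to reach alignment, block loop, then tail. */
static uint32_t
sum_vector(uint32_t acc, const unsigned char *p, size_t len)
{
	size_t		misalign = (uintptr_t) p % VEC_ALIGN;

	/* Preamble: consume up to VEC_ALIGN - 1 bytes to align the pointer. */
	if (misalign != 0)
	{
		size_t		pre = VEC_ALIGN - misalign;

		acc = sum_scalar(acc, p, pre);
		p += pre;
		len -= pre;
	}

	/* Main loop: scalar code stands in for the SIMD instructions. */
	while (len >= VEC_BLOCK)
	{
		acc = sum_scalar(acc, p, VEC_BLOCK);
		p += VEC_BLOCK;
		len -= VEC_BLOCK;
		vec_iterations++;
	}

	/* Tail: whatever is left after the last full block. */
	return sum_scalar(acc, p, len);
}

/* Entry point: small inputs take one branch and go straight to scalar. */
uint32_t
sum_bytes(const unsigned char *p, size_t len)
{
	if (len < VEC_MIN_LEN)
		return sum_scalar(0, p, len);
	return sum_vector(0, p, len);
}
```

With this shape, a short input pays for a single length comparison and
never reaches the alignment or loop-condition branches.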
--
John Naylor
Amazon Web Services