On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
> I'm much happier about v5-0001. With a small tweak it would match what
> I had in mind:
>
> +    if (nelem < nelem_per_iteration)
> +        goto one_by_one;
>
> If this were "<=" then for long arrays we could assume there is
> always more than one block, and wouldn't need to check if any elements
> remain -- first block, then a single loop and it's done.
>
> The loop could also then be a "do while" since it doesn't have to
> check the exit condition up front.

Good idea. That causes us to re-check all of the tail elements when the
number of elements is evenly divisible by nelem_per_iteration, but that
might be worth the trade-off.

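
For anyone following along, here is a rough, scalar-only sketch of the
control flow being discussed. It is not the patch: lfind32_sketch,
block_contains, blocks_checked, and the block size of 16 are made up for
illustration, and block_contains merely stands in for the multi-register
SIMD comparison. It assumes the "<=" check up front, a do-while over whole
blocks, and one final overlapping block at the tail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed block size: 4 registers x 4 uint32 lanes (SSE2-like). */
#define NELEM_PER_ITERATION 16

/* Hypothetical counter so a caller can see how many blocks were checked. */
static size_t blocks_checked = 0;

/* Scalar stand-in for one multi-register SIMD comparison of a full block. */
static bool
block_contains(uint32_t key, const uint32_t *block)
{
	bool		found = false;

	blocks_checked++;
	for (size_t j = 0; j < NELEM_PER_ITERATION; j++)
		found |= (block[j] == key);
	return found;
}

static bool
lfind32_sketch(uint32_t key, const uint32_t *base, size_t nelem)
{
	size_t		i = 0;
	size_t		tail_idx;

	/* Short arrays (including exactly one block) go one by one. */
	if (nelem <= NELEM_PER_ITERATION)
	{
		for (; i < nelem; i++)
		{
			if (base[i] == key)
				return true;
		}
		return false;
	}

	/* Round nelem down to a multiple of the (power-of-two) block size. */
	tail_idx = nelem & ~((size_t) NELEM_PER_ITERATION - 1);

	/*
	 * nelem > NELEM_PER_ITERATION guarantees at least one whole block, so
	 * the exit condition does not need to be checked up front.
	 */
	do
	{
		if (block_contains(key, &base[i]))
			return true;
		i += NELEM_PER_ITERATION;
	} while (i < tail_idx);

	/*
	 * Final block ends exactly at the end of the array.  It overlaps
	 * elements already checked; when nelem is a multiple of the block size,
	 * every element in it is a re-check.
	 */
	return block_contains(key, &base[nelem - NELEM_PER_ITERATION]);
}
```

With 32 elements and a block size of 16, a failed search in this sketch
checks three blocks: two from the loop and one tail block that re-checks
the last 16 elements, which is the trade-off mentioned above.
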
> Yes, that spike is weird, because it seems super-linear. However, the
> more interesting question for me is: AVX2 isn't really buying much for
> the numbers covered in this test. Between 32 and 48 elements, and
> between 64 and 80, it's indistinguishable from SSE2. The jumps to the
> next shelf are postponed, but the jumps are just as high. From earlier
> system benchmarks, I recall it eventually wins out with hundreds of
> elements, right? Is that still true?

It does still eventually win, although not nearly to the same extent as
before. I extended the benchmark a bit to show this. I wouldn't be
devastated if we only got 0001 committed for v17, given these results.

> Further, now that the algorithm is more SIMD-appropriate, I wonder
> what doing 4 registers at a time is actually buying us for either SSE2
> or AVX2. It might just be a matter of scale, but that would be good to
> understand.

I'll follow up with these numbers shortly.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com