On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I tried to trim some of the branches, and came up with the attached patch.
>> I don't think this is exactly what you were suggesting, but I think it's
>> relatively close. My testing showed decent benefits from using 2 vectors
>> when there aren't enough elements for 4, so I've tried to keep that part
>> intact.
>
> I would caution against that if the benchmark is repeatedly running
> against a static number of elements, because the branch predictor will
> be right all the time (except maybe when it exits a loop, not sure).
> We probably don't need to go to the trouble to construct a benchmark
> with some added randomness, but we have to be careful not to overfit what
> the test is actually measuring.

I don't mind removing the 2-register stuff if that's what you think we
should do. I'm cautiously optimistic that it'd help more than the extra
branch prediction might hurt, and it'd at least help avoid regressing the
lower end for the larger AVX2 registers, but I probably won't be able to
prove that without constructing another benchmark. And TBH I'm not sure
it'll significantly impact any real-world workload, anyway.
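
For reference, a benchmark "with some added randomness" could look roughly
like the sketch below. This is not code from the patch: search_u32() is a
hypothetical scalar stand-in for the function under test, and next_rand()
and run_searches() are made-up helper names. The idea is only that the
element count changes on every call, so the branches that choose between
code paths are not trivially predictable. Timing is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the function under test: plain linear search. */
static bool
search_u32(uint32_t key, const uint32_t *base, size_t nelem)
{
	for (size_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}

/* Small xorshift PRNG so the sequence of lengths is reproducible. */
static uint32_t
next_rand(uint32_t *state)
{
	uint32_t	x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return x;
}

/*
 * Run "iters" searches over prefixes of "base", with each prefix length
 * drawn pseudo-randomly from [min_len, max_len].  Passing min_len == max_len
 * reproduces the static-length case.  The key is UINT32_MAX, which the
 * caller is assumed to keep out of the array, so every search scans the
 * whole prefix.  The hit count is returned so the loop is not optimized away.
 */
static uint64_t
run_searches(const uint32_t *base, size_t min_len, size_t max_len,
			 uint64_t iters, uint32_t seed)
{
	uint32_t	state = seed ? seed : 1;	/* xorshift state must be nonzero */
	size_t		span = max_len - min_len + 1;
	uint64_t	hits = 0;

	for (uint64_t i = 0; i < iters; i++)
	{
		size_t		nelem = min_len + next_rand(&state) % span;

		hits += search_u32(UINT32_MAX, base, nelem);
	}
	return hits;
}
```

Timing run_searches(arr, 8, 8, ...) against run_searches(arr, 1, 64, ...)
would give a rough idea of how much the static-length numbers depend on the
branch predictor, with the caveat that the second run also does a different
amount of work per call, so the two are not directly comparable.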
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com