On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
>> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
>> <nathandbossart@gmail.com> wrote:
>>> I tried to trim some of the branches, and came up with the attached patch.
>>> I don't think this is exactly what you were suggesting, but I think it's
>>> relatively close. My testing showed decent benefits from using 2 vectors
>>> when there aren't enough elements for 4, so I've tried to keep that part
>>> intact.
>>
>> I would caution against that if the benchmark is repeatedly running
>> against a static number of elements, because the branch predictor will
>> be right all the time (except maybe when it exits a loop, not sure).
>> We probably don't need to go to the trouble to construct a benchmark
>> with some added randomness, but we have be careful not to overfit what
>> the test is actually measuring.
>
> I don't mind removing the 2-register stuff if that's what you think we
> should do. I'm cautiously optimistic that it'd help more than the extra
> branch prediction might hurt, and it'd at least help avoid regressing the
> lower end for the larger AVX2 registers, but I probably won't be able to
> prove that without constructing another benchmark. And TBH I'm not sure
> it'll significantly impact any real-world workload, anyway.
Here's a new version of the patch set with the 2-register stuff removed,
plus a fresh run of the benchmark. The weird spike for AVX2 is what led me
down the 2-register path earlier.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com