On Thu, Mar 28, 2024 at 10:03:04PM +0000, Amonson, Paul D wrote:
>> * I think we need to verify there isn't a huge performance regression for
>> smaller arrays. IIUC those will still require an AVX512 instruction or
>> two as well as a function call, which might add some noticeable overhead.
>
> Not considering your changes, I had already tested small buffers. At less
> than 512 bytes there was no measurable regression (there was one extra
> condition check) and for 512+ bytes it moved from no regression to some
> gains between 512 and 4096 bytes. Assuming you introduced no extra
> function calls, it should be the same.

Cool. I think we should run the benchmarks again to be safe, though.

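A rough harness for the small-buffer case might look like the following.
This is only a sketch: popcount_buf() is a stand-in I made up for whatever
function is actually under test, ns_per_call() is a hypothetical helper, and
the builtins and the asm barrier assume GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Stand-in for the routine being measured; swap in the real one. */
static uint64_t
popcount_buf(const char *buf, size_t len)
{
    uint64_t    cnt = 0;

    /* Process 8 bytes at a time, then finish the tail bytewise. */
    while (len >= 8)
    {
        uint64_t    w;

        memcpy(&w, buf, 8);
        cnt += (uint64_t) __builtin_popcountll(w);
        buf += 8;
        len -= 8;
    }
    while (len-- > 0)
        cnt += (uint64_t) __builtin_popcount((unsigned char) *buf++);
    return cnt;
}

/* Average CPU time per call, in nanoseconds, for a buffer of len bytes. */
static double
ns_per_call(size_t len, int iters)
{
    static char buf[8192];      /* len must not exceed this size */
    volatile uint64_t sink = 0;
    clock_t     t0, t1;

    memset(buf, 0x5a, sizeof(buf));
    t0 = clock();
    for (int i = 0; i < iters; i++)
    {
        /* Barrier so the compiler cannot hoist the call out of the loop. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        sink += popcount_buf(buf, len);
    }
    t1 = clock();
    (void) sink;
    return (double) (t1 - t0) / CLOCKS_PER_SEC * 1e9 / iters;
}
```

Calling ns_per_call() for a handful of sizes on either side of 512 bytes
(say 8, 64, 256, 512, 1024, 4096), with and without the patch, should show
whether the extra check and function call are visible for small inputs.
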
>> I forgot to mention that I also want to understand whether we can
>> actually assume availability of XGETBV when CPUID says we support
>> AVX512:
>
> You cannot assume as there are edge cases where AVX-512 was found on
> system one during compile but it's not actually available in a kernel on
> a second system at runtime despite the CPU actually having the hardware
> feature.

Yeah, I understand that much, but I want to know how portable the XGETBV
instruction is. Unless I can assume that all x86_64 systems and compilers
support that instruction, we might need an additional configure check
and/or CPUID check. It looks like MSVC has had support for the _xgetbv
intrinsic for quite a while, but I'm still researching the other cases.
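
To make the question concrete, here is a sketch of the kind of runtime check
I have in mind. It is not the patch's code: the function names are made up,
it assumes GCC or Clang's <cpuid.h> (MSVC would use __cpuidex and _xgetbv
instead), and XGETBV is emitted as raw bytes so that it does not depend on
assembler or -mxsave support. The idea is to consult CPUID's OSXSAVE bit
before executing XGETBV, and only then check that the OS saves ZMM state.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* CPUID leaf 1, ECX bit 27: the OS has enabled XSAVE, so XGETBV is usable. */
static bool
cpu_has_osxsave(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 27)) != 0;
}

/* Read XCR0.  Only call this after cpu_has_osxsave() returned true. */
static uint64_t
read_xcr0(void)
{
    uint32_t    lo, hi;

    /* 0f 01 d0 is the encoding of XGETBV; ECX = 0 selects XCR0. */
    __asm__ volatile(".byte 0x0f, 0x01, 0xd0" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t) hi << 32) | lo;
}

/* CPUID leaf 7, subleaf 0, EBX bit 16: AVX-512 Foundation. */
static bool
cpu_has_avx512f(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx & (1u << 16)) != 0;
}

static bool
avx512_usable(void)
{
    /* XCR0 bits 1,2 (SSE, AVX) and 5,6,7 (opmask, ZMM_Hi256, Hi16_ZMM). */
    const uint64_t need = 0xe6;

    if (!cpu_has_osxsave())
        return false;
    if ((read_xcr0() & need) != need)
        return false;
    return cpu_has_avx512f();
}
#else
/* Non-x86 builds: nothing to probe. */
static bool cpu_has_osxsave(void) { return false; }
static bool avx512_usable(void) { return false; }
#endif
```

If we can't rely on <cpuid.h> or inline assembly everywhere, that is where
the additional configure check would come in.
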
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com