On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> I haven't looked at the patches, but the graphs look good.

I spent some more time on these patches. Specifically, I reordered them to
demonstrate the effects on systems without AVX2 support. I've also added a
shortcut to jump to the one-by-one approach when there aren't many
elements, as the overhead becomes quite noticeable otherwise. Finally, I
ran the same benchmarks again on x86 and Arm out to 128 elements.
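
For anyone skimming the thread, the shortcut amounts to roughly the
following. This is only an illustrative sketch, not the code in the patches:
the names (sketch_lfind32, sketch_one_by_one), the lane count, and the
threshold are all made up for the example, and a plain inner loop stands in
for the vector comparisons so that it compiles anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes, chosen only for illustration. */
#define SKETCH_LANES 4						/* uint32 elements per vector */
#define SKETCH_BLOCK (4 * SKETCH_LANES)		/* elements per unrolled iteration */

/* Plain one-by-one search over nelem elements. */
static bool
sketch_one_by_one(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}
	return false;
}

/*
 * Block-at-a-time search with a shortcut for small inputs.  The inner loop
 * is a stand-in for the real vector load/compare instructions.
 */
bool
sketch_lfind32(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i = 0;
	uint32_t	end;

	/* Shortcut: too few elements to amortize the block setup overhead. */
	if (nelem < SKETCH_BLOCK)
		return sketch_one_by_one(key, base, nelem);

	/* Main loop covers full blocks only. */
	end = nelem - (nelem % SKETCH_BLOCK);
	for (; i < end; i += SKETCH_BLOCK)
	{
		bool		found = false;

		for (uint32_t j = 0; j < SKETCH_BLOCK; j++)
			found |= (base[i + j] == key);
		if (found)
			return true;
	}

	/* Whatever is left is shorter than one block. */
	return sketch_one_by_one(key, base + i, nelem - i);
}
```

The point of the early return is that small arrays never pay for the block
setup at all. The threshold used here (one full block) is an assumption made
for the sketch.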

Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
if it's possible to improve the style a bit. 0003 at least needs a big
comment in simd.h, and it might need a note in the documentation, too. If
the approach in this patch set seems reasonable, I'll spend some time on
that.

BTW, I did try to add some other optimizations, such as processing remaining
elements with only one vector and trying to use the overlapping strategy
with more registers if we know there are relatively many remaining
elements. These other approaches all added a lot of complexity and began
hurting performance, and I've probably already spent way too much time
optimizing a linear search, so this is where I've decided to stop.
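
To make "overlapping" concrete, here is a rough sketch of the single-vector
variant. It assumes the strategy means re-checking the last vector-width
window that ends at the final element, so the tail needs no scalar loop. The
names (sketch_vector_has_key, sketch_lfind32_overlap) and the lane count are
hypothetical, and a plain loop again stands in for the real vector
instructions.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_LANES 4			/* hypothetical: uint32 elements per vector */

/* Stand-in for one vector load plus compare over SKETCH_LANES elements. */
static bool
sketch_vector_has_key(uint32_t key, const uint32_t *p)
{
	bool		found = false;

	for (uint32_t j = 0; j < SKETCH_LANES; j++)
		found |= (p[j] == key);
	return found;
}

bool
sketch_lfind32_overlap(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i;
	uint32_t	end;

	/* Fewer elements than one vector: nothing to overlap with. */
	if (nelem < SKETCH_LANES)
	{
		for (i = 0; i < nelem; i++)
		{
			if (base[i] == key)
				return true;
		}
		return false;
	}

	/* Full vectors, front to back. */
	end = nelem - (nelem % SKETCH_LANES);
	for (i = 0; i < end; i += SKETCH_LANES)
	{
		if (sketch_vector_has_key(key, base + i))
			return true;
	}

	/*
	 * Tail: check the final SKETCH_LANES elements in one go.  Some of them
	 * were already examined above, which is harmless for a membership test.
	 */
	if (i < nelem)
		return sketch_vector_has_key(key, base + nelem - SKETCH_LANES);

	return false;
}
```

Re-checking a few elements is safe here only because the search returns a
yes/no answer. The sketch would need more care if it had to report an index
or a count.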
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com