Re: add AVX2 support to simd.h - Mailing list pgsql-hackers

From John Naylor
Subject Re: add AVX2 support to simd.h
Date
Msg-id CANWCAZZgOS+f2W8+bFwHjYYT-6Fx4yP6dFpUy4oLfbXSgfTKLw@mail.gmail.com
Whole thread Raw
In response to add AVX2 support to simd.h  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: add AVX2 support to simd.h
List pgsql-hackers
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

> Using the same benchmark as we did for the SSE2 linear searches in
> XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

I've been antagonistic towards the patch itself, but it'd be more
productive if I paid some nuanced attention to the problem it's trying
to solve. First, I'd like to understand the benchmark a bit better.

>   writers    sse2    avx2     %
>       256    1195    1188    -1
>       512     928    1054   +14
>      1024     633     716   +13
>      2048     332     420   +27
>      4096     162     203   +25
>      8192     162     182   +12

There doesn't seem to be any benefit at 256 at all. Is that expected
and/or fine?

> It's been a while since I ran these benchmarks, but I vaguely recall also
> seeing something like a 50% improvement for a dedicated pg_lfind32()
> benchmark on long arrays.

The latest I see in
https://www.postgresql.org/message-id/20220808223254.GA1393216%40nathanxps13

writers  head  patch
8        672   680
16       639   664
32       701   689
64       705   703
128      628   653
256      576   627
512      530   584
768      450   536
1024     350   494

Here, the peak throughput seems to be around 64 writers with or
without the patch from a couple years ago, but the slope is shallower
after that. It would be good to make sure that it can't regress near
the peak, even with a "long tail" case (see next paragraph). The first
benchmark above starts at 256, so we can't tell where the peak is. It
might be worth it to also have a microbenchmark because the systemic
one has enough noise to obscure what's going on unless there are a
very large number of writers. We know what a systemic benchmark can
tell us on extreme workloads past the peak, and the microbenchmark
would tell us "we need to see X improvement here in order to see Y
improvement in the system benchmark".

I suspect that there could be a regression lurking for some inputs
that the benchmark doesn't look at: pg_lfind32() currently needs to be
able to read 4 vector registers worth of elements before taking the
fast path. There is then a tail of up to 15 elements that are now
checked one-by-one, but AVX2 would increase that to 31. That's getting
big enough to be noticeable, I suspect. It would be good to understand
that case (n*32 + 31), because it may also be relevant now. It's also
easy to improve for SSE2/NEON for v17.

Also, by reading 4 registers per loop iteration, that's 128 bytes on
AVX2. I'm not sure that matters, but we shouldn't assume it doesn't.
Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1,
2, or 4 registers to handle it, depending on architecture. Whether or
not that's worth it in this case, this patch does mean future patches
will have to wonder if they have to do anything differently depending
on vector length, whereas now they don't. That's not a deal-breaker,
but it is a trade-off to keep in mind.



pgsql-hackers by date:

Previous
From: Jeevan Chalke
Date:
Subject: Re: More new SQL/JSON item methods
Next
From: "Hayato Kuroda (Fujitsu)"
Date:
Subject: RE: speed up a logical replica setup