Re: add AVX2 support to simd.h - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: add AVX2 support to simd.h |
Date | |
Msg-id | CANWCAZZgOS+f2W8+bFwHjYYT-6Fx4yP6dFpUy4oLfbXSgfTKLw@mail.gmail.com Whole thread Raw |
In response to | add AVX2 support to simd.h (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: add AVX2 support to simd.h
|
List | pgsql-hackers |
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Using the same benchmark as we did for the SSE2 linear searches in > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following: I've been antagonistic towards the patch itself, but it'd be more productive if I paid some nuanced attention to the problem it's trying to solve. First, I'd like to understand the benchmark a bit better. > writers sse2 avx2 % > 256 1195 1188 -1 > 512 928 1054 +14 > 1024 633 716 +13 > 2048 332 420 +27 > 4096 162 203 +25 > 8192 162 182 +12 There doesn't seem to be any benefit at 256 at all. Is that expected and/or fine? > It's been a while since I ran these benchmarks, but I vaguely recall also > seeing something like a 50% improvement for a dedicated pg_lfind32() > benchmark on long arrays. The latest I see in https://www.postgresql.org/message-id/20220808223254.GA1393216%40nathanxps13 writers head patch 8 672 680 16 639 664 32 701 689 64 705 703 128 628 653 256 576 627 512 530 584 768 450 536 1024 350 494 Here, the peak throughput seems to be around 64 writers with or without the patch from a couple years ago, but the slope is shallower after that. It would be good to make sure that it can't regress near the peak, even with a "long tail" case (see next paragraph). The first benchmark above starts at 256, so we can't tell where the peak is. It might be worth it to also have a microbenchmark because the systemic one has enough noise to obscure what's going on unless there are a very large number of writers. We know what a systemic benchmark can tell us on extreme workloads past the peak, and the microbenchmark would tell us "we need to see X improvement here in order to see Y improvement in the system benchmark". I suspect that there could be a regression lurking for some inputs that the benchmark doesn't look at: pg_lfind32() currently needs to be able to read 4 vector registers worth of elements before taking the fast path. There is then a tail of up to 15 elements that are now checked one-by-one, but AVX2 would increase that to 31. That's getting big enough to be noticeable, I suspect. It would be good to understand that case (n*32 + 31), because it may also be relevant now. It's also easy to improve for SSE2/NEON for v17. Also, by reading 4 registers per loop iteration, that's 128 bytes on AVX2. I'm not sure that matters, but we shouldn't assume it doesn't. Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1, 2, or 4 registers to handle it, depending on architecture. Whether or not that's worth it in this case, this patch does mean future patches will have to wonder if they have to do anything differently depending on vector length, whereas now they don't. That's not a deal-breaker, but it is a trade-off to keep in mind.
pgsql-hackers by date: