Thread: add AVX2 support to simd.h
On Wed, Nov 22, 2023 at 12:49:35PM -0600, Nathan Bossart wrote:
> On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
>> For reference, executing the page checksum 10M times on an AMD 3900X CPU:
>>
>> clang-14 -O2                  4.292s (17.8 GiB/s)
>> clang-14 -O2 -msse4.1         2.859s (26.7 GiB/s)
>> clang-14 -O2 -msse4.1 -mavx2  1.378s (55.4 GiB/s)
>
> Nice.  I've noticed similar improvements with AVX2 intrinsics in simd.h.

I've alluded to this a few times now, so I figured I'd park the patch and
preliminary benchmarks in a new thread while we iron out how to support
newer instructions (see discussion here [0]).

Using the same benchmark as we did for the SSE2 linear searches in
XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

  writers  sse2  avx2    %
      256  1195  1188   -1
      512   928  1054  +14
     1024   633   716  +13
     2048   332   420  +27
     4096   162   203  +25
     8192   162   182  +12

It's been a while since I ran these benchmarks, but I vaguely recall also
seeing something like a 50% improvement for a dedicated pg_lfind32()
benchmark on long arrays.

As is, the patch likely won't do anything unless you add -mavx2 or
-march=native to your CFLAGS.  I don't intend for this patch to be
seriously considered until we have better support for detecting/compiling
AVX2 instructions and a buildfarm machine that uses them.

I plan to start another thread for AVX2 support for the page checksums.

[0] https://postgr.es/m/20231107024734.GB729644%40nathanxps13
[1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
[2] https://postgr.es/m/20220713170950.GA3116318%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > I don't intend for this patch to be > seriously considered until we have better support for detecting/compiling > AVX2 instructions and a buildfarm machine that uses them. That's completely understandable, yet I'm confused why there is a commitfest entry for it marked "needs review".
On Mon, Jan 01, 2024 at 07:12:26PM +0700, John Naylor wrote: > On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: >> I don't intend for this patch to be >> seriously considered until we have better support for detecting/compiling >> AVX2 instructions and a buildfarm machine that uses them. > > That's completely understandable, yet I'm confused why there is a > commitfest entry for it marked "needs review". Perhaps I was too optimistic about adding support for newer instructions... I'm tempted to propose that we move forward with this patch as-is after adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3. There is likely still follow-up work to make these improvements more accessible, but I'm not sure that is a strict prerequisite here. (In case it isn't clear, I'm volunteering to set up such a buildfarm machine.) -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
> I'm tempted to propose that we move forward with this patch as-is after
> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.
> There is likely still follow-up work to make these improvements more
> accessible, but I'm not sure that is a strict prerequisite here.

The patch needs better comments (as in, more than "none whatsoever").
It doesn't need to be much though, perhaps like

+#if defined(__AVX2__)
+
+/*
+ * When compiled with -mavx2 or allied options, we prefer AVX2 instructions.
+ */
+#include <immintrin.h>
+#define USE_AVX2
+typedef __m256i Vector8;
+typedef __m256i Vector32;

Also, do you really want to structure the header so that USE_SSE2
doesn't get defined?  In that case you are committing to provide
an AVX2 replacement every single place that there's USE_SSE2, which
doesn't seem like a great thing to require.  OTOH, maybe there's
no choice given that we need a different definition for Vector8 and
Vector32?

			regards, tom lane
On Tue, Jan 02, 2024 at 12:50:04PM -0500, Tom Lane wrote:
> The patch needs better comments (as in, more than "none whatsoever").

Yes, will do.

> Also, do you really want to structure the header so that USE_SSE2
> doesn't get defined?  In that case you are committing to provide
> an AVX2 replacement every single place that there's USE_SSE2, which
> doesn't seem like a great thing to require.  OTOH, maybe there's
> no choice given that we need a different definition for Vector8 and
> Vector32?

Yeah, the precedent is to use these abstracted types elsewhere so that any
SIMD-related improvements aren't limited to one architecture.  There are a
couple of places that do explicitly check for USE_NO_SIMD, though.  Maybe
there's an eventual use-case for using SSE2 intrinsics even when you have
AVX2 support, but for now, ensuring we have an AVX2 replacement for
everything doesn't seem particularly burdensome.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Perhaps I was too optimistic about adding support for newer instructions... > > I'm tempted to propose that we move forward with this patch as-is after > adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3. That means that we would be on the hook to fix it if it breaks, even though nothing uses it yet in a normal build. I have pending patches that will break, or get broken by, this, so minus-many from me until there is an availability story.
On Wed, Jan 03, 2024 at 09:13:52PM +0700, John Naylor wrote: > On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> I'm tempted to propose that we move forward with this patch as-is after >> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3. > > That means that we would be on the hook to fix it if it breaks, even > though nothing uses it yet in a normal build. I have pending patches > that will break, or get broken by, this, so minus-many from me until > there is an availability story. How will this break your patches? Is it just a matter of adding more AVX2 support, or something else? If the requirement is that normal builds use AVX2, then I fear we will be waiting a long time. IIUC the current proposals (building multiple binaries or adding a configuration option that maps to compiler flags) would still be opt-in, and I'm not sure we can mandate AVX2 support for all x86_64 builds anytime soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Jan 02, 2024 at 10:11:23AM -0600, Nathan Bossart wrote: > (In case it isn't clear, I'm volunteering to set up such a buildfarm > machine.) I set up "akepa" to run with -march=x86-64-v3. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > If the requirement is that normal builds use AVX2, then I fear we will be > waiting a long time. IIUC the current proposals (building multiple > binaries or adding a configuration option that maps to compiler flags) > would still be opt-in, If and when we get one of those, I would consider that a "normal" build. Since there are no concrete proposals yet, I'm still waiting for you to justify imposing an immediate maintenance cost for zero benefit.
On Fri, Jan 05, 2024 at 09:03:39AM +0700, John Naylor wrote:
> On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> If the requirement is that normal builds use AVX2, then I fear we will be
>> waiting a long time.  IIUC the current proposals (building multiple
>> binaries or adding a configuration option that maps to compiler flags)
>> would still be opt-in,
>
> If and when we get one of those, I would consider that a "normal"
> build.  Since there are no concrete proposals yet, I'm still waiting
> for you to justify imposing an immediate maintenance cost for zero
> benefit.

I've been thinking about the configuration option approach.  ISTM that
would be the most feasible strategy, at least for v17.  A couple things
come to mind:

* This option would simply map to existing compiler flags.  We already
  have ways to provide those (-Dc_args in meson, CFLAGS in autoconf).
  Perhaps we'd want to provide our own shorthand for certain platforms
  (e.g., ARM), but that will still just be shorthand for compiler flags.

* Such an option would itself generate some maintenance cost.  That could
  be worth it because it formalizes the Postgres support for those
  options, but it's still one more thing to track.

Another related option could be to simply document that we have support
for some newer instructions that can be enabled by setting the
aforementioned compiler flags.  That's perhaps a little less
user-friendly, but it'd avoid the duplication and possibly reduce the
maintenance cost.  I also wonder if it'd help prevent confusion when
CFLAGS and this extra option conflict.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Using the same benchmark as we did for the SSE2 linear searches in
> XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

I've been antagonistic towards the patch itself, but it'd be more
productive if I paid some nuanced attention to the problem it's trying to
solve.  First, I'd like to understand the benchmark a bit better.

> writers  sse2  avx2    %
>     256  1195  1188   -1
>     512   928  1054  +14
>    1024   633   716  +13
>    2048   332   420  +27
>    4096   162   203  +25
>    8192   162   182  +12

There doesn't seem to be any benefit at 256 at all.  Is that expected
and/or fine?

> It's been a while since I ran these benchmarks, but I vaguely recall also
> seeing something like a 50% improvement for a dedicated pg_lfind32()
> benchmark on long arrays.

The latest I see in
https://www.postgresql.org/message-id/20220808223254.GA1393216%40nathanxps13

  writers  head  patch
        8   672    680
       16   639    664
       32   701    689
       64   705    703
      128   628    653
      256   576    627
      512   530    584
      768   450    536
     1024   350    494

Here, the peak throughput seems to be around 64 writers with or without
the patch from a couple years ago, but the slope is shallower after that.
It would be good to make sure that it can't regress near the peak, even
with a "long tail" case (see next paragraph).  The first benchmark above
starts at 256, so we can't tell where the peak is.  It might be worth it
to also have a microbenchmark because the systemic one has enough noise to
obscure what's going on unless there are a very large number of writers.
We know what a systemic benchmark can tell us on extreme workloads past
the peak, and the microbenchmark would tell us "we need to see X
improvement here in order to see Y improvement in the system benchmark".

I suspect that there could be a regression lurking for some inputs that
the benchmark doesn't look at: pg_lfind32() currently needs to be able to
read 4 vector registers worth of elements before taking the fast path.
There is then a tail of up to 15 elements that are now checked one-by-one,
but AVX2 would increase that to 31.  That's getting big enough to be
noticeable, I suspect.  It would be good to understand that case
(n*32 + 31), because it may also be relevant now.  It's also easy to
improve for SSE2/NEON for v17.

Also, by reading 4 registers per loop iteration, that's 128 bytes on AVX2.
I'm not sure that matters, but we shouldn't assume it doesn't.  Code I've
seen elsewhere reads a fixed 64-byte block, and then uses 1, 2, or 4
registers to handle it, depending on architecture.  Whether or not that's
worth it in this case, this patch does mean future patches will have to
wonder if they have to do anything differently depending on vector length,
whereas now they don't.  That's not a deal-breaker, but it is a trade-off
to keep in mind.
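To make the "fast path plus scalar tail" shape concrete, here is a rough
sketch of the kind of loop being discussed, written with raw SSE2
intrinsics rather than the Vector32 abstraction that pg_lfind.h actually
uses; the function name and constants are illustrative only.  With
16-byte registers the one-by-one tail is at most 15 elements; with
32-byte AVX2 registers it would be at most 31.

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the current structure: a fast path that consumes 4 registers'
 * worth of elements per iteration, followed by a one-by-one tail.
 */
static bool
lfind32_sketch(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	const uint32_t nelem_per_iteration = 16;	/* 4 registers x 4 uint32s */
	const __m128i keys = _mm_set1_epi32((int) key);
	uint32_t i = 0;

	for (; i + nelem_per_iteration <= nelem; i += nelem_per_iteration)
	{
		__m128i v0 = _mm_loadu_si128((const __m128i *) &base[i]);
		__m128i v1 = _mm_loadu_si128((const __m128i *) &base[i + 4]);
		__m128i v2 = _mm_loadu_si128((const __m128i *) &base[i + 8]);
		__m128i v3 = _mm_loadu_si128((const __m128i *) &base[i + 12]);
		__m128i cmp = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi32(keys, v0),
												_mm_cmpeq_epi32(keys, v1)),
								   _mm_or_si128(_mm_cmpeq_epi32(keys, v2),
												_mm_cmpeq_epi32(keys, v3)));

		if (_mm_movemask_epi8(cmp) != 0)
			return true;
	}

	/* tail of up to nelem_per_iteration - 1 elements, checked one by one */
	for (; i < nelem; i++)
	{
		if (base[i] == key)
			return true;
	}

	return false;
}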
On Sat, Jan 6, 2024 at 12:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've been thinking about the configuration option approach.  ISTM that
> would be the most feasible strategy, at least for v17.  A couple things
> come to mind:
>
> * This option would simply map to existing compiler flags.  We already have
> ways to provide those (-Dc_args in meson, CFLAGS in autoconf).  Perhaps
> we'd want to provide our own shorthand for certain platforms (e.g., ARM),
> but that will still just be shorthand for compiler flags.
>
> * Such an option would itself generate some maintenance cost.  That could
> be worth it because it formalizes the Postgres support for those options,
> but it's still one more thing to track.
>
> Another related option could be to simply document that we have support for
> some newer instructions that can be enabled by setting the aforementioned
> compiler flags.  That's perhaps a little less user-friendly, but it'd avoid
> the duplication and possibly reduce the maintenance cost.  I also wonder if
> it'd help prevent confusion when CFLAGS and this extra option conflict.

The last one might offer more graceful forward compatibility if the
multiple-binaries idea gets any traction some day, because at that point
the additional config options are not needed, I think.

Another consideration is which way would touch the fewest places to work
with Windows, which uses the spelling /arch:AVX2 etc.

One small thing I would hope for from the final version of this is the
ability to inline things where we currently indirect depending on a
run-time check.  That seems like "just work" on top of everything else,
and I don't think it makes a case for either of the above.
On Mon, Jan 08, 2024 at 02:01:39PM +0700, John Naylor wrote:
> On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> writers  sse2  avx2    %
>>     256  1195  1188   -1
>>     512   928  1054  +14
>>    1024   633   716  +13
>>    2048   332   420  +27
>>    4096   162   203  +25
>>    8192   162   182  +12
>
> There doesn't seem to be any benefit at 256 at all.  Is that expected
> and/or fine?

My unverified assumption is that the linear searches make up much less of
the benchmark at these lower client counts, so any improvements we make
are unlikely to show up here.  IIRC even the hash table approach that we
originally explored for XidInMVCCSnapshot() didn't do much, if anything,
for the benchmark at lower client counts.

> Here, the peak throughput seems to be around 64 writers with or
> without the patch from a couple years ago, but the slope is shallower
> after that.  It would be good to make sure that it can't regress near
> the peak, even with a "long tail" case (see next paragraph).  The first
> benchmark above starts at 256, so we can't tell where the peak is.  It
> might be worth it to also have a microbenchmark because the systemic
> one has enough noise to obscure what's going on unless there are a
> very large number of writers.  We know what a systemic benchmark can
> tell us on extreme workloads past the peak, and the microbenchmark
> would tell us "we need to see X improvement here in order to see Y
> improvement in the system benchmark".

Yes, will do.

> I suspect that there could be a regression lurking for some inputs
> that the benchmark doesn't look at: pg_lfind32() currently needs to be
> able to read 4 vector registers worth of elements before taking the
> fast path.  There is then a tail of up to 15 elements that are now
> checked one-by-one, but AVX2 would increase that to 31.  That's getting
> big enough to be noticeable, I suspect.  It would be good to understand
> that case (n*32 + 31), because it may also be relevant now.  It's also
> easy to improve for SSE2/NEON for v17.

Good idea.  If it is indeed noticeable, we might be able to "fix" it by
processing some of the tail with shorter vectors.  But that probably means
finding a way to support multiple vector sizes on the same build, which
would require some work.

> Also, by reading 4 registers per loop iteration, that's 128 bytes on
> AVX2.  I'm not sure that matters, but we shouldn't assume it doesn't.
> Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1,
> 2, or 4 registers to handle it, depending on architecture.  Whether or
> not that's worth it in this case, this patch does mean future patches
> will have to wonder if they have to do anything differently depending
> on vector length, whereas now they don't.  That's not a deal-breaker,
> but it is a trade-off to keep in mind.

Yeah.  Presently, this AVX2 patch just kicks the optimization down the
road a bit for the existing use-cases, so you don't start using the vector
registers until there's more data to work with, which might not even be
noticeable.  But it's conceivable that vector length could matter at some
point, even if it doesn't matter much now.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > > I suspect that there could be a regression lurking for some inputs > > that the benchmark doesn't look at: pg_lfind32() currently needs to be > > able to read 4 vector registers worth of elements before taking the > > fast path. There is then a tail of up to 15 elements that are now > > checked one-by-one, but AVX2 would increase that to 31. That's getting > > big enough to be noticeable, I suspect. It would be good to understand > > that case (n*32 + 31), because it may also be relevant now. It's also > > easy to improve for SSE2/NEON for v17. > > Good idea. If it is indeed noticeable, we might be able to "fix" it by > processing some of the tail with shorter vectors. But that probably means > finding a way to support multiple vector sizes on the same build, which > would require some work. What I had in mind was an overlapping pattern I've seen in various places: do one iteration at the beginning, then subtract the aligned-down length from the end and do all those iterations. And one-by-one is only used if the total length is small.
On 29.11.23 18:15, Nathan Bossart wrote:
> Using the same benchmark as we did for the SSE2 linear searches in
> XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
>
> writers  sse2  avx2    %
>     256  1195  1188   -1
>     512   928  1054  +14
>    1024   633   716  +13
>    2048   332   420  +27
>    4096   162   203  +25
>    8192   162   182  +12

AFAICT, your patch merely provides an alternative AVX2 implementation for
where currently SSE2 is supported, but it doesn't provide any new API
calls or new functionality.  One might naively expect that these are just
two different ways to call the underlying primitives in the CPU, so these
performance improvements are surprising to me.  Or do the CPUs actually
have completely separate machinery for SSE2 and AVX2, and just using the
latter to do the same thing is faster?
On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut <peter@eisentraut.org> wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> > writers  sse2  avx2    %
> >     256  1195  1188   -1
> >     512   928  1054  +14
> >    1024   633   716  +13
> >    2048   332   420  +27
> >    4096   162   203  +25
> >    8192   162   182  +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality.  One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me.  Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register.  On most current
processors the throughput of the instructions in question is the same on
256-bit vectors as on 128-bit vectors.  Basically, the chip has AVX2 worth
of machinery, and using SSE2 leaves half of it unused.  Notable exceptions
are efficiency cores on recent Intel desktop CPUs and AMD CPUs pre-Zen 2,
where AVX2 instructions are internally split up into two 128-bit wide
instructions.

For AVX-512 the picture is much more complicated.  Some instructions run
at half rate, some at full rate but not on all ALU ports, and some
instructions cause aggressive clock rate reduction on some
microarchitectures.  AVX-512 adds mask registers and masked vector
instructions that enable quite a bit simpler code in many cases.
Interestingly, I have seen Clang make quite effective use of these masked
instructions even when using AVX2 intrinsics, but targeting an AVX-512
capable platform.

The vector width independent approach used in the patch is nice for simple
cases by not needing a separate implementation for each vector width.
However, for more complicated cases where "horizontal" operations are
needed it's going to be much less useful.  But these cases can easily just
drop down to using intrinsics directly.
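To make the width difference concrete, the same equality test can be
written with SSE2 and with AVX2 intrinsics; this is a minimal standalone
illustration rather than code from the patch.  The instruction count per
block is the same, but each AVX2 instruction covers twice as many
elements.

#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* SSE2: one compare instruction tests 4 uint32 elements. */
static bool
any_eq_sse2(uint32_t key, const uint32_t *p)
{
	__m128i keys = _mm_set1_epi32((int) key);
	__m128i vals = _mm_loadu_si128((const __m128i *) p);

	return _mm_movemask_epi8(_mm_cmpeq_epi32(keys, vals)) != 0;
}

/* AVX2: the same idea, but one compare instruction tests 8 elements. */
static bool
any_eq_avx2(uint32_t key, const uint32_t *p)
{
	__m256i keys = _mm256_set1_epi32((int) key);
	__m256i vals = _mm256_loadu_si256((const __m256i *) p);

	return _mm256_movemask_epi8(_mm256_cmpeq_epi32(keys, vals)) != 0;
}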
On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>>> I suspect that there could be a regression lurking for some inputs
>>> that the benchmark doesn't look at: pg_lfind32() currently needs to be
>>> able to read 4 vector registers worth of elements before taking the
>>> fast path.  There is then a tail of up to 15 elements that are now
>>> checked one-by-one, but AVX2 would increase that to 31.  That's getting
>>> big enough to be noticeable, I suspect.  It would be good to understand
>>> that case (n*32 + 31), because it may also be relevant now.  It's also
>>> easy to improve for SSE2/NEON for v17.
>>
>> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
>> processing some of the tail with shorter vectors.  But that probably means
>> finding a way to support multiple vector sizes on the same build, which
>> would require some work.
>
> What I had in mind was an overlapping pattern I've seen in various
> places: do one iteration at the beginning, then subtract the
> aligned-down length from the end and do all those iterations.  And
> one-by-one is only used if the total length is small.

Sorry, I'm not sure I understood this.  Do you mean processing the first
several elements individually or with SSE2 until the number of remaining
elements can be processed with just the AVX2 instructions (a bit like how
pg_comp_crc32c_armv8() is structured for memory alignment)?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, 9 Jan 2024 at 18:20, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path.  There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31.  That's getting
> >> > big enough to be noticeable, I suspect.  It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now.  It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations.  And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

For some operations (min, max, = any) processing the same elements
multiple times doesn't change the result.  So the vectors for first and/or
last iterations can overlap with the main loop.  In other cases it's
possible to mask out the invalid elements and replace them with zeroes.
Something along the lines of:

static inline Vector8
vector8_mask_right(int num_valid)
{
    __m256i seq = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
                                  23, 22, 21, 20, 19, 18, 17, 16,
                                  15, 14, 13, 12, 11, 10, 9, 8,
                                  7, 6, 5, 4, 3, 2, 1, 0);

    return _mm256_cmpgt_epi8(_mm256_set1_epi8(num_valid), seq);
}

/* final incomplete iteration */
Vector8 mask = vector8_mask_right(end - cur);
final_vec = vector8_and(*((Vector8 *) (end - sizeof(Vector8))), mask);
accum = vector8_add(accum, final_vec);

It helps that on any halfway recent x86 unaligned loads only have a minor
performance penalty, and only when straddling cache line boundaries.  Not
sure what the state on ARM is.  If we don't care about unaligned loads
then we only need to care about the load not crossing page boundaries,
which could cause segfaults.  Though I'm sure memory sanitizer tools will
have plenty to complain about around such hacks.
On Tue, Jan 9, 2024 at 11:20 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path.  There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31.  That's getting
> >> > big enough to be noticeable, I suspect.  It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now.  It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations.  And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

If we have say 25 elements, I mean (for SSE2) check the first 16, then
the last 16.  Some will be checked twice, but that's okay.
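A minimal sketch of that overlapping pattern, again using raw SSE2
intrinsics rather than the simd.h abstraction; the helper name is made up,
and the sketch assumes the array holds at least one full 16-element block.

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* check one 16-element block starting at "p" against the broadcast key */
static inline bool
block_has_key(const uint32_t *p, __m128i keys)
{
	__m128i v0 = _mm_loadu_si128((const __m128i *) p);
	__m128i v1 = _mm_loadu_si128((const __m128i *) (p + 4));
	__m128i v2 = _mm_loadu_si128((const __m128i *) (p + 8));
	__m128i v3 = _mm_loadu_si128((const __m128i *) (p + 12));
	__m128i cmp = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi32(keys, v0),
											_mm_cmpeq_epi32(keys, v1)),
							   _mm_or_si128(_mm_cmpeq_epi32(keys, v2),
											_mm_cmpeq_epi32(keys, v3)));

	return _mm_movemask_epi8(cmp) != 0;
}

/*
 * Overlapping tail handling: full blocks from the start of the array, then
 * one final block anchored at the very end.  Elements near the end may be
 * checked twice, which is harmless for a membership test.  Assumes
 * nelem >= 16; shorter arrays would fall back to a scalar loop.
 */
static bool
lfind32_overlap_sketch(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	const __m128i keys = _mm_set1_epi32((int) key);
	uint32_t i;

	for (i = 0; i + 16 <= nelem; i += 16)
	{
		if (block_has_key(&base[i], keys))
			return true;
	}

	/* final, possibly overlapping block covering the last 16 elements */
	return block_has_key(&base[nelem - 16], keys);
}

With AVX2 the block would simply be 32 elements wide; the final
overlapping load is what removes the need for a long one-by-one tail.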
On Wed, Jan 10, 2024 at 09:06:08AM +0700, John Naylor wrote:
> If we have say 25 elements, I mean (for SSE2) check the first 16, then
> the last 16.  Some will be checked twice, but that's okay.

I finally got around to trying this.  0001 adds this overlapping logic.
0002 is a rebased version of the AVX2 patch (it needed some updates after
commit 9f225e9).  And 0003 is a benchmark for test_lfind32().  It runs
pg_lfind32() on an array of the given size 100M times.

I've also attached the results of running this benchmark on my machine at
HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
appears to work pretty well.  When there is a small "tail," it regresses a
small amount, but overall, it seems to improve more cases than it harms.
0002 does regress searches on smaller arrays quite a bit, since it
postpones the SIMD optimizations until the arrays are longer.  It might be
possible to mitigate by using 2 registers when the "tail" is long enough,
but I have yet to try that.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
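A bare-bones timing driver in the spirit of that benchmark might look like
the following; the array contents, element counts, and the
lfind32_sketch() name are assumptions for illustration (it stands in for
pg_lfind32() or one of the sketches shown earlier), not the attached 0003
patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* search routine under test; assumed to be provided elsewhere */
extern bool lfind32_sketch(uint32_t key, const uint32_t *base, uint32_t nelem);

int
main(int argc, char **argv)
{
	uint32_t	nelem = (argc > 1) ? (uint32_t) atoi(argv[1]) : 16;
	uint32_t   *arr = malloc(nelem * sizeof(uint32_t));
	struct timespec start, end;
	volatile bool found = false;

	/* worst case: the key is never present, so every element is inspected */
	for (uint32_t i = 0; i < nelem; i++)
		arr[i] = i + 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < 100 * 1000 * 1000; i++)
		found |= lfind32_sketch(0, arr, nelem);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("nelem=%u found=%d elapsed=%.3fs\n", nelem, (int) found,
		   (end.tv_sec - start.tv_sec) +
		   (end.tv_nsec - start.tv_nsec) / 1e9);
	free(arr);
	return 0;
}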
On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote: > I've also attached the results of running this benchmark on my machine at > HEAD, after applying 0001, and after applying both 0001 and 0002. 0001 > appears to work pretty well. When there is a small "tail," it regresses a > small amount, but overall, it seems to improve more cases than it harms. > 0002 does regress searches on smaller arrays quite a bit, since it > postpones the SIMD optimizations until the arrays are longer. It might be > possible to mitigate by using 2 registers when the "tail" is long enough, > but I have yet to try that. The attached 0003 is a sketch of what such mitigation might look like. It appears to help with the regressions nicely. I omitted the benchmarking patch in v3 to appease cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Mar 16, 2024 at 2:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote: > > I've also attached the results of running this benchmark on my machine at > > HEAD, after applying 0001, and after applying both 0001 and 0002. 0001 > > appears to work pretty well. When there is a small "tail," it regresses a > > small amount, but overall, it seems to improve more cases than it harms. > > 0002 does regress searches on smaller arrays quite a bit, since it > > postpones the SIMD optimizations until the arrays are longer. It might be > > possible to mitigate by using 2 registers when the "tail" is long enough, > > but I have yet to try that. > > The attached 0003 is a sketch of what such mitigation might look like. It > appears to help with the regressions nicely. I omitted the benchmarking > patch in v3 to appease cfbot. I haven't looked at the patches, but the graphs look good.
On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote: > I haven't looked at the patches, but the graphs look good. I spent some more time on these patches. Specifically, I reordered them to demonstrate the effects on systems without AVX2 support. I've also added a shortcut to jump to the one-by-one approach when there aren't many elements, as the overhead becomes quite noticeable otherwise. Finally, I ran the same benchmarks again on x86 and Arm out to 128 elements. Overall, I think 0001 and 0002 are in decent shape, although I'm wondering if it's possible to improve the style a bit. 0003 at least needs a big comment in simd.h, and it might need a note in the documentation, too. If the approach in this patch set seems reasonable, I'll spend some time on that. BTW I did try to add some other optimizations, such as processing remaining elements with only one vector and trying to use the overlapping strategy with more registers if we know there are relatively many remaining elements. These other approaches all added a lot of complexity and began hurting performance, and I've probably already spent way too much time optimizing a linear search, so this is where I've decided to stop. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Mar 19, 2024 at 9:03 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> > I haven't looked at the patches, but the graphs look good.
>
> I spent some more time on these patches.  Specifically, I reordered them to
> demonstrate the effects on systems without AVX2 support.  I've also added a
> shortcut to jump to the one-by-one approach when there aren't many
> elements, as the overhead becomes quite noticeable otherwise.  Finally, I
> ran the same benchmarks again on x86 and Arm out to 128 elements.
>
> Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
> if it's possible to improve the style a bit.

I took a brief look, and 0001 isn't quite what I had in mind.  I can't
quite tell what it's doing with the additional branches and "goto retry",
but I meant something pretty simple:

- if short, do one element at a time and return
- if long, do one block unconditionally, then round the start pointer up
  so that "end - start" is an exact multiple of blocks, and loop over them
On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote: > I took a brief look, and 0001 isn't quite what I had in mind. I can't > quite tell what it's doing with the additional branches and "goto > retry", but I meant something pretty simple: Do you mean 0002? 0001 just adds a 2-register loop for remaining elements once we've exhausted what can be processed with the 4-register loop. > - if short, do one element at a time and return 0002 does this. > - if long, do one block unconditionally, then round the start pointer > up so that "end - start" is an exact multiple of blocks, and loop over > them 0002 does the opposite of this. That is, after we've completed as many blocks as possible, we move the iterator variable back to "end - block_size" and do one final iteration to cover all the remaining elements. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote: > > I took a brief look, and 0001 isn't quite what I had in mind. I can't > > quite tell what it's doing with the additional branches and "goto > > retry", but I meant something pretty simple: > > Do you mean 0002? 0001 just adds a 2-register loop for remaining elements > once we've exhausted what can be processed with the 4-register loop. Sorry, I was looking at v2 at the time. > > - if short, do one element at a time and return > > 0002 does this. That part looks fine. > > - if long, do one block unconditionally, then round the start pointer > > up so that "end - start" is an exact multiple of blocks, and loop over > > them > > 0002 does the opposite of this. That is, after we've completed as many > blocks as possible, we move the iterator variable back to "end - > block_size" and do one final iteration to cover all the remaining elements. Sounds similar in principle, but it looks really complicated. I don't think the additional loops and branches are a good way to go, either for readability or for branch prediction. My sketch has one branch for which loop to do, and then performs only one loop. Let's do the simplest thing that could work. (I think we might need a helper function to do the block, but the rest should be easy)
On Tue, Mar 19, 2024 at 04:53:04PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> 0002 does the opposite of this.  That is, after we've completed as many
>> blocks as possible, we move the iterator variable back to "end -
>> block_size" and do one final iteration to cover all the remaining elements.
>
> Sounds similar in principle, but it looks really complicated.  I don't
> think the additional loops and branches are a good way to go, either
> for readability or for branch prediction.  My sketch has one branch for
> which loop to do, and then performs only one loop.  Let's do the
> simplest thing that could work.  (I think we might need a helper
> function to do the block, but the rest should be easy)

I tried to trim some of the branches, and came up with the attached patch.
I don't think this is exactly what you were suggesting, but I think it's
relatively close.  My testing showed decent benefits from using 2 vectors
when there aren't enough elements for 4, so I've tried to keep that part
intact.  This changes pg_lfind32() to something like:

	if not many elements
		process one by one

	while enough elements for 4 registers remain
		process with 4 registers

	if no elements remain
		return false

	if more than 2-registers-worth of elements remain
		do one iteration with 2 registers

	do another iteration on last 2-registers-worth of elements

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > Sounds similar in principle, but it looks really complicated.  I don't
> > think the additional loops and branches are a good way to go, either
> > for readability or for branch prediction.  My sketch has one branch for
> > which loop to do, and then performs only one loop.  Let's do the
> > simplest thing that could work.  (I think we might need a helper
> > function to do the block, but the rest should be easy)
>
> I tried to trim some of the branches, and came up with the attached patch.
> I don't think this is exactly what you were suggesting, but I think it's
> relatively close.  My testing showed decent benefits from using 2 vectors
> when there aren't enough elements for 4, so I've tried to keep that part
> intact.

I would caution against that if the benchmark is repeatedly running
against a static number of elements, because the branch predictor will be
right all the time (except maybe when it exits a loop, not sure).  We
probably don't need to go to the trouble to construct a benchmark with
some added randomness, but we have to be careful not to overfit what the
test is actually measuring.
On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I tried to trim some of the branches, and came up with the attached patch.
>> I don't think this is exactly what you were suggesting, but I think it's
>> relatively close.  My testing showed decent benefits from using 2 vectors
>> when there aren't enough elements for 4, so I've tried to keep that part
>> intact.
>
> I would caution against that if the benchmark is repeatedly running
> against a static number of elements, because the branch predictor will
> be right all the time (except maybe when it exits a loop, not sure).
> We probably don't need to go to the trouble to construct a benchmark
> with some added randomness, but we have to be careful not to overfit
> what the test is actually measuring.

I don't mind removing the 2-register stuff if that's what you think we
should do.  I'm cautiously optimistic that it'd help more than the extra
branch prediction might hurt, and it'd at least help avoid regressing the
lower end for the larger AVX2 registers, but I probably won't be able to
prove that without constructing another benchmark.  And TBH I'm not sure
it'll significantly impact any real-world workload, anyway.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
>> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
>> <nathandbossart@gmail.com> wrote:
>>> I tried to trim some of the branches, and came up with the attached patch.
>>> I don't think this is exactly what you were suggesting, but I think it's
>>> relatively close.  My testing showed decent benefits from using 2 vectors
>>> when there aren't enough elements for 4, so I've tried to keep that part
>>> intact.
>>
>> I would caution against that if the benchmark is repeatedly running
>> against a static number of elements, because the branch predictor will
>> be right all the time (except maybe when it exits a loop, not sure).
>> We probably don't need to go to the trouble to construct a benchmark
>> with some added randomness, but we have to be careful not to overfit
>> what the test is actually measuring.
>
> I don't mind removing the 2-register stuff if that's what you think we
> should do.  I'm cautiously optimistic that it'd help more than the extra
> branch prediction might hurt, and it'd at least help avoid regressing the
> lower end for the larger AVX2 registers, but I probably won't be able to
> prove that without constructing another benchmark.  And TBH I'm not sure
> it'll significantly impact any real-world workload, anyway.

Here's a new version of the patch set with the 2-register stuff removed,
plus a fresh run of the benchmark.  The weird spike for AVX2 is what led
me down the 2-register path earlier.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 21, 2024 at 2:55 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> > I don't mind removing the 2-register stuff if that's what you think we
> > should do.  I'm cautiously optimistic that it'd help more than the extra
> > branch prediction might hurt, and it'd at least help avoid regressing the
> > lower end for the larger AVX2 registers, but I probably won't be able to
> > prove that without constructing another benchmark.  And TBH I'm not sure
> > it'll significantly impact any real-world workload, anyway.
>
> Here's a new version of the patch set with the 2-register stuff removed,

I'm much happier about v5-0001.  With a small tweak it would match what I
had in mind:

+ if (nelem < nelem_per_iteration)
+ goto one_by_one;

If this were "<=" then for long arrays we could assume there is always
more than one block, and wouldn't need to check if any elements remain --
first block, then a single loop and it's done.

The loop could also then be a "do while" since it doesn't have to check
the exit condition up front.

> plus a fresh run of the benchmark.  The weird spike for AVX2 is what led me
> down the 2-register path earlier.

Yes, that spike is weird, because it seems super-linear.  However, the
more interesting question for me is: AVX2 isn't really buying much for the
numbers covered in this test.  Between 32 and 48 elements, and between 64
and 80, it's indistinguishable from SSE2.  The jumps to the next shelf are
postponed, but the jumps are just as high.  From earlier system
benchmarks, I recall it eventually wins out with hundreds of elements,
right?  Is that still true?

Further, now that the algorithm is more SIMD-appropriate, I wonder what
doing 4 registers at a time is actually buying us for either SSE2 or AVX2.
It might just be a matter of scale, but that would be good to understand.
On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
> I'm much happier about v5-0001.  With a small tweak it would match what
> I had in mind:
>
> + if (nelem < nelem_per_iteration)
> + goto one_by_one;
>
> If this were "<=" then for long arrays we could assume there is
> always more than one block, and wouldn't need to check if any elements
> remain -- first block, then a single loop and it's done.
>
> The loop could also then be a "do while" since it doesn't have to
> check the exit condition up front.

Good idea.  That causes us to re-check all of the tail elements when the
number of elements is evenly divisible by nelem_per_iteration, but that
might be worth the trade-off.

> Yes, that spike is weird, because it seems super-linear.  However, the
> more interesting question for me is: AVX2 isn't really buying much for
> the numbers covered in this test.  Between 32 and 48 elements, and
> between 64 and 80, it's indistinguishable from SSE2.  The jumps to the
> next shelf are postponed, but the jumps are just as high.  From earlier
> system benchmarks, I recall it eventually wins out with hundreds of
> elements, right?  Is that still true?

It does still eventually win, although not nearly to the same extent as
before.  I extended the benchmark a bit to show this.  I wouldn't be
devastated if we only got 0001 committed for v17, given these results.

> Further, now that the algorithm is more SIMD-appropriate, I wonder
> what doing 4 registers at a time is actually buying us for either SSE2
> or AVX2.  It might just be a matter of scale, but that would be good to
> understand.

I'll follow up with these numbers shortly.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Mar 21, 2024 at 12:09:44PM -0500, Nathan Bossart wrote: > It does still eventually win, although not nearly to the same extent as > before. I extended the benchmark a bit to show this. I wouldn't be > devastated if we only got 0001 committed for v17, given these results. (In case it isn't clear from the graph, after 128 elements, I only tested at 200, 300, 400, etc. elements.) -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Mar 21, 2024 at 12:09:44PM -0500, Nathan Bossart wrote: > On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote: >> Further, now that the algorithm is more SIMD-appropriate, I wonder >> what doing 4 registers at a time is actually buying us for either SSE2 >> or AVX2. It might just be a matter of scale, but that would be good to >> understand. > > I'll follow up with these numbers shortly. It looks like the 4-register code still outperforms the 2-register code, except for a handful of cases where there aren't many elements. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Here's a new version of 0001 with some added #ifdefs that cfbot revealed were missing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, Mar 24, 2024 at 03:53:17PM -0500, Nathan Bossart wrote: > Here's a new version of 0001 with some added #ifdefs that cfbot revealed > were missing. Sorry for the noise. cfbot revealed another silly mistake (forgetting to reset the "i" variable in the assertion path). That should be fixed in v8. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Mar 22, 2024 at 12:09 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
> > If this were "<=" then for long arrays we could assume there is
> > always more than one block, and wouldn't need to check if any elements
> > remain -- first block, then a single loop and it's done.
> >
> > The loop could also then be a "do while" since it doesn't have to
> > check the exit condition up front.
>
> Good idea.  That causes us to re-check all of the tail elements when the
> number of elements is evenly divisible by nelem_per_iteration, but that
> might be worth the trade-off.

Yeah, if there's no easy way to avoid that it's probably fine.  I wonder
if we can subtract one first to force even multiples to round down,
although I admit I haven't thought through the consequences of that.

> [v8]

Seems pretty good.  It'd be good to see the results of 2- vs. 4-register
before committing, because that might lead to some restructuring, but
maybe it won't, and v8 is already an improvement over HEAD.

  /* Process the remaining elements one at a time. */

This now does all of them if that path is taken, so "remaining" can be
removed.
On Mon, Mar 25, 2024 at 10:03:27AM +0700, John Naylor wrote: > Seems pretty good. It'd be good to see the results of 2- vs. > 4-register before committing, because that might lead to some > restructuring, but maybe it won't, and v8 is already an improvement > over HEAD. I tested this the other day [0] (only for x86). The results seemed to indicate that the 4-register approach was still quite a bit better. > /* Process the remaining elements one at a time. */ > > This now does all of them if that path is taken, so "remaining" can be removed. Right, will do. [0] https://postgr.es/m/20240321183823.GA1800896%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Here is what I have staged for commit.  One notable difference in this
version of the patch is that I've changed

+ if (nelem <= nelem_per_iteration)
+ goto one_by_one;

to

+ if (nelem < nelem_per_iteration)
+ goto one_by_one;

I realized that there's no reason to jump to the one-by-one linear search
code when nelem == nelem_per_iteration, as the worst thing that will
happen is that we'll process all the elements twice if the value isn't
present in the array.  My benchmark that I've been using also shows a
significant speedup for this case with this change (on the order of 75%),
which I imagine might be due to a combination of branch prediction,
caching, fewer instructions, etc.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
I've committed v9, and I've marked the commitfest entry as "Committed," although we may want to revisit AVX2, etc. in the future. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
> I've committed v9, and I've marked the commitfest entry as "Committed,"
> although we may want to revisit AVX2, etc. in the future.

A significant fraction of the buildfarm is issuing warnings about this.

 adder     | 2024-03-26 21:04:33 | ../pgsql/src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 buri      | 2024-03-26 21:16:09 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cavefish  | 2024-03-26 22:53:23 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cisticola | 2024-03-26 22:20:07 | ../../../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 lancehead | 2024-03-26 21:48:17 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 nicator   | 2024-03-26 21:08:14 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 nuthatch  | 2024-03-26 22:00:04 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 rinkhals  | 2024-03-26 19:51:32 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 siskin    | 2024-03-26 19:59:29 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]

			regards, tom lane
On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> I've committed v9, and I've marked the commitfest entry as "Committed," >> although we may want to revisit AVX2, etc. in the future. > > A significant fraction of the buildfarm is issuing warnings about > this. Thanks for the heads-up. Will fix. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote: > On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote: >> A significant fraction of the buildfarm is issuing warnings about >> this. > > Thanks for the heads-up. Will fix. Done. I'll keep an eye on the farm. I just did the minimal fix for now, i.e., I moved the new label into the SIMD section of the function. I think it would be better stylistically to move the one-by-one logic to an inline helper function, but I didn't do that just in case it might negatively impact performance. I'll look into this and will follow up with another patch if it looks good. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote: >> On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote: >>> A significant fraction of the buildfarm is issuing warnings about >>> this. > Done. I'll keep an eye on the farm. Thanks. > I just did the minimal fix for now, i.e., I moved the new label into the > SIMD section of the function. I think it would be better stylistically to > move the one-by-one logic to an inline helper function, but I didn't do > that just in case it might negatively impact performance. I'll look into > this and will follow up with another patch if it looks good. Sounds like a plan. regards, tom lane
On Tue, Mar 26, 2024 at 09:48:57PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I just did the minimal fix for now, i.e., I moved the new label into the
>> SIMD section of the function.  I think it would be better stylistically to
>> move the one-by-one logic to an inline helper function, but I didn't do
>> that just in case it might negatively impact performance.  I'll look into
>> this and will follow up with another patch if it looks good.
>
> Sounds like a plan.

Here's what I had in mind.  My usual benchmark seems to indicate that this
shouldn't impact performance.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
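The refactoring being described might look roughly like this; the function
name and signature are illustrative, not the actual pg_lfind.h code.

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the extracted helper: a plain linear search, usable both for
 * short arrays and as the fallback when SIMD support is unavailable.
 */
static inline bool
lfind32_one_by_one(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	for (uint32_t i = 0; i < nelem; i++)
	{
		if (key == base[i])
			return true;
	}

	return false;
}

With the scalar loop in its own inline function, the non-SIMD build can
call it directly instead of jumping to a label that only the SIMD path
otherwise uses, which is what eliminated the unused-label warnings and
simplified the #if structure.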
Nathan Bossart <nathandbossart@gmail.com> writes: > Here's what I had in mind. My usual benchmark seems to indicate that this > shouldn't impact performance. Shouldn't "i" be declared uint32, since nelem is? BTW, I wonder why these functions don't declare their array arguments like "const uint32 *base". LGTM otherwise, and I like the fact that the #if structure gets a lot less messy. regards, tom lane
On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote: > Shouldn't "i" be declared uint32, since nelem is? Yes, that's a mistake. > BTW, I wonder why these functions don't declare their array > arguments like "const uint32 *base". They probably should. I don't see any reason not to, and my compiler doesn't complain, either. > LGTM otherwise, and I like the fact that the #if structure > gets a lot less messy. Thanks for reviewing. I've attached a v2 that I intend to commit when I get a chance. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Mar 27, 2024 at 04:37:35PM -0500, Nathan Bossart wrote: > On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote: >> LGTM otherwise, and I like the fact that the #if structure >> gets a lot less messy. > > Thanks for reviewing. I've attached a v2 that I intend to commit when I > get a chance. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com