Thread: RE: Improve CRC32C performance on SSE4.2
Hi John,
> I'm highly suspicious of these numbers because they show this version
> is about 20x faster than "scalar", so relatively speaking 3x faster
> than the AVX-512 proposal?
Apologies for this. I was suspicious of this too, and it looks like I had unintentionally used the scalar version I wrote for testing CRC32C correctness (which computes the CRC one byte at a time). The code for microbenchmarking is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:
Comparing scalar_crc32c to sse42_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.0972 -0.0971 5 4 5 4
[scalar_crc32c vs. sse42_crc32c]/128 -0.3048 -0.3048 8 6 8 6
[scalar_crc32c vs. sse42_crc32c]/256 -0.4610 -0.4610 19 10 19 10
[scalar_crc32c vs. sse42_crc32c]/512 -0.6432 -0.6432 50 18 50 18
[scalar_crc32c vs. sse42_crc32c]/1024 -0.7192 -0.7192 121 34 121 34
[scalar_crc32c vs. sse42_crc32c]/2048 -0.7275 -0.7276 259 70 259 70
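For context, a byte-at-a-time scalar CRC32C like the correctness-testing baseline above looks roughly like this (a minimal sketch over the reflected Castagnoli polynomial; not the repository's actual code). The inner loop handles one bit of each input byte per iteration, which is why it is so much slower than the hardware crc32 instruction:

#include <stddef.h>
#include <stdint.h>

/* bitwise CRC32C, reflected polynomial 0x82F63B78 */
static uint32_t
crc32c_bytewise(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc = 0xFFFFFFFF;

    while (len-- > 0)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & (0 - (crc & 1)));
    }
    return ~crc;
}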
> Luckily, that's easily fixable: It turns out the implementation
> in the paper (and chromium) has a very inefficient finalization step,
> using more carryless multiplications and plenty of other operations.
> After the main loop, and at the end, it's much more efficient to
> convert the 128-bit intermediate result directly into a CRC in the
> usual way.
Thank you for pointing this out and also fixing it! This improves over the chromium version by 10-25%, especially for smaller buffer sizes (64 – 512 bytes):
Comparing sse42_crc32c to corsix_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[sse42_crc32c vs. corsix_crc32c]/64 -0.2696 -0.2696 4 3 4 3
[sse42_crc32c vs. corsix_crc32c]/128 -0.1551 -0.1552 6 5 6 5
[sse42_crc32c vs. corsix_crc32c]/256 -0.1787 -0.1787 10 8 10 8
[sse42_crc32c vs. corsix_crc32c]/512 -0.1351 -0.1351 18 15 18 15
[sse42_crc32c vs. corsix_crc32c]/1024 -0.0972 -0.0972 34 31 34 31
[sse42_crc32c vs. corsix_crc32c]/2048 -0.0763 -0.0763 69 64 69 64
OVERALL_GEOMEAN -0.1544 -0.1544 0 0 0 0
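For those following along, the finalization John describes amounts to reducing the 128-bit clmul accumulator with the native crc32 instruction rather than with more carryless multiplications. A minimal sketch (it assumes the fold loop has maintained the invariant that the CRC of the accumulator's 16 bytes equals the desired CRC, with the usual bit inversions handled by the caller; not the actual patch code):

#include <stdint.h>
#include <nmmintrin.h>      /* SSE 4.2 / SSE 4.1 intrinsics */

static inline uint32_t
fold128_to_crc32c(__m128i x)
{
    uint32_t    crc = 0;

    /* feed the two 64-bit halves through the crc32 instruction */
    crc = (uint32_t) _mm_crc32_u64(crc, (uint64_t) _mm_extract_epi64(x, 0));
    crc = (uint32_t) _mm_crc32_u64(crc, (uint64_t) _mm_extract_epi64(x, 1));
    return crc;
}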
> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.
I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes. I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you confirm that behavior on your end too?
---
src/port/pg_crc32c_sse42.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index a8c1e5609b..a350b1b93a 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
pg_attribute_no_sanitize_alignment()
+__attribute__((optimize("-O3")))
pg_attribute_target("sse4.2,pclmul")
pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)
--
You could also just build with export CFLAGS="-O3" instead of adding the function attribute.
> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.
> It's probably okay to fold these together in the same compile-time
> check, since both are fairly old by now, but for those following
> along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> cause machines building on Nehalem (2008) to fail the check and go
> back to slicing-by-8 with it written this way.
Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul, but it would fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
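For concreteness, the extra check would look something like this (GCC/Clang sketch with an illustrative function name; CPUID leaf 1 reports SSE 4.2 in ECX bit 20 and PCLMULQDQ in ECX bit 1, the bit Nehalem lacks):

#include <stdbool.h>
#include <cpuid.h>

static void
check_crc32c_caps(bool *have_sse42, bool *have_pclmul)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    *have_sse42 = (ecx & (1 << 20)) != 0;   /* crc32 instruction */
    *have_pclmul = (ecx & (1 << 1)) != 0;   /* Westmere and later */
}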
Raghuveer
From: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Subject: Improve CRC32C performance on SSE4.2
This patch improves the performance of the SSE42 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the sse42 version of crc32 computation from Chromium’s copy of zlib, with modified constants for crc32c computation. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
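For reference, the existing 8-bytes-at-a-time approach is essentially the following (a simplified sketch, not the actual pg_comp_crc32c_sse42):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>

static uint32_t
crc32c_sse42_8byte(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /* native crc32 instruction, 8 bytes per iteration */
    while (len >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, p, 8);
        crc = (uint32_t) _mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    /* remaining tail, one byte at a time */
    while (len-- > 0)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}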
Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):
Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.8147 -0.8148 33 6 33 6
[scalar_crc32c vs. sse42_crc32c]/128 -0.8962 -0.8962 88 9 88 9
[scalar_crc32c vs. sse42_crc32c]/256 -0.9200 -0.9200 211 17 211 17
[scalar_crc32c vs. sse42_crc32c]/512 -0.9389 -0.9389 486 30 486 30
[scalar_crc32c vs. sse42_crc32c]/1024 -0.9452 -0.9452 1037 57 1037 57
[scalar_crc32c vs. sse42_crc32c]/2048 -0.9456 -0.9456 2140 116 2140 116
Raghuveer
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes.
That matches my findings as well.
> I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you confirm that behavior on your end too?
On my machine that still regresses compared to master in that range (although by not as much), so I still think 128 bytes is the right threshold. The effect of -O3 with gcc 14.2 is that the single-block loop (after the 4-block loop) is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that even for systems that build with -O3. I tried leaving out the single-block loop, so that the tail is called for the remaining <63 bytes, and it's actually better:
v2:
128 latency average = 10.256 ms
144 latency average = 11.393 ms
160 latency average = 12.977 ms
176 latency average = 14.364 ms
192 latency average = 12.627 ms
remove single-block loop:
128 latency average = 10.211 ms
144 latency average = 10.987 ms
160 latency average = 12.094 ms
176 latency average = 12.927 ms
192 latency average = 12.466 ms
Keeping the extra loop is probably better at this benchmark on newer machines, but I don't think it's worth it. Microbenchmarks with fixed-size input don't take branch mispredicts into account.
> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
> May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.
It's an i7-7500U / Kaby Lake.
> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul, but it would fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
Okay, Nehalem is 17 years old, and the additional cpuid check would still work on hardware 14-15 years old, so I think it's fine to bump the requirement for runtime hardware support.
--
John Naylor
Amazon Web Services
Hi,
> 2. Unfortunately, there is another wrinkle that I failed to consider: If you search
> the web for "VirtualBox pclmulqdq" you can see a few reports from not very long
> ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
> big a problem that is in practice today, but it seems we should actually have
> separate checks, with fallback. Sorry I didn't think of this earlier.
If someone is using a VM that doesn't support a 15-year-old feature, then I would argue performance is not the top priority for them. But that’s up to you. I will work on adding it unless you change your mind.
Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK? The former macro is used to enable running the SSE42 version without a runtime check when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it. The cpuid runtime dispatch runs just once when the postgres server is first set up and would hardly affect performance. Let me know what you think.
Raghuveer
On Wed, Feb 12, 2025 at 09:02:27PM +0000, Devulapalli, Raghuveer wrote:
> Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK?
> The former macro is used to enable running the SSE42 version without a runtime check
> when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
> as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
> The cpuid runtime dispatch runs just once when the postgres server is first set up and would hardly affect performance.
> Let me know what you think.
I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer overhead if possible. I looked at switching to always using runtime checks for this stuff, and we concluded that we'd better not [0].
[0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13
--
nathan
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks for this
> stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13
Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly; I might need to think about a cleaner way to do this.
Raghuveer
On Wed, Feb 12, 2025 at 09:48:57PM +0000, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible. I looked at switching to always using runtime checks for this
>> stuff, and we concluded that we'd better not [0].
>
> Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly; I might need to think about a cleaner way to do this.
Well, I suspect the AVX-512 version will pretty much always need the runtime check given that it's not available on a lot of newer hardware and requires a bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL optimizations if built with -msse4.2 -mpclmul, because we don't want to regress existing -msse4.2 users with a runtime check.
--
nathan
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that it's not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
> worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
> existing -msse4.2 users with a runtime check.
Sounds good to me. Although, users building with just -msse4.2 will now encounter an additional pclmul runtime check. That would be a regression unless they update to building with both -msse4.2 and -mpclmul.
Raghuveer
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that it's not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
>> worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
>> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
>> existing -msse4.2 users with a runtime check.
>
> Sounds good to me. Although, users building with just -msse4.2 will now encounter an
> additional pclmul runtime check. That would be a regression unless they update to
> building with both -msse4.2 and -mpclmul.
My thinking was that building with just -msse4.2 would cause the existing SSE 4.2 implementation to be used (without the function pointer). That's admittedly a bit goofy because they'd miss out on the PCLMUL optimization, but things at least don't get any worse for them.
--
nathan
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the existing SSE 4.2
> implementation to be used (without the function pointer). That's admittedly a bit
> goofy because they'd miss out on the PCLMUL optimization, but things at least
> don't get any worse for them.
Right. We are only talking about a regression for the small portion of people who build with -msse4.2 and run on Nehalem or a VM with pclmul disabled, where we will run the cpuid check for pclmul and still pick the sse4.2 version.
Raghuveer
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtime check for pclmul.
> - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> inline calls with short constant input (e.g. WAL insert lock),
> although I've not done that here.
> - This can be a simple starting point for consolidating runtime
> checks, as was proposed for popcount in the AVX-512 CRC thread, but
> with branching my model was Andres' sketch here:
>
> https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de
Oh, nifty. IIUC this should help avoid quite a bit of overhead even before adding the PCLMUL stuff since it removes the function pointers for the runtime-check builds.
While this needn't block this patch set, I do find the dispatch code to be pretty complicated. Maybe we can improve that in the future by using macros to auto-generate much of it. My concern here is less about this particular patch set and more about the long term maintainability as we add more and more stuff like it, each with its own tangled web of build and dispatch rules.
--
nathan
Hi John,
I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now, so I will refrain from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me. See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/
Raghuveer
> -----Original Message-----
> From: John Naylor <johncnaylorls@gmail.com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandbossart@gmail.com>
> Cc: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
> Subject: Re: Improve CRC32C performance on SSE4.2
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de
> >
> > Oh, nifty. IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
>
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it working so
> I'm probably missing a step.)
>
> master:
>
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
>
> v6:
>
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
>
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated. Maybe we can improve that in the future by
> > using macros to auto-generate much of it. My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
>
> I think the runtime dispatch is fairly simple for this case. To my taste, the worse
> maintainability hazard is the building.
> To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
>
> The first two could be done as part of this patch or soon after. The third is a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
>
> There's another potential irritation for maintenance too: I spent too much time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the function
> pointer way, so I'm open to other ideas.
>
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services
Hi John,
Just a few comments on v7:
> pg_cpucap_crc32c
I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy to extend in the future and keeps the functionalities separate.
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
+1. Sounds perfect. We should also move the avx512 runtime detection of popcount here. The runtime detection code could also be annotated with __attribute__((constructor)) so that it gets executed before main.
> I think the runtime dispatch is fairly simple for this case.
+ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+ {
+ ....
+ #ifdef HAVE_PCLMUL_RUNTIME
+ if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.
Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or can continue development work based on any feedback. Please let me know which you'd prefer.
Raghuveer
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> The runtime detection code could also be annotated with __attribute__((constructor)) so that it gets executed before main.
Hmm! I didn't know about that, thanks! It works on old gcc/clang, so that's good. I can't verify online if Arm etc. executes properly since the execute button is greyed out, but I don't see why it wouldn't:
https://godbolt.org/z/3as9MGr3G
Then we could have:
#ifdef FRONTEND
pg_attribute_constructor()
#endif
void
pg_cpucap_initialize(void)
{
...
}
Now, there is no equivalent on MSVC, and workarounds are fragile [1]. Maybe we could only assert initialization happened for the backend, and for the frontend either
- add a couple manual initializations to the frontend programs where we don't want to lose performance for non-gcc/clang, or
- require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas M.'s earlier findings on popcount (also SSE4.2) [2].
The first is less risky but less tidy.
> > I think the runtime dispatch is fairly simple for this case.
> + pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
> + {
> + ....
> + #ifdef HAVE_PCLMUL_RUNTIME
> + if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
>
> IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.
That makes sense, but it does close the door on future inlining.
> Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or can continue development work based on any feedback. Please let me know which you'd prefer.
Sure. Depending on any feedback on the above constructor technique, I'll work on the following, since I've prototyped most of the first and the second is mostly copy-and-paste from your earlier work:
> > pg_cpucap_crc32c
>
> I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy to extend in the future and keeps the functionalities separate.
>
> > - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> > unconditionally for each platform
>
> +1. Sounds perfect. We should also move the avx512 runtime detection of popcount here.
[1] https://stackoverflow.com/questions/1113409/attribute-constructor-equivalent-in-vc
[2] https://www.postgresql.org/message-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com
--
John Naylor
Amazon Web Services
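A minimal standalone demonstration of the constructor approach being discussed (GCC/Clang only; the function and variable names here are illustrative, not from the patch):

#include <stdio.h>

static int  cpu_caps_initialized = 0;

/* runs before main(), so callers never see uninitialized capabilities */
__attribute__((constructor))
static void
init_cpu_caps(void)
{
    /* ... run cpuid and record features here ... */
    cpu_caps_initialized = 1;
}

int
main(void)
{
    printf("initialized before main: %d\n", cpu_caps_initialized);
    return 0;
}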
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend and for
> the frontend either
> - add a couple manual initializations to the frontend programs where we don't
> want to lose performance for non-gcc/clang.
> - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> M.'s earlier findings on popcount (also SSE4.2) [2]
>
> The first is less risky but less tidy.
Agree, let me think about this, but I'm not sure if I have any useful suggestions here. MSVC is unfortunately not my strong suit :/
Raghuveer
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend and for
> > the frontend either
> > - add a couple manual initializations to the frontend programs where we don't
> > want to lose performance for non-gcc/clang.
> > - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> > M.'s earlier findings on popcount (also SSE4.2) [2]
> >
> > The first is less risky but less tidy.
>
> Agree, let me think about this, but I'm not sure if I have any useful suggestions here. MSVC is unfortunately not my strong suit :/
Here's another idea to make it more automatic: Give up on initializing every capability at once. The first time we call CRC, it will be uninitialized, so this part:
if (pg_cpucap & PGCPUCAP_CRC32C)
    return COMP_CRC32C_HW(crc, data, len);
else
    return pg_comp_crc32c_sb8(crc, data, len);
...will call the SB8 path. Inside there, do the check:
#if defined(HAVE_CRC_RUNTIME)
// separate init bit for each capability
if (unlikely((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0))
{
    pg_cpucap_crc32c(); // also sets PGCPUCAP_CRC32C_INIT
    if (pg_cpucap & PGCPUCAP_CRC32C)
        return COMP_CRC32C_HW(crc, data, len);
}
#endif
// ...fall through to SB8
--
John Naylor
Amazon Web Services
> Here's another idea to make it more automatic: Give up on initializing every
> capability at once.
I'm not sure I like giving up on this. Initializing and running the CPUID check with the constructor attribute is very valuable for two reasons: (1) you get everything done at load time before main, and (2) you don't have to run the cpuid check for every feature (popcount, crc32c, or anything else you add in the future) multiple times. It keeps the cpuid functionality in a central place, which makes it a modular design.
On MSVC, we could have the first SIMD feature call pg_cpucap_initialize(), which runs CPUID and stores the cpu features. Any subsequent call can skip it (because it has already been initialized) by using a static variable or some other approach. Does this make sense?
Raghuveer
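A sketch of that MSVC-friendly initialize-once guard (illustrative names; the cpuid body is elided):

#include <stdbool.h>

static bool pg_cpucap_initialized = false;

void
pg_cpucap_initialize(void)
{
    /* subsequent calls are cheap no-ops */
    if (pg_cpucap_initialized)
        return;
    /* ... run cpuid once here and fill in the capability flags ... */
    pg_cpucap_initialized = true;
}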
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>
> > Here's another idea to make it more automatic: Give up on initializing every
> > capability at once.
>
> I'm not sure I like giving up on this. Initializing and running the CPUID check with the constructor attribute is very valuable for two reasons: (1) you get everything done at load time before main, and (2) you don't have to run the cpuid check for every feature (popcount, crc32c, or anything else you add in the future) multiple times. It keeps the cpuid functionality in a central place, which makes it a modular design.
I agree it would be preferable to make a centralized check work.
> On MSVC, we could have the first SIMD feature call pg_cpucap_initialize(), which runs CPUID and stores the cpu features. Any subsequent call can skip it (because it has already been initialized) by using a static variable or some other approach. Does this make sense?
Correct me if I'm misunderstanding, but this sounds like in every frontend program we'd need to know what the first call was, which seems less maintainable than just initializing at the start of every frontend program.
--
John Naylor
Amazon Web Services
On Tue, Feb 18, 2025 at 1:40 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > While this needn't block this patch set, I do find the dispatch code to be
> > pretty complicated. Maybe we can improve that in the future by using
> > macros to auto-generate much of it. My concern here is less about this
> > particular patch set and more about the long term maintainability as we add
> > more and more stuff like it, each with its own tangled web of build and
> > dispatch rules.
I had a further thought on this: CRC and non-vector popcount are kind of special in that recent OSes assume they exist, and it's worth a bit of effort to take advantage of that. Other things we may add should be kept as simple as possible.
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build
> them unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols
> where possible (I guess anything needing AVX-512 will still be a mess)
I've made a start of this for v8:
0001 is mostly the same as before
0002 (Meson-only for now) changes 0001 per the above, to see how it looks, but I've not tried to add popcount or anything else. I like it overall, but some details may need tweaking.
0004 generates the pclmul loop slightly differently to simplify integrating with our code, but shouldn't make a big difference
Another thing I found in Agner's manuals: AMD Zen, even as recently as Zen 4, doesn't have as good a microarchitecture for PCLMUL, so if anyone with such a machine would like to help test the cutoff, the script is at
https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com
(needs "CREATE EXTENSION test_crc32c;" to run it)
--
John Naylor
Amazon Web Services
> I agree it would be preferable to make a centralized check work.
Here is my first stab at it. v9 is the same as v8 + a commit to move all cpuid checks into one single place, including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_have[FEATURE], defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in the configure files which caused a build failure. It's fixed in v9.
Pretty sure the ARM code paths need some correction. Let me know what you think.
> Correct me if I'm misunderstanding, but this sounds like in every frontend
> program we'd need to know what the first call was, which seems less
> maintainable than just initializing at the start of every frontend program.
No, sorry for the confusion, but that is not what I meant. Let's ignore the attribute constructor for now. We can probably revisit this at a later point.
Raghuveer
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is the same as v8 + a commit to move all cpuid checks into one single place, including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_have[FEATURE], defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in the configure files which caused a build failure. It's fixed in v9.
>
> Pretty sure the ARM code paths need some correction. Let me know what you think.
Thanks, I think this is a good direction. Some comments:
+static void pg_cpuid(int leaf, int subleaf, unsigned int* exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}
Our current configure still looks for __get_cpuid and __cpuid. We committed checking these new ones fairly recently, and they were further gated by USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. It seems like here we should do something like the following, where "+" lines are from the patch and other lines are mine:
+static void
+pg_cpucap_x86(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};
#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#endif
+ pg_cpucap[PG_CPU_FEATURE_SSE42] = (exx[2] & (1 << 20)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_PCLMUL] = (exx[2] & (1 << 1)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_POPCNT] = (exx[2] & (1 << 23)) != 0;
+ /* osxsave */
+ if ((exx[2] & (1 << 27)) == 0) {
+ return;
+ }
+ /* avx512 os support */
+ if (zmm_regs_available()) {
+ return;
+ }
// BTW, I do like the gating here that reduces the number of places that have to know about zmm and xsave. (Side note: shouldn't that be "if (!zmm_regs_available())"?)
+ /* reset for second cpuid call on leaf 7 to check extended avx512 support */
exx[4] = {0, 0, 0, 0};
#if defined(HAVE__GET_CPUID_COUNT)
__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
__cpuidex(exx, 7, 0);
#endif
+ pg_cpucap[PG_CPU_FEATURE_AVX512F] = (exx[1] & (1 << 16)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_AVX512BW] = (exx[1] & (1 << 30)) != 0;
+ pg_cpucap[PG_CPU_FEATURE_AVX512VPOPCNTDQ] = (exx[2] & (1 << 14)) != 0;
+}
What do you think?
-#define PGCPUCAP_INIT (1 << 0)
-#define PGCPUCAP_POPCNT (1 << 1)
-#define PGCPUCAP_VPOPCNT (1 << 2)
-#define PGCPUCAP_CRC32C (1 << 3)
-#define PGCPUCAP_CLMUL (1 << 4)
+enum pg_cpucap__
+{
+ PG_CPU_FEATURE_INIT = 0,
+ // X86
+ PG_CPU_FEATURE_SSE42 = 1,
+ PG_CPU_FEATURE_POPCNT = 2,
+ PG_CPU_FEATURE_PCLMUL = 3,
[...]
+ PG_CPU_FEATURE_ARMV8_CRC32C = 100,
I'm not a fan of exposing these architecture-specific details to places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
[...]
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)
I'd prefer to have one capability <-> one place to check at runtime for architectures that need that, and to keep architecture details private to the initialization step. Even for things that test which function pointer to use, I think it's a cleaner interface to look at one thing.
+static void
+pg_cpucap_arch()
+{
+ /* WIP: configure checks */
+#ifdef __x86_64__
+ pg_cpucap_x86();
+#else // ARM:
+ pg_cpucap_arm();
+#endif
+}
If we're going to have a single file for the init step, we don't need this -- we'd just have a different definition of pg_cpucap_initialize() in each part, with a default that only adds the "init" slot:
#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) || defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)
<cpuid checks>
#elif defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(_M_ARM64)
<for now only pg_crc32c_armv8_available()>
#else
<only init>
#endif
--
John Naylor
Amazon Web Services
Attached v10 with minor changes based on the feedback.
> Thanks, I think this is a good direction. Some comments:
>
> #if defined(HAVE__GET_CPUID)
> __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
> #elif defined(HAVE__CPUID)
> __cpuid(exx, 1);
> #endif
Oh yeah, good catch. Fixed in v10.
> + }
> + /* avx512 os support */
> + if (zmm_regs_available()) {
> + return;
> + }
>
> // BTW, I do like the gating here that reduces the number of places that have to
> know about zmm and xsave. (Side note: shouldn't that be "if (!zmm_regs_available())"?)
Yup, good catch again 😊
> What do you think?
Yup, those look correct to me. Fixed them in v10.
> + PG_CPU_FEATURE_PCLMUL = 3,
> [...]
> + PG_CPU_FEATURE_ARMV8_CRC32C = 100,
>
> I'm not a fan of exposing these architecture-specific details to places that consult
> the capabilities. That requires things like
>
> +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> [...]
> +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)
>
> I'd prefer to have one capability <-> one place to check at runtime for architectures
> that need that, and to keep architecture details private to the initialization step.
> Even for things that test which function pointer to use, I think it's a cleaner
> interface to look at one thing.
Isn't that one thing currently pg_cpu_have(FEATURE)? The reason we have the additional PGCPUCAP_CRC32C is because the dispatch for CRC32C is currently defined in the header file common to both ARM and SSE42. We should ideally have separate dispatch for ARM and x86 in their own files (the way it is in the main branch). This also makes it easier to add more implementations in the future without having to make the dispatch function work for both ARM and x86.
> If we're going to have a single file for the init step, we don't need this -- we'd just
> have a different definition of pg_cpucap_initialize() in each part, with a default
> that only adds the "init" slot:
>
> #if defined( __i386__ ) || defined(i386) || defined(_M_IX86) || defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)
> <cpuid checks>
> #elif defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(_M_ARM64)
> <for now only pg_crc32c_armv8_available()>
> #else
> <only init>
> #endif
Makes sense. Got rid of it in v10.
Raghuveer
On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> > I'm not a fan of exposing these architecture-specific details to places that consult
> > the capabilities. That requires things like
> >
> > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > [...]
> > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)
> >
> > I'd prefer to have one capability <-> one place to check at runtime for architectures
> > that need that, and to keep architecture details private to the initialization step.
> > Even for things that test which function pointer to use, I think it's a cleaner
> > interface to look at one thing.
>
> Isn't that one thing currently pg_cpu_have(FEATURE)?
No, I mean the one thing would be "do we have hardware CRC?", and each architecture would set (and check if needed) the same slot. I'm not 100% sure this is the better way, and there may be a disadvantage I haven't thought of, but this makes the most sense to me. I'm willing to hear reasons to do it differently, but I'm thinking those reasons need to be weighed against the loss of abstraction.
> The reason we have the additional PGCPUCAP_CRC32C is because the dispatch for CRC32C is currently defined in the header file common to both ARM and SSE42. We should ideally have separate dispatch for ARM and x86 in their own files (the way it is in the main branch).
Just so we're on the same page, the current separate files do initialization, not dispatch. I'm seeing conflicting design goals here:
1. Have multiple similar dispatch functions with the only difference (for now) being certain names (not in the patch, just discussed).
2. Put initialization routines in the same file, even though they have literally nothing in common.
> This also makes it easier to add more implementations in the future without having to make the dispatch function work for both ARM and x86.
I don't quite understand, since with 0001 it already works (at least in theory) for 3 architectures with hardware CRC, plus those without, and I don't see it changing with more architectures.
+/* Access to pg_cpucap for modules that need runtime CPUID information */
+bool pg_cpu_have(int feature_id)
+{
+    return pg_cpucap[feature_id];
+}
Putting this in a separate translation unit may have to do with the decision to turn v8's #defines into an enum? In any case, this means doing a function call to decide which function to call. That makes no sense.
The goal of v9/10 was to centralize the initialization from cpuid etc, but it also did a major re-architecture of the runtime checks at the same time. That made review more difficult.
+#if defined(__x86_64__) || defined ...
+    pg_cpucap_x86();
+#elif defined(__aarch64__) || defined ...
+    pg_cpucap_arm();
+#endif
I view this as another conflict in design goals: After putting everything in the same file, the routines still have different names and have to be guarded accordingly. I don't really see the advantage, especially since these guard symbols don't even match the ones that guard the actual initialization steps. I tried to imply that in my last review, but maybe I should have been more explicit.
I think the least painful step is to take the x86 initialization from v10, which is looking great, but
- keep separate initialization files
- don't whack around the runtime representation, at least not in the same patch
--
John Naylor
Amazon Web Services
Hi John,
Going back to your earlier comment:
> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
> > > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > Isn't that one thing currently pg_cpu_have(FEATURE)?
>
> No, I mean the one thing would be "do we have hardware CRC?", and each
> architecture would set (and check if needed) the same slot.
>
> I'm not 100% sure this is the better way, and there may be a disadvantage I
> haven't thought of, but this makes the most sense to me. I'm willing to hear
> reasons to do it differently, but I'm thinking those reasons need to be weighed
> against the loss of abstraction.
IMO, CPU SIMD (SSE4.2, AVX, etc.) features are a module of their own, separate from the capabilities/features supported in postgres (CRC32C, bitcount, etc.). Exposing architecture-specific details to the features that need to check against them is a way to make code more modular and reusable. It is reasonable to expect developers writing SIMD-specific functions to naturally understand the runtime architecture requirements for that function, which are easily accessible by just checking the value of pg_cpu_have(PG_CPU_FEATURE_..). If another feature in postgres requires SSE4.2, then the CPU initialization module doesn't need to be modified at all. They just have to worry about their feature and its CPU requirements.
> Just so we're on the same page, the current separate files do initialization, not
> dispatch. I'm seeing conflicting design goals here:
>
> 1. Have multiple similar dispatch functions with the only difference (for now)
> being certain names (not in the patch, just discussed).
> 2. Put initialization routines in the same file, even though they have literally
> nothing in common.
Sorry for not being clear. I will move the initialization routines to separate files. Just haven't gotten to it yet with v10.
> > This also makes it easier to add more implementations in the future without
> > having to make the dispatch function work for both ARM and x86.
>
> I don't quite understand, since with 0001 it already works (at least in theory) for 3
> architectures with hardware CRC, plus those without, and I don't see it changing
> with more architectures.
The reason it's working across all architectures as of now is because there is only one runtime check for the CRC32C feature across x86, ARM and LoongArch (PCLMUL dispatch happens inside the SSE42 version). If and when we add the AVX-512 implementation, won't the dispatch logic look different for x86 and ARM? Along with that, the header file pg_crc32c.h will also look a lot messier with a whole bunch of nested #ifdefs. IMO, the header file should be devoid of any architecture dispatch logic and simply contain declarations of various implementations (see https://github.com/r-devulap/postgressql/blob/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/src/include/port/pg_crc32c.h for example). The dispatch logic should be handled separately in a C file.
The reason it is a separate translation unit is because I intended pg_cpucap to be a read-only variable outside of pg_cpucap.c. If the function overhead is not preferred, then I can get rid of it.
> The goal of v9/10 was to centralize the initialization from cpuid etc, but it also did
> a major re-architecture of the runtime checks at the same time. That made review
> more difficult.
I assumed we wanted to move the runtime checks to a central place, and that needed this re-architecture.
> +#if defined(__x86_64__) || defined ...
> + pg_cpucap_x86();
> +#elif defined(__aarch64__) || defined ...
> + pg_cpucap_arm();
> +#endif
>
> I view this as another conflict in design goals: After putting everything in the same
> file, the routines still have different names and have to be guarded accordingly. I
> don't really see the advantage, especially since these guard symbols don't even
> match the ones that guard the actual initialization steps. I tried to imply that in
> my last review, but maybe I should have been more explicit.
>
> I think the least painful step is to take the x86 initialization from v10, which is
> looking great, but
> - keep separate initialization files
> - don't whack around the runtime representation, at least not in the same patch
Sure. I can fix that in v11.
Raghuveer
Hi Raghuveer,
You raised some interesting points, which deserve a thoughtful response. After sleeping on it, however, I came to the conclusion that a sweeping change in runtime checks, with either of our approaches, has downsides and unresolved questions. Perhaps we can come back to it at a later time. For this release cycle, I took a step back and tried to think of the least invasive way to solve the immediate problem, which is: How to allow existing builds with "-msse4.2" to take advantage of CLMUL while not adding overhead.
Here's what I came up with in v11:
0001: same benchmark module as before
0002: For SSE4.2 builds, arrange so that constant input uses an inlined path so that the compiler can emit unrolled loops anywhere. This is particularly important for the WAL insertion lock, so this is possibly committable on its own just for that.
0003: the PCLMUL path, only for runtime-check builds
0004: the PCLMUL path for SSE4.2 builds. This uses a function pointer for long-ish input and the same above inlined path for short input (whether constant or not). So it gets the best of both worlds.
There is also a separate issue:
On Tue, Feb 25, 2025 at 6:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> Another thing I found in Agner's manuals: AMD Zen, even as recently as
> Zen 4, doesn't have as good a microarchitecture for PCLMUL, so if anyone
> with such a machine would like to help test the cutoff
David Rowley shared some results off-list, which are: Zen 4 is very good with this algorithm even at 64 bytes input length, but Zen 2 regresses up to maybe 256 bytes. Having a large cutoff to cover all bases makes this less useful, and that was one of my reservations about AVX-512. However, with the corsix generator I found it's possible to specify AVX-512 with a single accumulator (rather than 4), which still gives a minimum input of 64 bytes, so I'll plan to put something together to demonstrate. (Older machines could use the 3-way stream algorithm as a fallback on long inputs, as I've mentioned elsewhere, assuming that's legally unencumbered.)
--
John Naylor
Amazon Web Services
On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote:
> 0002: For SSE4.2 builds, arrange so that constant input uses an
> inlined path so that the compiler can emit unrolled loops anywhere.
> This is particularly important for the WAL insertion lock, so this is
> possibly committable on its own just for that.
Nice.
> 0004: the PCLMUL path for SSE4.2 builds. This uses a function pointer
> for long-ish input and the same above inlined path for short input
> (whether constant or not). So it gets the best of both worlds.
I spent some time staring at pg_crc32c.h with all these patches applied, and IIUC it leads to the following behavior:
* For compiled-in SSE 4.2 builds, we branch based on the length. For smaller inputs, we use an inlined version of the SSE 4.2 code. For larger inputs, we call a function pointer so that we can potentially use the PCLMUL version. This could potentially lead to a small regression for machines with SSE 4.2 but not PCLMUL, but that may be uncommon enough at this point not to worry about.
* For runtime-check SSE 4.2 builds, we choose slicing-by-8, SSE 4.2, or SSE 4.2 with PCLMUL, and we always use a function pointer.
The main question I have is whether we can simplify this by always using a runtime check and by inlining slicing-by-8 for small inputs. That would be dependent on the performance of slicing-by-8 and SSE 4.2 being comparable for small inputs.
Overall, I wish we could avoid splitting things into separate files and adding more header file gymnastics, but maybe there isn't much better we can do without overhauling the CPU feature detection code.
--
nathan
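In other words, the compiled-in SSE 4.2 shape is roughly the following (a sketch with illustrative names and cutoff, not the actual patch; pg_comp_crc32c_sse42 is the existing PostgreSQL function, while pg_comp_crc32c_long is a hypothetical pointer set at startup):

#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc_fn) (uint32_t crc, const void *data, size_t len);

extern uint32_t pg_comp_crc32c_sse42(uint32_t crc, const void *data, size_t len);

/* chosen once at startup: SSE 4.2, or the PCLMUL version if available */
extern crc_fn pg_comp_crc32c_long;

static inline uint32_t
comp_crc32c(uint32_t crc, const void *data, size_t len)
{
    if (len < 128)              /* short input: direct, inlinable call */
        return pg_comp_crc32c_sse42(crc, data, len);
    return pg_comp_crc32c_long(crc, data, len); /* indirect call */
}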
Hi John,
> You raised some interesting points, which deserve a thoughtful response. After
> sleeping on it, however I came to the conclusion that a sweeping change in
> runtime checks, with either of our approaches, has downsides and unresolved
> questions. Perhaps we can come back to it at a later time.
Sounds good to me. I did get derailed into something beyond the scope of this patch. I will make a separate proposal once this is merged.
> Here's what I came up with in v11:
Some feedback on v11:
if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */
{
    pg_comp_crc32c = pg_comp_crc32c_sse42;
#ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
    if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */
        pg_comp_crc32c = pg_comp_crc32c_pclmul;
#endif
}
#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
else
    pg_comp_crc32c = pg_comp_crc32c_sb8;
#endif
Is the #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK in the right place? Shouldn't it guard the SSE4.2 function pointer assignment?
/* WIP: configure checks */
#ifdef __x86_64__
#define USE_PCLMUL_WITH_RUNTIME_CHECK
#endif
Minor consideration: gcc 4.3 (released in 2008) is the only compiler that supports -msse4.2 but not -mpclmul; gcc >= 4.4 supports both. If you are okay causing a regression on gcc 4.3, we could combine USE_PCLMUL_WITH_RUNTIME_CHECK with USE_SSE42_CRC32C_WITH_RUNTIME_CHECK into a single macro to reduce the number of #ifdef's in the codebase and simplify the configure/meson compiler checks.
> 0001: same benchmark module as before
> 0002: For SSE4.2 builds, arrange so that constant input uses an inlined path so
> that the compiler can emit unrolled loops anywhere.
When building with meson, it looks like we build with -O2, and that is not enough for the compiler to unroll the SSE42 CRC32C loop. It requires -O3, or -O2 with -funroll-loops (see https://gcc.godbolt.org/z/4Eaq981aT). Perhaps we should check the disassembly to see if the unroll is really happening on constant input?
Also, the reason you have pg_comp_crc32c_sse42_inline defined separately in a header file is because you want to (a) inline the function and (b) unroll for constant inputs. Couldn't both of these be achieved by adding __attribute__((always_inline)) on the pg_comp_crc32c_sse42 function with the added -funroll-loops compiler flag?
Raghuveer
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I spent some time staring at pg_crc32c.h with all these patches applied, and
> IIUC it leads to the following behavior:
>
> * For compiled-in SSE 4.2 builds, we branch based on the length. For
> smaller inputs, we are using an inlined version of the SSE 4.2 code.
> For larger inputs, we call a function pointer so that we can potentially
> use the PCLMUL version.
Right. For WAL, my hope is that the inlined path would balance out the path with the function pointer, particularly since the computation for the 20-byte header would be both inlined and unrolled, as I see here in XLogInsertRecord():
crc32 rax,QWORD PTR [rsi]
crc32 rax,rbx ; <- newly calculated xl_prev
crc32 eax,DWORD PTR [rsi+0x10]
> This could potentially lead to a small
> regression for machines with SSE 4.2 but not PCLMUL, but that may be
> uncommon enough at this point not to worry about.
Note also upthread I mentioned we may have to go to 512-bit pclmul, since Zen 2 regresses on 128-bit. :-(
I actually haven't seen any measurable difference with direct calls versus indirect, but it could very well be that the microbenchmark is hiding that since it's doing something unnatural by calling things a bunch of times in a loop. I want to try changing the benchmark to base the address it's computing on some bits from the crc from the last loop iteration. I think that would make it more latency-sensitive. We could also make it do an additional constant 20-byte input every time to make it resemble WAL more closely.
> * For runtime-check SSE 4.2 builds, we choose slicing-by-8, SSE 4.2, or
> SSE 4.2 with PCLMUL, and we always use a function pointer.
>
> The main question I have is whether we can simplify this by always using a
> runtime check and by inlining slicing-by-8 for small inputs. That would be
> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
> for small inputs.
Slicing-by-8 needs one lookup and one XOR per byte of input, and other overheads, so I think it would still be very slow.
> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.
Yeah, it seems all ideas so far have something unattractive about them. :-(
--
John Naylor
Amazon Web Services
On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> Some feedback on v11:
>
> if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */
> {
>     pg_comp_crc32c = pg_comp_crc32c_sse42;
> #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
>     if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */
>         pg_comp_crc32c = pg_comp_crc32c_pclmul;
> #endif
> }
> #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
> else
>     pg_comp_crc32c = pg_comp_crc32c_sb8;
> #endif
>
> Is the #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK in the right place? Shouldn't it guard the SSE4.2 function pointer assignment?
Without this guard, SSE builds will complain at link time about an undefined reference to pg_comp_crc32c_sb8. We could instead just build that file everywhere for simplicity, but it'll just take up space and never get called. (Maybe that's okay because with the runtime-branching approach we experimented with earlier, it would always have to be built.)
> When building with meson, it looks like we build with -O2, and that is not enough for the compiler to unroll the SSE42 CRC32C loop. It requires -O3, or -O2 with -funroll-loops (see https://gcc.godbolt.org/z/4Eaq981aT). Perhaps we should check the disassembly to see if the unroll is really happening on constant input?
That example uses 32 bytes -- fiddling with it a bit, for 31 or less it gets unrolled. The case we care about most is currently 20 bytes. We don't want to force unrolling loops in the general case -- that's normally used for longer input and this is for short input.
> Also, the reason you have pg_comp_crc32c_sse42_inline defined separately in a header file is because you want to (a) inline the function and (b) unroll for constant inputs. Couldn't both of these be achieved by adding __attribute__((always_inline)) on the pg_comp_crc32c_sse42 function with the added -funroll-loops compiler flag?
And (c) prevent it from being inlined in builds that need a runtime check. I briefly tried the attribute approach and it doesn't work for me. If you can get it to work, go ahead and share how that's done, but keep in mind that we're not gcc/clang only -- it also has to work for MSVC's "__forceinline"...
--
John Naylor
Amazon Web Services
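The constant-input gating under discussion can be sketched like this (illustrative names and threshold; the two extern helpers are hypothetical stand-ins, not the patch's functions):

#include <stddef.h>
#include <stdint.h>

extern uint32_t comp_crc32c_runtime(uint32_t crc, const void *data, size_t len);
extern uint32_t comp_crc32c_sse42_inline(uint32_t crc, const void *data, size_t len);

static inline uint32_t
comp_crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
#if defined(__GNUC__) || defined(__clang__)
    /* short constant input: let the compiler fully unroll the loop */
    if (__builtin_constant_p(len) && len < 32)
        return comp_crc32c_sse42_inline(crc, data, len);
#endif
    return comp_crc32c_runtime(crc, data, len);
}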
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
>
> Note also upthread I mentioned we may have to go to 512-bit pclmul,
> since Zen 2 regresses on 128-bit. :-(
Ah, okay. You mean the AVX-512 version [0]? And are you thinking we'd use the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the SSE4.2 version for small inputs and use a function pointer for larger ones?
> I actually haven't seen any measurable difference with direct calls
> versus indirect, but it could very well be that the microbenchmark is
> hiding that since it's doing something unnatural by calling things a
> bunch of times in a loop. I want to try changing the benchmark to base
> the address it's computing on some bits from the crc from the last
> loop iteration. I think that would make it more latency-sensitive. We
> could also make it do an additional constant 20-byte input every time
> to make it resemble WAL more closely.
Looking back on some old benchmarks for small-ish inputs [1], the difference does seem within the noise range. I suppose these functions might be expensive enough to make the function pointer overhead negligible. IME there's a big difference when a function pointer is used for an instruction or two [2], but even relatively small inputs to the CRC-32C functions might require several instructions.
>> The main question I have is whether we can simplify this by always using a
>> runtime check and by inlining slicing-by-8 for small inputs. That would be
>> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
>> for small inputs.
>
> Slicing-by-8 needs one lookup and one XOR per byte of input, and other
> overheads, so I think it would still be very slow.
That's my suspicion, too.
[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com
[1] https://postgr.es/m/20231031033601.GA68409%40nathanxps13
[2] https://postgr.es/m/CAApHDvqyMNGVgwpaOPtENdq5uEMR%3DvSkRJEgG1S%2BX7Vtk1-EnA%40mail.gmail.com
--
nathan
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
> >> worry about.
> >
> > Note also upthread I mentioned we may have to go to 512-bit pclmul,
> > since Zen 2 regresses on 128-bit. :-(
>
> Ah, okay. You mean the AVX-512 version [0]?

Right, except not that version, rather a more efficient way and with
only one accumulator, so still a minimum length of 64 bytes. I'll
share that once we have agreement on detection/dispatch.

> And are you thinking we'd use
> the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
> SSE4.2 version for small inputs and use a function pointer for larger ones?

Yes. Although, we may not even have to inline for non-constant input,
see below. Inlining loops does take binary space.

> Looking back on some old benchmarks for small-ish inputs [1], the
> difference does seem within the noise range. I suppose these functions
> might be expensive enough to make the function pointer overhead negligible.
> IME there's a big difference when a function pointer is used for an
> instruction or two [2], but even relatively small inputs to the CRC-32C
> functions might require several instructions.

That was my hunch too, but I wanted to be more sure, so I modified the
benchmark so it doesn't know the address of the next calculation until
it finishes the last calculation, so we can hopefully see the latency
caused by indirection. It also does an additional calculation on
constant 20 bytes, like the WAL header. I also tweaked the length each
iteration so the branch predictor maybe has a harder time predicting
the constant 20 input.
And to make it more challenging, I removed the part that inlined all
small inputs, so it inlines only constant inputs:

0001+0002 (test only), func pointer:

32
latency average = 24.021 ms
latency average = 24.020 ms
latency average = 23.733 ms
40
latency average = 25.018 ms
latency average = 24.253 ms
latency average = 24.278 ms
48
latency average = 25.437 ms
latency average = 24.817 ms
latency average = 24.793 ms

SSE4.2 build (direct func):

32
latency average = 22.422 ms
latency average = 22.387 ms
latency average = 22.391 ms
40
latency average = 23.444 ms
latency average = 22.887 ms
latency average = 22.988 ms
48
latency average = 23.432 ms
latency average = 23.380 ms
latency average = 23.384 ms

0001-0006, SSE 4.2 build (inlined constant / otherwise func pointer):

32
latency average = 22.135 ms
latency average = 21.874 ms
latency average = 21.910 ms
40
latency average = 22.916 ms
latency average = 23.086 ms
latency average = 22.422 ms
48
latency average = 23.255 ms
latency average = 22.780 ms
latency average = 22.804 ms

These are still a bit noisy, and close, but it seems there is no
penalty in using the function pointer as long as the header
calculation is inlined.

--
John Naylor
Amazon Web Services
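To illustrate the kind of change described above, a data-dependent
benchmark loop might look something like the following. This is a
sketch under stated assumptions only -- crc_bench, base_len, and the
offset arithmetic are made up for illustration; the real test module is
in the attached patches. It assumes PostgreSQL's pg_comp_crc32c
function pointer and pg_crc32c type:

static uint64
crc_bench(const char *buf, size_t buf_len, int niters, size_t base_len)
{
	pg_crc32c	crc = 0;
	size_t		off = 0;

	for (int i = 0; i < niters; i++)
	{
		/* constant 20-byte input, like a WAL record header */
		crc = pg_comp_crc32c(crc, buf + off, 20);
		/* variable-length input; low bits of crc perturb the length */
		crc = pg_comp_crc32c(crc, buf + off, base_len + (crc & 7));
		/* next address is unknown until this iteration's CRC finishes */
		off = crc % (buf_len - base_len - 8);
	}
	return crc;
}

Because the next offset depends on the previous result, the indirect
call's latency sits on the critical path instead of being hidden by
out-of-order execution.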
On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> That was my hunch too, but I wanted to be more sure, so I modified the
> benchmark so it doesn't know the address of the next calculation until
> it finishes the last calculation, so we can hopefully see the latency
> caused by indirection. It also does an additional calculation on
> constant 20 bytes, like the WAL header. I also tweaked the length each
> iteration so the branch predictor maybe has a harder time predicting
> the constant 20 input. And to make it more challenging, I removed the
> part that inlined all small inputs, so it inlines only constant
> inputs:

Would you mind sharing this test? It sounds like you are running a
workload with a mix of constant/inlined calls and function pointer
calls to simulate typical usage for WAL, but I'm not 100% sure I'm
understanding you correctly.

> These are still a bit noisy, and close, but it seems there is no
> penalty in using the function pointer as long as the header
> calculation is inlined.

These results look promising.

--
nathan
On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> > That was my hunch too, but I wanted to be more sure, so I modified the
> > benchmark so it doesn't know the address of the next calculation until
> > it finishes the last calculation, so we can hopefully see the latency
> > caused by indirection. It also does an additional calculation on
> > constant 20 bytes, like the WAL header. I also tweaked the length each
> > iteration so the branch predictor maybe has a harder time predicting
> > the constant 20 input. And to make it more challenging, I removed the
> > part that inlined all small inputs, so it inlines only constant
> > inputs:
>
> Would you mind sharing this test?

The test script is the same as here, except I only ran small lengths:

https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com

...but I must have forgotten to attach the slightly tweaked patch set,
which I've done now. 0002 modifies the 0001 test module, and 0006
reverts inlining non-constant input from 0005, just to see if I could
find a regression from indirection, which I didn't. If we don't need
it, it'd be better to avoid inlining loops to keep from bloating the
binary.

> It sounds like you are running a
> workload with a mix of constant/inlined calls and function pointer calls to
> simulate typical usage for WAL, but I'm not 100% sure I'm understanding you
> correctly.

Exactly.

--
John Naylor
Amazon Web Services
Attachment
- v12-0006-Only-inline-for-constant-input-partial-revert.patch
- v12-0004-Improve-CRC32C-performance-on-x86_64.patch
- v12-0003-Inline-CRC-computation-for-small-fixed-length-in.patch
- v12-0005-Use-runtime-check-even-when-we-have-SSE-4.2-at-c.patch
- v12-0002-Attempt-to-make-benchmark-more-sensitive-to-late.patch
- v12-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmark.patch
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.

I wanted to make an attempt to make this aspect nicer. v13-0002
incorporates deliberately compact and simple loops for inlined
constant input into the dispatch function, and leaves the existing
code alone. This avoids code churn and saves vertical space in the
copied code. It needs a bit more commentary, but I hope this is a more
digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
be simpler if we can always assume non-constant input can go through a
function pointer.

I've re-attached the modified perf test from v12 just in case anyone
wants to play with it (v13-0003), but named so that the CF bot can't
find it, since it breaks the tests in the original perf test. (It's
not for commit anyway.)

Adding back AVX-512 should be fairly mechanical, since Raghuveer and
Nathan have already done the work needed for that.

--
John Naylor
Amazon Web Services
Attachment
On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Overall, I wish we could avoid splitting things into separate files and
>> adding more header file gymnastics, but maybe there isn't much better we
>> can do without overhauling the CPU feature detection code.
>
> I wanted to make an attempt to make this aspect nicer. v13-0002
> incorporates deliberately compact and simple loops for inlined
> constant input into the dispatch function, and leaves the existing
> code alone. This avoids code churn and saves vertical space in the
> copied code. It needs a bit more commentary, but I hope this is a more
> digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
> be simpler if we can always assume non-constant input can go through a
> function pointer.

That is certainly more readable. FWIW I think it would be entirely
reasonable to replace the pg_crc32c_sse42.c implementation with a call
to this new pg_comp_crc32c_dispatch() function. Of course, you'd have
to split things up like:

pg_attribute_no_sanitize_alignment()
static inline pg_crc32c
pg_comp_crc32c_sse42_inline(pg_crc32c crc, const void *data, size_t len)
{
	const unsigned char *p = data;

#ifdef __x86_64__
	for (; len >= 8; p += 8, len -= 8)
		crc = _mm_crc32_u64(crc, *(const uint64 *) p);
#endif
	for (; len >= 4; p += 4, len -= 4)
		crc = _mm_crc32_u32(crc, *(const uint32 *) p);
	for (; len > 0; len--)
		crc = _mm_crc32_u8(crc, *p++);

	return crc;
}

pg_attribute_no_sanitize_alignment()
static inline pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
	if (__builtin_constant_p(len))
		return pg_comp_crc32c_sse42_inline(crc, data, len);

	return pg_comp_crc32c_sse42(crc, data, len);
}

And then in pg_crc32c_sse42.c:

pg_attribute_no_sanitize_alignment()
pg_attribute_target("sse4.2")
pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
{
	return pg_comp_crc32c_sse42_inline(crc, data, len);
}

Granted, that adds back some of the complexity you were trying to avoid
(and is very similar to your v12 patches), but it's pretty compact and
avoids some code duplication. FTR I don't feel too strongly about this,
but on balance I guess I'd be okay with a tad more complexity than your
v13 patches if it served a useful purpose.

--
nathan
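As a note on how the dispatch above behaves at call sites:
__builtin_constant_p(len) is resolved at compile time, so the two paths
are chosen statically. These call sites are hypothetical, purely for
illustration:

	/* hypothetical call sites for pg_comp_crc32c_dispatch() above */
	crc = pg_comp_crc32c_dispatch(crc, &hdr, 20);		/* constant len: inlined loops, unrollable */
	crc = pg_comp_crc32c_dispatch(crc, buf, nbytes);	/* variable len: calls pg_comp_crc32c_sse42() */

The first call compiles down to a short straight-line sequence of
crc32 instructions; the second pays the (apparently negligible) cost of
a real call.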
On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> > I wanted to make an attempt to make this aspect nicer. v13-0002
> > incorporates deliberately compact and simple loops for inlined
> > constant input into the dispatch function, and leaves the existing
> > code alone. This avoids code churn and saves vertical space in the
> > copied code. It needs a bit more commentary, but I hope this is a more
> > digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
> > be simpler if we can always assume non-constant input can go through a
> > function pointer.
>
> That is certainly more readable. FWIW I think it would be entirely
> reasonable to replace the pg_crc32c_sse42.c implementation with a call
> to this new pg_comp_crc32c_dispatch() function. Of course, you'd have
> to split things up like:
> [snip]

That could work as well. I'm thinking if we do PMULL on Arm, it might
be advantageous to keep the inline path and function paths with
distinct coding -- because of the pickier alignment on that platform,
it might not be worth pre-aligning the pointer to 8 bytes for a
20-byte constant input.

I've gone ahead and added the generated AVX-512 algorithm in v14-0005,
and added the build support and some of the runtime support from Paul
and Raghuveer's earlier patches in 0006-7. It passes CI, but I'll have
to arrange access to other hardware to verify the runtime behavior. I
think the Meson support is most of the way there, but it looks like
configure.ac got whacked around cosmetically quite a bit. If we feel
it's time to refactor things there, we'll want to split that out. In
any case, for autoconf I've pretty much kept the earlier work for now.

--
John Naylor
Amazon Web Services
Attachment
- v14-0004-Always-do-runtime-check-for-x86-to-simplify-PCLM.patch
- v14-0008-Temp-fixup-build-of-benchmark-on-Windows.patch
- v14-0007-AVX-512-CRC-autoconf.patch
- v14-0005-Add-runtime-support-for-AVX-512-CRC.patch
- v14-0006-AVX-512-CRC-Meson.patch
- v14-0002-Inline-CRC-computation-for-fixed-length-input.patch
- v14-0003-Improve-CRC32C-performance-on-x86_64.patch
- v14-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmark.patch
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005

+ pg_attribute_target("avx512vl,vpclmulqdq")
+ pg_crc32c
+ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)

I'm a little confused. Why is v14 missing the earlier version of the
pclmul implementation that works without AVX-512?

Raghuveer
On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > I've gone ahead and added the generated AVX-512 algorithm in v14-0005
>
> I'm a little confused. Why is v14 missing the earlier version of the
> pclmul implementation that works without AVX-512?

As I mentioned upthread, the 128-bit implementation regresses on Zen 2
up to at least 256 bytes.

--
John Naylor
Amazon Web Services
> > I've gone ahead and added the generated AVX-512 algorithm in
> > v14-0005

Here are my benchmark numbers (master vs. v14) on TGL (time in ms):

Buf-size-bytes   Master      v14
64                9.547    6.095
80               10.999    6.248
96               12.443    7.756
112              14.129    9.620
128              15.295    6.534
144              16.756    8.226
160              18.180    9.656
176              19.836   11.556
192              22.038    8.307
208              24.303    9.692
224              26.656   11.115
240              28.981   13.119
256              31.399   10.035

Raghuveer
On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > > I've gone ahead and added the generated AVX-512 algorithm in
> > > v14-0005
>
> Here are my benchmark numbers (master vs. v14) on TGL (time in ms):
>
> Buf-size-bytes   Master      v14
> 64                9.547    6.095
...
> 256              31.399   10.035

Thanks for testing! Looks good. I'll take a look at the configure
checks soon, since I had some questions there.

--
John Naylor
Amazon Web Services
On Mon, Mar 24, 2025 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
> I'll take a look at the configure
> checks soon, since I had some questions there.

I'm leaning towards a length limit for v15-0001 so that inlined
instructions are likely to be unrolled. Aside from the lack of a
commit message, I think that one is ready for commit soon-ish.

I'm feeling pretty good about 0002, but since there is still room for
cosmetic fiddling, I want to let it sit for a bit longer.

I felt the previous proposals for configure.ac were unnecessarily
invasive, and the message looked out of place, so I made configure.ac
more similar to master, using the AVX popcount stuff as a model. I
also went the extra step and added a separate AC_MSG_CHECKING for
vectorized CRC. I'm not sure we really need that, but this algorithm
is trivially adaptable to Arm, so it might be welcome for visibility.
For Meson, I just made the CRC checking comment a bit more general,
since keeping up this level of detail would result in a loss of
readability.

0003 is just to demonstrate on CI that we are in fact computing the
same answer as master. An earlier patch had some additional tests in
strings.sql, but I have yet to dig those out.

--
John Naylor
Amazon Web Services
Attachment
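For illustration, the length limit mentioned above could take a shape
like the following. This is a sketch only -- CRC_INLINE_THRESHOLD and
pg_comp_crc32c_inline are made-up names, and the actual cutoff is
whatever the committed 0001 uses:

	/* hypothetical: take the inlined path only for short, constant input */
	if (__builtin_constant_p(len) && len <= CRC_INLINE_THRESHOLD)
		return pg_comp_crc32c_inline(crc, data, len);

	/* everything else goes through the function pointer */
	return pg_comp_crc32c(crc, data, len);

The point of the limit is that a short constant trip count is the case
compilers reliably unroll; a large constant length would just inline a
big loop for no benefit.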
On Mon, Mar 24, 2025 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I'll take a look at the configure
> checks soon, since I had some questions there.

One other thing I forgot to mention: the previous test function had
local constants that the compiler was able to fold, resulting in no
actual vector instructions being emitted:

        movabs  rdx, 12884901891
        xor     eax, eax
        crc32   rax, rdx
        crc32   rax, rdx
        ret

That may be okay for practical purposes, but in the spirit of commit
fdb5dd6331e30 I changed it in v15 to use global variables and made
sure it emits what the function attributes are intended for:

        vmovdqu64       zmm3, ZMMWORD PTR x[rip]
        xor     eax, eax
        vpclmulqdq      zmm0, zmm3, ZMMWORD PTR y[rip], 0
        vextracti32x4   xmm2, zmm0, 1
        vmovdqa64       xmm1, xmm0
        vmovdqu64       ZMMWORD PTR y[rip], zmm0
        vextracti32x4   xmm0, zmm0, 2
        vpternlogq      xmm1, xmm2, xmm0, 150
        vmovq   rdx, xmm1
        crc32   rax, rdx
        vzeroupper
        ret

--
John Naylor
Amazon Web Services
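Reading the second listing back, the C test function presumably looks
something like this. This is a reconstruction from the assembly above,
not the actual patch -- the function and variable names are made up,
and x and y mirror the x[rip]/y[rip] globals (PostgreSQL would spell
the attribute pg_attribute_target):

#include <immintrin.h>
#include <stdint.h>

/* globals defeat constant folding, unlike the local-constant version */
__m512i x, y;

__attribute__((target("sse4.2,avx512vl,vpclmulqdq")))
uint32_t
test_vpclmul_crc(void)
{
	__m512i		v = _mm512_clmulepi64_epi128(x, y, 0);
	__m128i		r;

	/* store the wide result to a global so it can't be elided */
	y = v;

	/* vpternlogq with imm 0x96 (150) is a three-way XOR of 128-bit lanes */
	r = _mm_ternarylogic_epi64(_mm512_castsi512_si128(v),
							   _mm512_extracti32x4_epi32(v, 1),
							   _mm512_extracti32x4_epi32(v, 2),
							   0x96);

	return (uint32_t) _mm_crc32_u64(0, (uint64_t) _mm_cvtsi128_si64(r));
}

Each intrinsic here maps directly to one of the emitted instructions,
so a configure check built on this shape actually proves the compiler
can generate vpclmulqdq rather than folding everything away.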
Hello John,

v15 LGTM. A couple of minor comments:

> I'm leaning towards a length limit for v15-0001 so that inlined
> instructions are likely to be unrolled. Aside from the lack of a
> commit message, I think that one is ready for commit soon-ish.

+1

> I'm feeling pretty good about 0002, but since there is still room for
> cosmetic fiddling, I want to let it sit for a bit longer.

(1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
the one in pg_popcount_avx512.c, but perhaps that's fine for now. I
will propose a consolidated SIMD runtime check in a follow-up patch.

(2) It might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c
since they contain both the sse42 and avx512 versions.

> I felt the previous proposals for configure.ac were unnecessarily
> invasive, and the message looked out of place, so I made configure.ac
> more similar to master, using the AVX popcount stuff as a model.

Looks good to me.

Raghuveer
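For anyone following along, the duplicated runtime check mentioned in
(1) is a small XGETBV probe. The popcount version has roughly this
shape -- a sketch from memory, assuming the HAVE_XSAVE_INTRINSICS
configure symbol used elsewhere in src/port:

pg_attribute_target("xsave")
static bool
zmm_regs_available(void)
{
#ifdef HAVE_XSAVE_INTRINSICS
	/*
	 * XCR0 bits 1, 2, 5, 6, and 7 (mask 0xe6) indicate the OS saves
	 * SSE, AVX, and all ZMM register state across context switches.
	 */
	return (_xgetbv(0) & 0xe6) == 0xe6;
#else
	return false;
#endif
}

CPUID alone only says the hardware has the instructions; this check is
what confirms the OS will actually preserve the 512-bit registers, so
both choose-functions need it.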
On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> Hello John,
>
> v15 LGTM. A couple of minor comments:
>
> > I'm leaning towards a length limit for v15-0001 so that inlined
> > instructions are likely to be unrolled. Aside from the lack of a
> > commit message, I think that one is ready for commit soon-ish.
>
> +1

Thanks for looking! This has been committed.

> > I'm feeling pretty good about 0002, but since there is still room for
> > cosmetic fiddling, I want to let it sit for a bit longer.
>
> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
> the one in pg_popcount_avx512.c, but perhaps that's fine for now. I
> will propose a consolidated SIMD runtime check in a follow-up patch.

Yeah, I was thinking a small amount of duplication is tolerable. As
far as a possible better abstraction for the next cycle, I went
looking and found this, which has some features we've talked about:

https://zolk3ri.name/cgit/cpudetect/tree/cpudetect.h

> (2) It might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c
> since they contain both the sse42 and avx512 versions.

The name is now not quite accurate, but it's not exactly misleading
either. I'm leaning towards keeping it the same, so for now I've just
updated the header comment.

For v16, I made another pass through and made some more mostly
superficial adjustments:

- copied the rewritten Meson comment to configure.ac
- added some more #include guards out of paranoia
- added tests with longer lengths that exercise the new paths
- adjusted configure / meson messages for consistency
- changed not-quite-accurate wording about "AVX-512 CRC instructions"
- used "PCLMUL" only when talking about specific intrinsics and
  preferred "AVX-512" elsewhere, to head off potential future confusion
  with Arm PMULL.

--
John Naylor
Amazon Web Services
Attachment
On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
>> the one in pg_popcount_avx512.c, but perhaps that's fine for now. I
>> will propose a consolidated SIMD runtime check in a follow-up patch.
>
> Yeah, I was thinking a small amount of duplication is tolerable.

+1. This is easy enough to change in the future if/when we add some
sort of centralized CPU feature detection.

>> (2) It might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c
>> since they contain both the sse42 and avx512 versions.
>
> The name is now not quite accurate, but it's not exactly misleading
> either. I'm leaning towards keeping it the same, so for now I've just
> updated the header comment.

I'm not too worried about this one either. FWIW I'm likely going to
look into moving all the x86_64 popcount stuff into
pg_popcount_avx512.c and renaming it to pg_popcount_x86_64.c for v19.
This would parallel pg_popcount_aarch64.c a bit better, and a file per
architecture seems like a logical way to neatly organize things.

> For v16, I made another pass through and made some more mostly
> superficial adjustments:
> - copied the rewritten Meson comment to configure.ac
> - added some more #include guards out of paranoia
> - added tests with longer lengths that exercise the new paths
> - adjusted configure / meson messages for consistency
> - changed not-quite-accurate wording about "AVX-512 CRC instructions"
> - used "PCLMUL" only when talking about specific intrinsics and
>   preferred "AVX-512" elsewhere, to head off potential future confusion
>   with Arm PMULL.

I read through the code a couple of times and nothing stood out to me.

--
nathan
On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
> >> (2) It might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c
> >> since they contain both the sse42 and avx512 versions.
> >
> > The name is now not quite accurate, but it's not exactly misleading
> > either. I'm leaning towards keeping it the same, so for now I've just
> > updated the header comment.
>
> I'm not too worried about this one either. FWIW I'm likely going to
> look into moving all the x86_64 popcount stuff into
> pg_popcount_avx512.c and renaming it to pg_popcount_x86_64.c for v19.
> This would parallel pg_popcount_aarch64.c a bit better, and a file per
> architecture seems like a logical way to neatly organize things.

Seems like a good idea.

> I read through the code a couple of times and nothing stood out to me.

Thanks for looking, I plan to commit this over the weekend unless
there are objections.

--
John Naylor
Amazon Web Services
> Thanks for looking, I plan to commit this over the weekend unless
> there are objections.

LGTM.

Raghuveer
On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote:
> Thanks for looking, I plan to commit this over the weekend unless
> there are objections.

I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK,
but everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with
the "32C" included). I tested the v16 patches (with the macro fixed
and assertions enabled) on a machine with AVX-512 (verified with some
extra debug logging), and everything passed.

--
nathan