Thread: RE: Improve CRC32C performance on SSE4.2
Hi John,
> I'm highly suspicious of these numbers because they show this version
> is about 20x faster than "scalar", so relatively speaking 3x faster
> than the AVX-512 proposal?
Apologies for this. I was suspicious of this too, and it looks like I had unintentionally benchmarked against the scalar version I wrote for testing CRC32C correctness (which computes the CRC one byte at a time). The code for the microbenchmarks is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:
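For readers following along, a byte-at-a-time scalar CRC32C of the kind described above can be sketched like this (a minimal illustration of the reference technique, not the exact code in the linked repo):

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78),
 * one byte at a time -- the slow reference used for correctness checks. */
static uint32_t
crc32c_byte(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
```

With the usual init/final XOR convention, crc32c_byte(0xFFFFFFFF, "123456789", 9) ^ 0xFFFFFFFF produces the standard CRC-32C check value 0xE3069283.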
Comparing scalar_crc32c to sse42_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.0972 -0.0971 5 4 5 4
[scalar_crc32c vs. sse42_crc32c]/128 -0.3048 -0.3048 8 6 8 6
[scalar_crc32c vs. sse42_crc32c]/256 -0.4610 -0.4610 19 10 19 10
[scalar_crc32c vs. sse42_crc32c]/512 -0.6432 -0.6432 50 18 50 18
[scalar_crc32c vs. sse42_crc32c]/1024 -0.7192 -0.7192 121 34 121 34
[scalar_crc32c vs. sse42_crc32c]/2048 -0.7275 -0.7276 259 70 259 70
> Luckily, that's easily fixable: It turns out the implementation
> in the paper (and chromium) has a very inefficient finalization step,
> using more carryless multiplications and plenty of other operations.
> After the main loop, and at the end, it's much more efficient to
> convert the 128-bit intermediate result directly into a CRC in the
> usual way.
Thank you for pointing this out and also fixing it! This improves over the chromium version by 10-25%, especially for smaller buffer sizes (64 – 512 bytes):
Comparing sse42_crc32c to corsix_crc32c (from ./bench)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[sse42_crc32c vs. corsix_crc32c]/64 -0.2696 -0.2696 4 3 4 3
[sse42_crc32c vs. corsix_crc32c]/128 -0.1551 -0.1552 6 5 6 5
[sse42_crc32c vs. corsix_crc32c]/256 -0.1787 -0.1787 10 8 10 8
[sse42_crc32c vs. corsix_crc32c]/512 -0.1351 -0.1351 18 15 18 15
[sse42_crc32c vs. corsix_crc32c]/1024 -0.0972 -0.0972 34 31 34 31
[sse42_crc32c vs. corsix_crc32c]/2048 -0.0763 -0.0763 69 64 69 64
OVERALL_GEOMEAN -0.1544 -0.1544 0 0 0 0
> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.
I ran the same drive_crc32c benchmark with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes. I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, and the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you confirm that behavior on your end too?
---
src/port/pg_crc32c_sse42.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c
index a8c1e5609b..a350b1b93a 100644
--- a/src/port/pg_crc32c_sse42.c
+++ b/src/port/pg_crc32c_sse42.c
@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)
#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))
#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))
pg_attribute_no_sanitize_alignment()
+__attribute__((optimize("-O3")))
pg_attribute_target("sse4.2,pclmul")
pg_crc32c
pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)
--
You could also just build with export CFLAGS="-O3" instead of adding the function attribute.
> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
May I ask which processor your older machine has? I am benchmarking on a Tigerlake processor.
> It's probably okay to fold these together in the same compile-time
> check, since both are fairly old by now, but for those following
> along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> cause machines building on Nehalem (2008) to fail the check and go
> back to slicing-by-8 with it written this way.
Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul; it would then fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
Raghuveer
From: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Subject: Improve CRC32C performance on SSE4.2
This patch improves the performance of the SSE4.2 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in the paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the sse42 version of the crc32 computation from Chromium's copy of zlib, with modified constants for crc32c computation. See:
https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
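For context, the existing 8-bytes-at-a-time approach that the patch speeds up for large buffers looks roughly like this (an illustrative sketch built on the SSE4.2 intrinsics, not PostgreSQL's actual pg_comp_crc32c_sse42):

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Native crc32 instruction, consuming 8 bytes per loop iteration,
 * with a byte-at-a-time tail. Requires SSE4.2 at runtime. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_sse42_u64(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t c = crc;

    while (len >= 8)
    {
        uint64_t chunk;

        memcpy(&chunk, p, 8);
        c = _mm_crc32_u64(c, chunk);
        p += 8;
        len -= 8;
    }
    crc = (uint32_t) c;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
```

The pclmulqdq variant instead folds four 16-byte blocks per iteration with carryless multiplies, which is where the 64-bytes-at-a-time throughput comes from.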
Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):
Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
[scalar_crc32c vs. sse42_crc32c]/64 -0.8147 -0.8148 33 6 33 6
[scalar_crc32c vs. sse42_crc32c]/128 -0.8962 -0.8962 88 9 88 9
[scalar_crc32c vs. sse42_crc32c]/256 -0.9200 -0.9200 211 17 211 17
[scalar_crc32c vs. sse42_crc32c]/512 -0.9389 -0.9389 486 30 486 30
[scalar_crc32c vs. sse42_crc32c]/1024 -0.9452 -0.9452 1037 57 1037 57
[scalar_crc32c vs. sse42_crc32c]/2048 -0.9456 -0.9456 2140 116 2140 116
Raghuveer
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:

> I ran the same drive_crc32c benchmark with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes.

That matches my findings as well.

> I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, and the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you confirm that behavior on your end too?

On my machine that still regresses compared to master in that range (although by not as much), so I still think 128 bytes is the right threshold. The effect of -O3 with gcc 14.2 is that the single-block loop (after the 4-block loop) is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that even for systems that build with -O3. I tried leaving out the single-block loop, so that the tail is called for the remaining <63 bytes, and it's actually better:

v2:
128 latency average = 10.256 ms
144 latency average = 11.393 ms
160 latency average = 12.977 ms
176 latency average = 14.364 ms
192 latency average = 12.627 ms

remove single-block loop:
128 latency average = 10.211 ms
144 latency average = 10.987 ms
160 latency average = 12.094 ms
176 latency average = 12.927 ms
192 latency average = 12.466 ms

Keeping the extra loop is probably better at this benchmark on newer machines, but I don't think it's worth it. Microbenchmarks with fixed-size input don't take branch mispredicts into account.

> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
>
> May I ask which processor your older machine has? I am benchmarking on a Tigerlake processor.

It's an i7-7500U / Kaby Lake.
> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul; it would then fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would still work on hardware 14-15 years old, so I think it's fine to bump the requirement for runtime hardware support.

--
John Naylor
Amazon Web Services
Hi,

> 2. Unfortunately, there is another wrinkle that I failed to consider: If you search
> the web for "VirtualBox pclmulqdq" you can see a few reports from not very long
> ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
> big a problem that is in practice today, but it seems we should actually have
> separate checks, with fallback. Sorry I didn't think of this earlier.

If someone is using a VM that doesn't support a 15-year-old feature, then I would argue performance is not the top priority for them. But that's up to you. I will work on adding it unless you change your mind.

Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK? The former macro is used to enable running the SSE42 version without a runtime check when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it. The cpuid runtime dispatch runs just once when the postgres server is first set up and would hardly affect performance. Let me know what you think.

Raghuveer
On Wed, Feb 12, 2025 at 09:02:27PM +0000, Devulapalli, Raghuveer wrote:
> Also, do we really need to have both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
> features support? The former macro is used to enable running the SSE42 version without a runtime check
> when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
> as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
> The cpuid runtime dispatch runs just once when postgres server is first setup and would hardly affect performance.
> Let me know what you think.

I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer overhead if possible. I looked at switching to always using runtime checks for this stuff, and we concluded that we'd better not [0].

[0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

--
nathan
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks for this
> stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly, I might need to think about a cleaner way to do this.

Raghuveer
On Wed, Feb 12, 2025 at 09:48:57PM +0000, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible. I looked at switching to always using runtime checks for this
>> stuff, and we concluded that we'd better not [0].
>
> Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will look a little ugly, I might need to think about a cleaner way to do this.

Well, I suspect the AVX-512 version will pretty much always need the runtime check given that it's not available on a lot of newer hardware and requires a bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL optimizations if built with -msse4.2 -mpclmul because we don't want to regress existing -msse4.2 users with a runtime check.

--
nathan
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that it's not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
> worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
> existing -msse4.2 users with a runtime check.

Sounds good to me. Although, users building with just -msse4.2 will now encounter an additional pclmul runtime check. That would be a regression unless they update to building with both -msse4.2 and -mpclmul.

Raghuveer
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that it's not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
>> worth doing for PCLMUL. Otherwise, I think we'd have to leave out the PCLMUL
>> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
>> existing -msse4.2 users with a runtime check.
>
> Sounds good to me. Although, users building with just -msse4.2 will now encounter an
> additional pclmul runtime check. That would be a regression unless they update to
> building with both -msse4.2 and -mpclmul.

My thinking was that building with just -msse4.2 would cause the existing SSE 4.2 implementation to be used (without the function pointer). That's admittedly a bit goofy because they'd miss out on the PCLMUL optimization, but things at least don't get any worse for them.

--
nathan
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the existing SSE 4.2
> implementation to be used (without the function pointer). That's admittedly a bit
> goofy because they'd miss out on the PCLMUL optimization, but things at least
> don't get any worse for them.

Right. We are only talking about a regression for the small portion of people who build with -msse4.2 and run on Nehalem (or a VM with pclmul disabled), where we will run the cpuid check for pclmul and still pick the sse4.2 version.

Raghuveer
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtime check for pclmul.
> - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> inline calls with short constant input (e.g. WAL insert lock),
> although I've not done that here.
> - This can be a simple starting point for consolidating runtime
> checks, as was proposed for popcount in the AVX-512 CRC thread, but
> with branching my model was Andres' sketch here:
>
> https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de

Oh, nifty. IIUC this should help avoid quite a bit of overhead even before adding the PCLMUL stuff since it removes the function pointers for the runtime-check builds.

While this needn't block this patch set, I do find the dispatch code to be pretty complicated. Maybe we can improve that in the future by using macros to auto-generate much of it. My concern here is less about this particular patch set and more about the long-term maintainability as we add more and more stuff like it, each with its own tangled web of build and dispatch rules.

--
nathan
Hi John,

I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now, so I will refrain from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me. See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/

Raghuveer

> -----Original Message-----
> From: John Naylor <johncnaylorls@gmail.com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandbossart@gmail.com>
> Cc: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
> Subject: Re: Improve CRC32C performance on SSE4.2
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de
> >
> > Oh, nifty. IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
>
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it
> working so I'm probably missing a step.)
>
> master:
>
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
>
> v6:
>
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
>
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated. Maybe we can improve that in the future by
> > using macros to auto-generate much of it. My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
>
> I think the runtime dispatch is fairly simple for this case. To my taste, the worse
> maintainability hazard is the building. To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
>
> The first two could be done as part of this patch or soon after. The third is a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
>
> There's another potential irritation for maintenance too: I spent too much time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the function
> pointer way, so I'm open to other ideas.
>
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services
Hi John,

Just a few comments on v7:

> pg_cpucap_crc32c

I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy to extend in the future and keeps the functionalities separate.

> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform

+1. Sounds perfect. We should also move the avx512 runtime detection of popcount here. The runtime detection code could also be annotated with __attribute__((constructor)) so that it gets executed before main.

> I think the runtime dispatch is fairly simple for this case.
> + pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
> + {
> + ....
> + #ifdef HAVE_PCLMUL_RUNTIME
> + if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))

IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.

Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or can continue development work based on any feedback. Please let me know which you'd prefer.

Raghuveer
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:

> The runtime detection code could also be annotated with __attribute__((constructor)) so that it gets executed before main.

Hmm! I didn't know about that, thanks! It works on old gcc/clang, so that's good. I can't verify online if Arm etc. executes properly since the execute button is greyed out, but I don't see why it wouldn't:

https://godbolt.org/z/3as9MGr3G

Then we could have:

#ifdef FRONTEND
pg_attribute_constructor()
#endif
void
pg_cpucap_initialize(void)
{
...
}

Now, there is no equivalent on MSVC and workarounds are fragile [1]. Maybe we could only assert initialization happened for the backend, and for frontend either:
- add a couple manual initializations to the frontend programs where we don't want to lose performance for non-gcc/clang.
- require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas M.'s earlier findings on popcount (also SSE4.2) [2]

The first is less risky but less tidy.

> > I think the runtime dispatch is fairly simple for this case.
> + pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
> + {
> + ....
> + #ifdef HAVE_PCLMUL_RUNTIME
> + if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
>
> IMO, the pclmul and sse4.2 versions should be dispatched independently and the dispatch logic should be moved out of pg_crc32c.h to its own pg_crc32c_dispatch.c file.

That makes sense, but it does close the door on future inlining.

> Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy to provide feedback and review your work, or can continue development work based on any feedback. Please let me know which you'd prefer.

Sure. Depending on any feedback on the above constructor technique, I'll work on the following, since I've prototyped most of the first and the second is mostly copy-and-paste from your earlier work:

> > pg_cpucap_crc32c
>
> I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy to extend in the future and keeps the functionalities separate.
>
> > - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> > unconditionally for each platform
>
> +1. Sounds perfect. We should also move the avx512 runtime detection of popcount here.

[1] https://stackoverflow.com/questions/1113409/attribute-constructor-equivalent-in-vc
[2] https://www.postgresql.org/message-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com

--
John Naylor
Amazon Web Services
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend, and for
> frontend either:
> - add a couple manual initializations to the frontend programs where we don't
> want to lose performance for non-gcc/clang.
> - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> M.'s earlier findings on popcount (also SSE4.2) [2]
>
> The first is less risky but less tidy.

Agree, let me think about this, but I'm not sure if I have any useful suggestions here. MSVC is unfortunately not my strong suit :/

Raghuveer
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend, and for
> > frontend either:
> > - add a couple manual initializations to the frontend programs where we don't
> > want to lose performance for non-gcc/clang.
> > - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> > M.'s earlier findings on popcount (also SSE4.2) [2]
> >
> > The first is less risky but less tidy.
>
> Agree, let me think about this, but I'm not sure if I have any useful suggestions here. MSVC is unfortunately not my strong suit :/

Here's another idea to make it more automatic: Give up on initializing every capability at once. The first time we call CRC, it will be uninitialized, so this part:

if (pg_cpucap & PGCPUCAP_CRC32C)
    return COMP_CRC32C_HW(crc, data, len);
else
    return pg_comp_crc32c_sb8(crc, data, len);

...will call the SB8 path. Inside there, do the check:

#if defined(HAVE_CRC_RUNTIME)
    /* separate init bit for each capability */
    if (unlikely((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0))
    {
        pg_cpucap_crc32c();    /* also sets PGCPUCAP_CRC32C_INIT */
        if (pg_cpucap & PGCPUCAP_CRC32C)
            return COMP_CRC32C_HW(crc, data, len);
    }
#endif

/* ...fall through to SB8 */

--
John Naylor
Amazon Web Services
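The lazy-check idea sketched above can be made concrete roughly as follows: a self-contained toy with the CPUID probe stubbed out to always report hardware support; the names follow the email, not actual PostgreSQL code, and the counters exist only to make the dispatch order observable:

```c
#include <stddef.h>
#include <stdint.h>

#define PGCPUCAP_CRC32C_INIT (1u << 0)
#define PGCPUCAP_CRC32C      (1u << 1)

static uint32_t pg_cpucap;
static int hw_calls, sb8_calls;

/* Stand-in for the real CPUID probe; here we pretend the CPU
 * has hardware CRC so the redirect path is exercised. */
static void
pg_cpucap_crc32c(void)
{
    pg_cpucap |= PGCPUCAP_CRC32C_INIT | PGCPUCAP_CRC32C;
}

static uint32_t
comp_crc32c_hw(uint32_t crc, const void *data, size_t len)
{
    (void) data; (void) len;
    hw_calls++;
    return crc;                 /* real code would use the crc32 instruction */
}

static uint32_t
pg_comp_crc32c_sb8(uint32_t crc, const void *data, size_t len)
{
    /* First call lands here while capabilities are unknown:
     * probe once, then redirect to the hardware path if present. */
    if ((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0)
    {
        pg_cpucap_crc32c();
        if (pg_cpucap & PGCPUCAP_CRC32C)
            return comp_crc32c_hw(crc, data, len);
    }
    sb8_calls++;
    return crc;                 /* real code would run slicing-by-8 */
}

static uint32_t
pg_comp_crc32c(uint32_t crc, const void *data, size_t len)
{
    if (pg_cpucap & PGCPUCAP_CRC32C)
        return comp_crc32c_hw(crc, data, len);
    return pg_comp_crc32c_sb8(crc, data, len);
}
```

Calling pg_comp_crc32c twice routes the first call through the SB8 entry point, which probes and redirects to the hardware path; subsequent calls branch straight to it, so no startup hook is needed.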