Thread: RE: Improve CRC32C performance on SSE4.2

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:

Hi John,

 

> I'm highly suspicious of these numbers because they show this version

> is about 20x faster than "scalar", so relatively speaking 3x faster

> than the AVX-512 proposal?

 

Apologies for this. I was suspicious of this too, and it looks like I had unintentionally used the scalar version I wrote for testing CRC32C correctness (which computes the CRC one byte at a time) as the baseline. The code for the microbenchmarks is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:

 

Comparing scalar_crc32c to sse42_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.0972         -0.0971             5             4             5             4

[scalar_crc32c vs. sse42_crc32c]/128                 -0.3048         -0.3048             8             6             8             6

[scalar_crc32c vs. sse42_crc32c]/256                 -0.4610         -0.4610            19            10            19            10

[scalar_crc32c vs. sse42_crc32c]/512                 -0.6432         -0.6432            50            18            50            18

[scalar_crc32c vs. sse42_crc32c]/1024                -0.7192         -0.7192           121            34           121            34

[scalar_crc32c vs. sse42_crc32c]/2048                -0.7275         -0.7276           259            70           259            70
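
A minimal sketch of what that byte-at-a-time scalar reference looks like (illustrative only, not the exact code in the linked repo; the constant is the reflected CRC32C/Castagnoli polynomial):

#include <stddef.h>
#include <stdint.h>

/*
 * Byte-at-a-time CRC32C (reflected polynomial 0x82F63B78). Handy as a
 * correctness reference, but far too slow to serve as a benchmark baseline,
 * which is what skewed the earlier numbers.
 */
static uint32_t
crc32c_bytewise(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
    }
    return crc;
}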

 

> Luckily, that's easily fixable: It turns out the implementation

> in the paper (and chromium) has a very inefficient finalization step,

>  using more carryless multiplications and plenty of other operations.

> After the main loop, and at the end, it's much more efficient to

> convert the 128-bit intermediate result directly into a CRC in the

> usual way.
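
(The idea, roughly, is the following; a sketch only, assuming the folding loop has already absorbed the running CRC into the 128-bit accumulator, and compiled with -msse4.2:)

#include <nmmintrin.h>
#include <stdint.h>

/*
 * Sketch: treat the folded 128-bit accumulator as the final 16 message bytes
 * and let the crc32 instruction perform the reduction, instead of a longer
 * chain of carryless multiplications.
 */
static uint32_t
crc32c_finalize_sketch(__m128i x)
{
    uint32_t    crc;

    crc = (uint32_t) _mm_crc32_u64(0, (uint64_t) _mm_cvtsi128_si64(x));
    crc = (uint32_t) _mm_crc32_u64(crc, (uint64_t) _mm_extract_epi64(x, 1));
    return crc;
}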

 

Thank you for pointing this out and also fixing it! This improves over the chromium version by 10-25%, especially for smaller buffer sizes (64 – 512 bytes):

 

Comparing sse42_crc32c to corsix_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[sse42_crc32c vs. corsix_crc32c]/64                  -0.2696         -0.2696             4             3             4             3

[sse42_crc32c vs. corsix_crc32c]/128                 -0.1551         -0.1552             6             5             6             5

[sse42_crc32c vs. corsix_crc32c]/256                 -0.1787         -0.1787            10             8            10             8

[sse42_crc32c vs. corsix_crc32c]/512                 -0.1351         -0.1351            18            15            18            15

[sse42_crc32c vs. corsix_crc32c]/1024                -0.0972         -0.0972            34            31            34            31

[sse42_crc32c vs. corsix_crc32c]/2048                -0.0763         -0.0763            69            64            69            64

OVERALL_GEOMEAN                                      -0.1544         -0.1544             0             0             0             0

 

> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.

 

I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes. I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you confirm that behavior on your end too?

 

---

src/port/pg_crc32c_sse42.c | 1 +

1 file changed, 1 insertion(+)

 

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c

index a8c1e5609b..a350b1b93a 100644

--- a/src/port/pg_crc32c_sse42.c

+++ b/src/port/pg_crc32c_sse42.c

@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)

#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))

#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))

pg_attribute_no_sanitize_alignment()

+__attribute__((optimize("-O3")))

pg_attribute_target("sse4.2,pclmul")

pg_crc32c

pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)

--

 

You could also just build with export CFLAGS="-O3" instead of adding the function attribute.

 

> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.

 

May I ask which processor your older machine has? I am benchmarking on a Tigerlake processor.

 

> It's probably okay to fold these together in the same compile-time

> check, since both are fairly old by now, but for those following

> along, pclmul is not in SSE 4.2 and is a bit newer. So this would

> cause machines building on Nehalem (2008) to fail the check and go

> back to slicing-by-8 with it written this way.

 

Technically, the current version of the patch does not have a runtime cpuid check for pclmul, so it would crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul; with that, it would fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which you would prefer.
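
For reference, such a runtime check could look roughly like this (a sketch assuming gcc/clang's <cpuid.h>; the function name is illustrative, not the actual postgres symbol):

#include <cpuid.h>
#include <stdbool.h>

/*
 * CPUID leaf 1: ECX bit 20 = SSE4.2, ECX bit 1 = PCLMULQDQ.
 */
static bool
cpu_has_sse42_and_pclmul(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & bit_SSE4_2) != 0 && (ecx & bit_PCLMUL) != 0;
}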

 

Raghuveer

 

From: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Subject: Improve CRC32C performance on SSE4.2

 

This patch improves the performance of the SSE4.2 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the SSE4.2 version of CRC32 computation from Chromium's copy of zlib, with modified constants for CRC32C computation. See:

 

https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
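
For contrast, the current 8-bytes-at-a-time approach is essentially the following (a simplified sketch of the idea, not the exact code in pg_crc32c_sse42.c; requires -msse4.2):

#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t
crc32c_sse42_8byte_sketch(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /* main loop: one crc32 instruction per 8 bytes */
    while (len >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, p, 8);   /* avoid unaligned loads */
        crc = (uint32_t) _mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    /* tail: one byte at a time */
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}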

 

Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):

 

Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.8147         -0.8148            33             6            33             6

[scalar_crc32c vs. sse42_crc32c]/128                 -0.8962         -0.8962            88             9            88             9

[scalar_crc32c vs. sse42_crc32c]/256                 -0.9200         -0.9200           211            17           211            17

[scalar_crc32c vs. sse42_crc32c]/512                 -0.9389         -0.9389           486            30           486            30

[scalar_crc32c vs. sse42_crc32c]/1024                -0.9452         -0.9452          1037            57          1037            57

[scalar_crc32c vs. sse42_crc32c]/2048                -0.9456         -0.9456          2140           116          2140           116

 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from
> corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes.

That matches my findings as well.

> I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler
> generates less than optimal code. Adding that flag fixes the regression for buffers of 64 – 128 bytes. Could you
> confirm that behavior on your end too?

On my machine that still regresses compared to master in that range
(although by not as much) so I still think 128 bytes is the right
threshold.

The effect of -O3 with gcc14.2 is that the single-block loop (after
the 4-block loop) is unrolled. Unrolling adds branches and binary
space, so it'd be nice to avoid that even for systems that build with
-O3. I tried leaving out the single block loop, so that the tail is
called for the remaining <63 bytes, and it's actually better:

v2:

128
latency average = 10.256 ms
144
latency average = 11.393 ms
160
latency average = 12.977 ms
176
latency average = 14.364 ms
192
latency average = 12.627 ms

remove single-block loop:

128
latency average = 10.211 ms
144
latency average = 10.987 ms
160
latency average = 12.094 ms
176
latency average = 12.927 ms
192
latency average = 12.466 ms

Keeping the extra loop is probably better at this benchmark on newer
machines, but I don't think it's worth it. Microbenchmarks with fixed
sized input don't take branch mispredicts into account.

> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
>
> May I ask which processor your older machine has? I am benchmarking on a Tigerlake processor.

It's an i7-7500U / Kaby Lake.

> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul, so it would
> crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid
> check for pclmul; with that, it would fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If
> you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right
> pointer accordingly. Let me know which you would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would
still work on hardware 14-15 years old, so I think it's fine to bump
the requirement for runtime hardware support.

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi, 

> 2. Unfortunately, there is another wrinkle that I failed to consider: If you search
> the web for "VirtualBox pclmulqdq" you can see a few reports from not very long
> ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
> big a problem that is in practice today, but it seems we should actually have
> separate checks, with fallback. Sorry I didn't think of this earlier.

If someone is using a VM that doesn't support a 15-year-old feature, then I would argue performance is not
their top priority. But that's up to you. I will work on adding it unless you change your mind.

Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK?
The former macro is used to enable running the SSE42 version without a runtime check
when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
The cpuid runtime dispatch runs just once when the postgres server starts up and would hardly affect performance.
Let me know what you think.

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 09:02:27PM +0000, Devulapalli, Raghuveer wrote:
> Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK?
> The former macro is used to enable running the SSE42 version without a runtime check
> when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
> as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
> The cpuid runtime dispatch runs just once when the postgres server starts up and would hardly affect performance.
> Let me know what you think.

I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible.  I looked at switching to always using runtime checks
for this stuff, and we concluded that we'd better not [0].

[0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible.  I looked at switching to always using runtime checks for this
> stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will
look a little ugly; I might need to think about a cleaner way to do this.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 09:48:57PM +0000, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible.  I looked at switching to always using runtime checks for this
>> stuff, and we concluded that we'd better not [0].
> 
> Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that
> will look a little ugly; I might need to think about a cleaner way to do this.
 

Well, I suspect the AVX-512 version will pretty much always need the
runtime check given that it's not available on a lot of newer hardware and
requires a bunch of extra runtime checks (see pg_popcount_avx512.c).  But
it might be worth doing for PCLMUL.  Otherwise, I think we'd have to leave
out the PCLMUL optimizations if built with -msse4.2 -mpclmul because we
don't want to regress existing -msse4.2 users with a runtime check.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that it's not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c).  But it might be
> worth doing for PCLMUL.  Otherwise, I think we'd have to leave out the PCLMUL
> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
> existing -msse4.2 users with a runtime check.

Sounds good to me. Although, users building with just -msse4.2 will now encounter an
additional pclmul runtime check. That would be a regression unless they update to
building with both -msse4.2 and -mpclmul.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that it's not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c).  But it might be
>> worth doing for PCLMUL.  Otherwise, I think we'd have to leave out the PCLMUL
>> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
>> existing -msse4.2 users with a runtime check.
> 
> Sounds good to me. Although, users building with just -msse4.2 will now encounter an
> additional pclmul runtime check. That would be a regression unless they update to
> building with both -msse4.2 and -mpclmul. 

My thinking was that building with just -msse4.2 would cause the existing
SSE 4.2 implementation to be used (without the function pointer).  That's
admittedly a bit goofy because they'd miss out on the PCLMUL optimization,
but things at least don't get any worse for them.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the existing SSE 4.2
> implementation to be used (without the function pointer).  That's admittedly a bit
> goofy because they'd miss out on the PCLMUL optimization, but things at least
> don't get any worse for them.

Right. We are only talking about a regression for the small portion of people who build
with -msse4.2 and run on Nehalem or a VM with pclmul disabled, where we will run
the cpuid check for pclmul and still pick the sse4.2 version.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtime check for pclmul.
> - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> inline calls with short constant input (e.g. WAL insert lock),
> although I've not done that here.
> - This can be a simple starting point for consolidating runtime
> checks, as was proposed for popcount in the AVX-512 CRC thread, but
> with branching my model was Andres' sketch here:
> 
> https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de

Oh, nifty.  IIUC this should help avoid quite a bit of overhead even before
adding the PCLMUL stuff since it removes the function pointers for the
runtime-check builds.

While this needn't block this patch set, I do find the dispatch code to be
pretty complicated.  Maybe we can improve that in the future by using
macros to auto-generate much of it.  My concern here is less about this
particular patch set and more about the long term maintainability as we add
more and more stuff like it, each with its own tangled web of build and
dispatch rules.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now, so I will refrain
from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the
pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me.
 

See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/ 

Raghuveer

> -----Original Message-----
> From: John Naylor <johncnaylorls@gmail.com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandbossart@gmail.com>
> Cc: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>; pgsql-
> hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76on
> > > e%40awork3.anarazel.de
> >
> > Oh, nifty.  IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
> 
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it working so
> I'm probably missing a step.)
> 
> master:
> 
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
> 
> v6:
> 
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
> 
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated.  Maybe we can improve that in the future by
> > using macros to auto-generate much of it.  My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
> 
> I think the runtime dispatch is fairly simple for this case. To my taste, the worse
> maintainability hazard is the building. To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
> 
> The first two could be done as part of this patch or soon after. The third is a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
> 
> There's another potential irritation for maintenance too: I spent too much time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the function
> pointer way, so I'm open to other ideas.
> 
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

Just a few comments on v7: 

> pg_cpucap_crc32c

I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy
to extend in the future and keeps the functionalities separate.
 

> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform

+1. Sounds perfect. We should also move the avx512 runtime detection of popcount here. The runtime detection code could
also be marked with __attribute__((constructor)) so that it gets executed before main.
 

> I think the runtime dispatch is fairly simple for this case.
+ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+ {
+ ....
+ #ifdef HAVE_PCLMUL_RUNTIME
+   if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))

IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of
pg_crc32c.h into its own pg_crc32c_dispatch.c file.
 

Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy
to provide feedback and review your work, or I can continue development based on any feedback. Please let me know
which you'd prefer.
 

Raghuveer


Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> The runtime detection code could also be marked with __attribute__((constructor)) so that it gets executed
> before main.

Hmm! I didn't know about that, thanks! It works on old gcc/clang, so
that's good. I can't verify online if Arm etc. executes properly since
the execute button is greyed out, but I don't see why it wouldn't:

https://godbolt.org/z/3as9MGr3G

Then we could have:

#ifdef FRONTEND
pg_attribute_constructor()
#endif
void
pg_cpucap_initialize(void)
{
  ...
}
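
Here pg_attribute_constructor() could be defined along these lines (a sketch of an assumed macro, not existing postgres code; it would expand to nothing on compilers without the attribute):

#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_constructor() __attribute__((constructor))
#else
#define pg_attribute_constructor()  /* no-op, e.g. MSVC */
#endif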

Now, there is no equivalent on MSVC and workarounds are fragile [1].
Maybe we could only assert that initialization happened for the backend, and
for the frontend either
- add a couple of manual initializations to the frontend programs
where we don't want to lose performance for non-gcc/clang.
- require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by
Thomas M.'s earlier findings on popcount (also SSE4.2) [2]

The first is less risky but less tidy.

> > I think the runtime dispatch is fairly simple for this case.
> + pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
> + {
> + ....
> + #ifdef HAVE_PCLMUL_RUNTIME
> +   if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
>
> IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of
> pg_crc32c.h into its own pg_crc32c_dispatch.c file.

That makes sense, but it does close the door on future inlining.

> Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm
> happy to provide feedback and review your work, or I can continue development based on any feedback. Please let me
> know which you'd prefer.

Sure. Depending on any feedback on the above constructor technique,
I'll work on the following, since I've prototyped most of the first
and the second is mostly copy-and-paste from your earlier work:

> > pg_cpucap_crc32c
>
> I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy
> to extend in the future and keeps the functionalities separate.
>
> > - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> > unconditionally for each platform
>
> +1. Sounds perfect. We should also move the avx512 runtime detection of popcount here.

[1] https://stackoverflow.com/questions/1113409/attribute-constructor-equivalent-in-vc
[2] https://www.postgresql.org/message-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend and for
> frontend either
> - add a couple of manual initializations to the frontend programs where we don't
> want to lose performance for non-gcc/clang.
> - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> M.'s earlier findings on popcount (also SSE4.2) [2]
> 
> The first is less risky but less tidy.

Agreed, let me think about this, but I'm not sure I have any useful suggestions here. MSVC is unfortunately not my
strong suit :/
 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend and for
> > frontend either
> > - add a couple of manual initializations to the frontend programs where we don't
> > want to lose performance for non-gcc/clang.
> > - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> > M.'s earlier findings on popcount (also SSE4.2) [2]
> >
> > The first is less risky but less tidy.
>
> Agreed, let me think about this, but I'm not sure I have any useful suggestions here. MSVC is unfortunately not my
> strong suit :/

Here's another idea to make it more automatic: Give up on initializing
every capability at once. The first time we call CRC, it will be
uninitialized, so this part:

if (pg_cpucap & PGCPUCAP_CRC32C)
  return COMP_CRC32C_HW(crc, data, len);
else
  return pg_comp_crc32c_sb8(crc, data, len);

...will call the SB8 path. Inside there, do the check:

#if defined(HAVE_CRC_RUNTIME)
// separate init bit for each capability
if (unlikely((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0))
{
  pg_cpucap_crc32c(); // also sets PGCPUCAP_CRC32C_INIT
  if (pg_cpucap & PGCPUCAP_CRC32C)
    return COMP_CRC32C_HW(crc, data, len);
}
#endif
// ...fallthrough to SB8

--
John Naylor
Amazon Web Services