Thread: RE: Improve CRC32C performance on SSE4.2

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:

Hi John,

 

> I'm highly suspicious of these numbers because they show this version

> is about 20x faster than "scalar", so relatively speaking 3x faster

> than the AVX-512 proposal?

 

Apologies for this. I was suspicious of it too, and it looks like I had unintentionally benchmarked the scalar version I wrote for testing CRC32C correctness (which computes the CRC one byte at a time). The code for the microbenchmarks is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:

 

Comparing scalar_crc32c to sse42_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.0972         -0.0971             5             4             5             4

[scalar_crc32c vs. sse42_crc32c]/128                 -0.3048         -0.3048             8             6             8             6

[scalar_crc32c vs. sse42_crc32c]/256                 -0.4610         -0.4610            19            10            19            10

[scalar_crc32c vs. sse42_crc32c]/512                 -0.6432         -0.6432            50            18            50            18

[scalar_crc32c vs. sse42_crc32c]/1024                -0.7192         -0.7192           121            34           121            34

[scalar_crc32c vs. sse42_crc32c]/2048                -0.7275         -0.7276           259            70           259            70
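For reference, a byte-at-a-time scalar CRC32C of the kind that skewed the first round of numbers can be sketched as follows (a minimal bitwise version for illustration, not the exact code from the repository above):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Processes one byte at a time, so it is far slower than the
 * slicing-by-8 or SSE4.2 implementations discussed in this thread. */
static uint32_t
crc32c_bytewise(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	crc = ~crc;
	while (len--)
	{
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return ~crc;
}
```

The standard CRC-32C check value applies: crc32c_bytewise(0, "123456789", 9) yields 0xE3069283.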

 

> Luckily, that's easily fixable: It turns out the implementation

> in the paper (and chromium) has a very inefficient finalization step,

>  using more carryless multiplications and plenty of other operations.

> After the main loop, and at the end, it's much more efficient to

> convert the 128-bit intermediate result directly into a CRC in the

> usual way.

 

Thank you for pointing this out and also for fixing it! This improves over the chromium version by 10-25%, especially for smaller buffer sizes (64-512 bytes):

 

Comparing sse42_crc32c to corsix_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[sse42_crc32c vs. corsix_crc32c]/64                  -0.2696         -0.2696             4             3             4             3

[sse42_crc32c vs. corsix_crc32c]/128                 -0.1551         -0.1552             6             5             6             5

[sse42_crc32c vs. corsix_crc32c]/256                 -0.1787         -0.1787            10             8            10             8

[sse42_crc32c vs. corsix_crc32c]/512                 -0.1351         -0.1351            18            15            18            15

[sse42_crc32c vs. corsix_crc32c]/1024                -0.0972         -0.0972            34            31            34            31

[sse42_crc32c vs. corsix_crc32c]/2048                -0.0763         -0.0763            69            64            69            64

OVERALL_GEOMEAN                                      -0.1544         -0.1544             0             0             0             0

 

> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.

 

I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in the master branch when the buffer is < 128 bytes. I think the reason is that postgres is not using the -O3 flag to build the crc32c source files, so the compiler generates less than optimal code. Adding that flag fixes the regression for buffers of 64-128 bytes. Could you confirm that behavior on your end too?

 

---

src/port/pg_crc32c_sse42.c | 1 +

1 file changed, 1 insertion(+)

 

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c

index a8c1e5609b..a350b1b93a 100644

--- a/src/port/pg_crc32c_sse42.c

+++ b/src/port/pg_crc32c_sse42.c

@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)

#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))

#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))

pg_attribute_no_sanitize_alignment()

+__attribute__((optimize("-O3")))

pg_attribute_target("sse4.2,pclmul")

pg_crc32c

pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)

--

 

You could also just build with export CFLAGS="-O3" instead of adding the function attribute.

 

> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.

 

May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.

 

> It's probably okay to fold these together in the same compile-time

> check, since both are fairly old by now, but for those following

> along, pclmul is not in SSE 4.2 and is a bit newer. So this would

> cause machines building on Nehalem (2008) to fail the check and go

> back to slicing-by-8 with it written this way.

 

Technically, the current version of the patch does not have a runtime cpuid check for pclmul, and so it would crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul; that version would fall back to slicing-by-8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.
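For illustration, a choose function with the extra cpuid check might look like this on x86 with GCC/Clang (the implementation names here are stand-ins, not the actual functions in the patch):

```c
#include <cpuid.h>				/* GCC/Clang: __get_cpuid, bit_SSE4_2, bit_PCLMUL */
#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* stand-ins for the real implementations */
static uint32_t crc32c_sb8(uint32_t c, const void *d, size_t n)    { (void) d; (void) n; return c; }
static uint32_t crc32c_sse42(uint32_t c, const void *d, size_t n)  { (void) d; (void) n; return c; }
static uint32_t crc32c_pclmul(uint32_t c, const void *d, size_t n) { (void) d; (void) n; return c; }

static crc32c_fn
pg_comp_crc32c_choose(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
	{
		/* CPUID.1:ECX bit 20 = SSE4.2, bit 1 = PCLMULQDQ */
		if ((ecx & bit_SSE4_2) && (ecx & bit_PCLMUL))
			return crc32c_pclmul;	/* Westmere and above */
		if (ecx & bit_SSE4_2)
			return crc32c_sse42;	/* Nehalem: sse4.2 but no pclmul */
	}
	return crc32c_sb8;			/* slicing-by-8 fallback */
}
```

Checking both bits before selecting the pclmul version is what avoids the SIGILL on Nehalem.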

 

Raghuveer

 

From: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Subject: Improve CRC32C performance on SSE4.2

 

This patch improves the performance of the SSE4.2 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processes 64 bytes at a time. The algorithm is based on the sse42 version of crc32 computation from Chromium's copy of zlib, with modified constants for crc32c computation. See:

 

https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c
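The existing 8-bytes-at-a-time pattern described above looks roughly like this (a simplified sketch, not the actual pg_comp_crc32c_sse42):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <nmmintrin.h>			/* SSE4.2 crc32 intrinsics */

__attribute__((target("sse4.2")))
static uint32_t
crc32c_sse42_8byte(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	/* main loop: one crc32 instruction per 8 bytes */
	while (len >= 8)
	{
		uint64_t	v;

		memcpy(&v, p, 8);
		crc = (uint32_t) _mm_crc32_u64(crc, v);
		p += 8;
		len -= 8;
	}
	/* tail: finish byte by byte */
	while (len--)
		crc = _mm_crc32_u8(crc, *p++);
	return crc;
}
```

With the usual conventions (initialize to 0xFFFFFFFF, complement at the end), crc32c_sse42_8byte(0xFFFFFFFF, "123456789", 9) ^ 0xFFFFFFFF gives the standard check value 0xE3069283. The pclmulqdq version instead folds four 16-byte blocks per iteration, which is where the speedup on larger buffers comes from.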

 

Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):

 

Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.8147         -0.8148            33             6            33             6

[scalar_crc32c vs. sse42_crc32c]/128                 -0.8962         -0.8962            88             9            88             9

[scalar_crc32c vs. sse42_crc32c]/256                 -0.9200         -0.9200           211            17           211            17

[scalar_crc32c vs. sse42_crc32c]/512                 -0.9389         -0.9389           486            30           486            30

[scalar_crc32c vs. sse42_crc32c]/1024                -0.9452         -0.9452          1037            57          1037            57

[scalar_crc32c vs. sse42_crc32c]/2048                -0.9456         -0.9456          2140           116          2140           116

 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from
corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.

That matches my findings as well.

> I think the reason is that postgres is not using -O3 flag build the crc32c source files and the compiler generates
less than optimal code. Adding that flag fixes the regression for buffers with 64 bytes - 128 bytes. Could you confirm
that behavior on your end too?

On my machine that still regresses compared to master in that range
(although not by as much), so I still think 128 bytes is the right
threshold.

The effect of -O3 with gcc 14.2 is that the single-block loop (after
the 4-block loop) is unrolled. Unrolling adds branches and binary
space, so it'd be nice to avoid that even for systems that build with
-O3. I tried leaving out the single-block loop, so that the tail is
called for the remaining <64 bytes, and it's actually better:

v2:

128
latency average = 10.256 ms
144
latency average = 11.393 ms
160
latency average = 12.977 ms
176
latency average = 14.364 ms
192
latency average = 12.627 ms

remove single-block loop:

128
latency average = 10.211 ms
144
latency average = 10.987 ms
160
latency average = 12.094 ms
176
latency average = 12.927 ms
192
latency average = 12.466 ms

Keeping the extra loop is probably better at this benchmark on newer
machines, but I don't think it's worth it. Microbenchmarks with fixed
sized input don't take branch mispredicts into account.

> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
>
> May I ask which processor your older machine has? I am benchmarking on a Tiger Lake processor.

It's an i7-7500U / Kaby Lake.

> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to
crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid
check for pclmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If
you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right
pointer accordingly.  Let me know which one you would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would
still work on hardware 14-15 years old, so I think it's fine to bump
the requirement for runtime hardware support.

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi, 

> 2. Unfortunately, there is another wrinkle that I failed to consider: If you search
> the web for "VirtualBox pclmulqdq" you can see a few reports from not very long
> ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
> big a problem that is in practice today, but it seems we should actually have
> separate checks, with fallback. Sorry I didn't think of this earlier.

If someone is using a VM that doesn't support a 15-year-old feature, then I would argue performance is not
their top priority. But that's up to you. I will work on adding it unless you change your mind.

Also, do we really need to support both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK?
The former macro is used to enable running the SSE42 version without a runtime check
when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
The cpuid runtime dispatch runs just once when the postgres server first starts and would hardly affect performance.
Let me know what you think.

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 09:02:27PM +0000, Devulapalli, Raghuveer wrote:
> Also, do we really need to have both USE_SSE42_CRC32C and USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
> features support? The former macro is used to enable running the SSE42 version without a runtime check 
> when someone builds with -msse4.2. The code looks fine now, but the runtime dispatch rules get complicated
> as we add the PCLMUL and AVX512 dispatch in the future. IMO, this additional complexity is not worth it.
> The cpuid runtime dispatch runs just once when postgres server is first setup and would hardly affect performance. 
> Let me know what you think.

I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
overhead if possible.  I looked at switching to always using runtime checks
for this stuff, and we concluded that we'd better not [0].

[0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible.  I looked at switching to always using runtime checks for this
> stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13

Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that will
look a little ugly; I might need to think about a cleaner way to do this.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 09:48:57PM +0000, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible.  I looked at switching to always using runtime checks for this
>> stuff, and we concluded that we'd better not [0].
> 
> Does that mean we want this feature for the new PCLMUL (and AVX-512) crc32c implementations too? The code for that
will look a little ugly, I might need to think about a cleaner way to do this.
 

Well, I suspect the AVX-512 version will pretty much always need the
runtime check given that it's not available on a lot of newer hardware and
requires a bunch of extra runtime checks (see pg_popcount_avx512.c).  But
it might be worth doing for PCLMUL.  Otherwise, I think we'd have to leave
out the PCLMUL optimizations if built with -msse4.2 -mpclmul because we
don't want to regress existing -msse4.2 users with a runtime check.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that it's not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c).  But it might be
> worth doing for PCLMUL.  Otherwise, I think we'd have to leave out the PCLMUL
> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
> existing -msse4.2 users with a runtime check.

Sounds good to me. Although, users building with just -msse4.2 will now encounter
an additional pclmul runtime check. That would be a regression unless they update to
building with both -msse4.2 and -mpclmul.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Feb 12, 2025 at 10:12:20PM +0000, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that it's not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c).  But it might be
>> worth doing for PCLMUL.  Otherwise, I think we'd have to leave out the PCLMUL
>> optimizations if built with -msse4.2 -mpclmul because we don't want to regress
>> existing -msse4.2 users with a runtime check.
> 
> Sounds good to me. Although, users building with just -msse4.2 will now encounter
> an additional pclmul runtime check. That would be a regression unless they update to
> building with both -msse4.2 and -mpclmul.

My thinking was that building with just -msse4.2 would cause the existing
SSE 4.2 implementation to be used (without the function pointer).  That's
admittedly a bit goofy because they'd miss out on the PCLMUL optimization,
but things at least don't get any worse for them.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the existing SSE 4.2
> implementation to be used (without the function pointer).  That's admittedly a bit
> goofy because they'd miss out on the PCLMUL optimization, but things at least
> don't get any worse for them.

Right. We are only talking about a regression for the small portion of users who build
with -msse4.2 and run on Nehalem or on a VM with pclmul disabled, where we will run
the cpuid check for pclmul and still pick the sse4.2 version.

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtime check for pclmul.
> - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> inline calls with short constant input (e.g. WAL insert lock),
> although I've not done that here.
> - This can be a simple starting point for consolidating runtime
> checks, as was proposed for popcount in the AVX-512 CRC thread, but
> with branching my model was Andres' sketch here:
> 
> https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76one%40awork3.anarazel.de

Oh, nifty.  IIUC this should help avoid quite a bit of overhead even before
adding the PCLMUL stuff since it removes the function pointers for the
runtime-check builds.

While this needn't block this patch set, I do find the dispatch code to be
pretty complicated.  Maybe we can improve that in the future by using
macros to auto-generate much of it.  My concern here is less about this
particular patch set and more about the long term maintainability as we add
more and more stuff like it, each with its own tangled web of build and
dispatch rules.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now, so I will refrain
from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the
pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me.
 

See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/ 

Raghuveer

> -----Original Message-----
> From: John Naylor <johncnaylorls@gmail.com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandbossart@gmail.com>
> Cc: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>; pgsql-
> hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76on
> > > e%40awork3.anarazel.de
> >
> > Oh, nifty.  IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
> 
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it working so
> I'm probably missing a step.)
> 
> master:
> 
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
> 
> v6:
> 
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
> 
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated.  Maybe we can improve that in the future by
> > using macros to auto-generate much of it.  My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
> 
> I think the runtime dispatch is fairly simple for this case. To my taste, the worse
> maintainability hazard is the building. To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
> 
> The first two could be done as part of this patch or soon after. The third is a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
> 
> There's another potential irritation for maintenance too: I spent too much time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the function
> pointer way, so I'm open to other ideas.
> 
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

Just a few comments on v7: 

> pg_cpucap_crc32c

I like the idea of moving all CPU runtime detection to a single module outside of the implementations. This makes it easy
to extend in the future and keeps the functionalities separate.
 

> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform

+1. Sounds perfect. We should also move the avx512 runtime detection of popcount here. The runtime detection code could
also be annotated with __attribute__((constructor)) so that it gets executed before main.
 

> I think the runtime dispatch is fairly simple for this case.
+ pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
+ {
+ ....
+ #ifdef HAVE_PCLMUL_RUNTIME
+   if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))

IMO, the pclmul and sse4.2 versions should be dispatched independently, and the dispatch logic should be moved out of
pg_crc32c.h to its own pg_crc32c_dispatch.c file.
 

Also, thank you for taking the lead on developing this patch. Since the latest patch seems to be in good shape, I'm happy
to provide feedback and review your work, or can continue development work based on any feedback. Please let me know
which you'd prefer.
 

Raghuveer


Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> The runtime detection code could also be appended with function __attribute__((constructor)) so that it gets executed
before main.

Hmm! I didn't know about that, thanks! It works on old gcc/clang, so
that's good. I can't verify online if Arm etc. executes properly since
the execute button is greyed out, but I don't see why it wouldn't:

https://godbolt.org/z/3as9MGr3G

Then we could have:

#ifdef FRONTEND
pg_attribute_constructor()
#endif
void
pg_cpucap_initialize(void)
{
  ...
}

Now, there is no equivalent on MSVC and workarounds are fragile [1].
Maybe we could only assert initialization happened for the backend and
for frontend either
- add a couple of manual initializations to the frontend programs
where we don't want to lose performance for non-gcc/clang.
- require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by
Thomas M.'s earlier findings on popcount (also SSE4.2) [2]

The first is less risky but less tidy.

> > I think the runtime dispatch is fairly simple for this case.
> + pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
> + {
> + ....
> + #ifdef HAVE_PCLMUL_RUNTIME
> +   if (len >= PCLMUL_THRESHOLD && (pg_cpucap & PGCPUCAP_CLMUL))
>
> IMO, the pclmul and sse4.2 versions should be dispatched independently and the dispatch logic should be moved out of
pg_crc32c.h to its own pg_crc32c_dispatch.c file.

That makes sense, but it does close the door on future inlining.

> Also, thank you for taking lead on developing this patch. Since the latest patch seems to be in a good shape, I'm
happy to provide feedback and review your work, or can continue development work based on any feedback. Please let me
know which you'd prefer.

Sure. Depending on any feedback on the above constructor technique,
I'll work on the following, since I've prototyped most of the first
and the second is mostly copy-and-paste from your earlier work:

> > pg_cpucap_crc32c
>
> I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy
to extend in the future and keeps the functionalities separate.
>
> > - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> > unconditionally for each platform
>
> +1. Sounds perfect. We should also move the avx512 runtime detection of popcount here.

[1] https://stackoverflow.com/questions/1113409/attribute-constructor-equivalent-in-vc
[2] https://www.postgresql.org/message-id/CA+hUKGKS64zJezV9y9mPcB-J0i+fLGiv3FAdwSH_3SCaVdrjyQ@mail.gmail.com

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend and for
> frontend either
> - add a couple manual initializations to to the frontend programs where we don't
> want to lose performance for non-gcc/clang.
> - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> M.'s earlier findings on popcount (also SSE4.2) [2]
> 
> The first is less risky but less tidy.

Agreed, let me think about this, but I'm not sure I have any useful suggestions here. MSVC is unfortunately not my
strong suit :/
 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend and for
> > frontend either
> > - add a couple manual initializations to to the frontend programs where we don't
> > want to lose performance for non-gcc/clang.
> > - require CRC on x86-64 MSVC since Windows 10 is EOL soon, going by Thomas
> > M.'s earlier findings on popcount (also SSE4.2) [2]
> >
> > The first is less risky but less tidy.
>
> Agree, let me think about this but not sure if I have any useful suggestions here. MSVC is unfortunately not my
strong suit :/

Here's another idea to make it more automatic: Give up on initializing
every capability at once. The first time we call CRC, it will be
uninitialized, so this part:

if (pg_cpucap & PGCPUCAP_CRC32C)
  return COMP_CRC32C_HW(crc, data, len);
else
  return pg_comp_crc32c_sb8(crc, data, len);

...will call the SB8 path. Inside there, do the check:

#if defined(HAVE_CRC_RUNTIME)
// separate init bit for each capability
if (unlikely((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0))
{
  pg_cpucap_crc32c(); // also sets PGCPUCAP_CRC32C_INIT
  if (pg_cpucap & PGCPUCAP_CRC32C)
    return COMP_CRC32C_HW(crc, data, len);
}
#endif
// ...fallthrough to SB8
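A runnable toy version of this lazy-init pattern (with a dummy probe and dummy CRC functions standing in for the real ones) would be:

```c
#include <stdint.h>

#define PGCPUCAP_CRC32C_INIT	(1u << 0)	/* separate init bit */
#define PGCPUCAP_CRC32C			(1u << 1)

static uint32_t pg_cpucap;
static int	probes;				/* counts how often the probe ran */

/* stand-in for the cpuid probe; pretend hardware CRC is present */
static void
pg_cpucap_crc32c(void)
{
	probes++;
	pg_cpucap |= PGCPUCAP_CRC32C_INIT | PGCPUCAP_CRC32C;
}

static uint32_t comp_crc32c_hw(uint32_t crc) { return crc + 1; }	/* dummy */

static uint32_t
comp_crc32c_sb8(uint32_t crc)
{
	/* first call ever: probe once, then take the HW path if available */
	if ((pg_cpucap & PGCPUCAP_CRC32C_INIT) == 0)
	{
		pg_cpucap_crc32c();
		if (pg_cpucap & PGCPUCAP_CRC32C)
			return comp_crc32c_hw(crc);
	}
	return crc;					/* ...fallthrough to real SB8 code */
}

static uint32_t
comp_crc32c(uint32_t crc)
{
	if (pg_cpucap & PGCPUCAP_CRC32C)
		return comp_crc32c_hw(crc);
	return comp_crc32c_sb8(crc);
}
```

Calling comp_crc32c() twice runs the probe exactly once; the second call dispatches straight to the hardware path.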

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Here's another idea to make it more automatic: Give up on initializing every
> capability at once. 

I'm not sure I like giving that up. Initializing and running the CPUID check with the attribute constructor is very
valuable for two reasons: (1) you get everything done at load time before main, and (2) you don't have to run the cpuid check
for every feature (popcount, crc32c, or anything else we add in the future) multiple times. It keeps the cpuid
functionality in a central place, which makes for a modular design.
 

On MSVC, we could have the first SIMD feature call pg_cpucap_initialize(), which runs CPUID and stores the cpu features. Any
subsequent call can skip it (because it has already been initialized) by using a static variable or some other approach.
Does this make sense?
 

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > Here's another idea to make it more automatic: Give up on initializing every
> > capability at once.
>
> I'm not sure I like giving up this. Initializing and running CPUID check with the attribute constructor is very
valuable for two reasons: (1) you get everything done at load time before main and (2) you don't have to run the cpuid check
for every feature (popcount, crc32c, or anything else you add in the future) multiple times. It keeps the cpuid
functionality in a central place that makes it a modular design.

I agree it would be preferable to make a centralized check work.

> On MSVC, we could have the first SIMD feature call pg_cpucap_initialize() which runs CPUID and stores the cpu features.
Any subsequent call can skip (because it has already been initialized) by using a static variable or some other
approach. Does this make sense?

Correct me if I'm misunderstanding, but this sounds like in every
frontend program we'd need to know what the first call was, which
seems less maintainable than just initializing at the start of every
frontend program.


--
John Naylor
Amazon Web Services



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Feb 18, 2025 at 1:40 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:

> > While this needn't block this patch set, I do find the dispatch code to be
> > pretty complicated.  Maybe we can improve that in the future by using
> > macros to auto-generate much of it.  My concern here is less about this
> > particular patch set and more about the long term maintainability as we add
> > more and more stuff like it, each with its own tangled web of build and
> > dispatch rules.

I had a further thought on this: CRC and non-vector popcount are kind
of special in that recent OSes assume they exist, and it's worth a bit
of effort to take advantage of that. Other things we may add should be
kept as simple as possible.

> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build
> them unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols
> where possible (I guess anything needing AVX-512 will still be a mess)

I've made a start of this for v8:

0001 is mostly the same as before
0002 (Meson-only for now) changes 0001 per the above, to see how it
looks, but I've not tried to add popcount or anything else. I like it
overall, but some details may need tweaking.
0004 generates the pclmul loop slightly differently to simplify
integrating with our code, but shouldn't make a big difference

Another thing I found in Agner's manuals: AMD Zen, even as recently as
Zen 4, don't have as good a microarchitecture for PCLMUL, so if anyone
with such a machine would like to help test the cutoff, the script is
at

https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com

(needs "CREATE EXTENSION test_crc32c;" to run it)

--
John Naylor
Amazon Web Services

Attachment

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> I agree it would be preferable to make a centralized check work.

Here is my first stab at it. v9 is the same as v8 plus a commit to move all cpuid checks into one single place, including
the AVX512 popcount code. Any new feature that requires CPUID information can access that information with
pg_cpu_have[FEATURE] defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in the configure files
which caused a build failure. It's fixed in v9.
 

Pretty sure the ARM code paths need some correction. Let me know what you think.
 
> Correct me if I'm misunderstanding, but this sounds like in every frontend
> program we'd need to know what the first call was, which seems less
> maintainable than just initializing at the start of every frontend program.

No, sorry for the confusion, but that is not what I meant. Let's ignore the attribute constructor for now. We can
probably revisit this at a later point.
 

Raghuveer

Attachment

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is the same as v8 plus a commit to move all cpuid checks into one single place, including
> the AVX512 popcount code. Any new feature that requires CPUID information can access that information with
> pg_cpu_have[FEATURE] defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in the configure files
> which caused a build failure. It's fixed in v9.
>
> Pretty sure the ARM code paths need some correction. Let me know what you think.

Thanks, I think this is a good direction. Some comments:

+static void pg_cpuid(int leaf, int subleaf, unsigned int* exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}

Our current configure still looks for __get_cpuid and __cpuid. We
committed checking these new ones fairly recently, and they were
further gated by USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. It seems like
here we should do something like the following, where "+" lines are
from the patch and other lines are mine:

+static void
+pg_cpucap_x86(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#endif

+    pg_cpucap[PG_CPU_FEATURE_SSE42] = (exx[2] & (1 << 20)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_PCLMUL] = (exx[2] & (1 << 1)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_POPCNT] = (exx[2] & (1 << 23)) != 0;
+    /* osxsave */
+    if ((exx[2] & (1 << 27)) == 0) {
+        return;
+    }
+    /* avx512 os support */
+    if (zmm_regs_available()) {
+        return;
+    }

// BTW, I do like the gating here that reduces the number of places
that have to know about zmm and xsave. (Side note: shouldn't that be
"if !(zmm_regs_available())"?)

+    /* reset for second cpuid call on leaf 7 to check extended avx512
support */

memset(exx, 0, sizeof(exx));

#if defined(HAVE__GET_CPUID_COUNT)
    __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
    __cpuidex(exx, 7, 0);
#endif

+    pg_cpucap[PG_CPU_FEATURE_AVX512F]  = (exx[1] & (1 << 16)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_AVX512BW]        = (exx[1] & (1 << 30)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_AVX512VPOPCNTDQ] = (exx[2] & (1 << 14)) != 0;
+
+}

What do you think?
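As a side note, the zmm_regs_available() gate mentioned above typically reduces to reading XCR0 with XGETBV after confirming OSXSAVE; here is a hedged sketch (the patch's actual implementation may differ):

```c
#include <stdbool.h>
#include <stdint.h>

/* Does the OS save/restore ZMM state? Requires XCR0 bits 1-2 (SSE/AVX)
 * and 5-7 (opmask, ZMM0-15 upper halves, ZMM16-31) all set. */
static bool
zmm_regs_available(void)
{
#if defined(__x86_64__)
	uint32_t	eax, ebx, ecx, edx;
	uint32_t	lo, hi;

	/* CPUID leaf 1: bail out unless OSXSAVE (ECX bit 27) is set,
	 * otherwise executing XGETBV would fault */
	__asm__("cpuid"
			: "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
			: "a"(1), "c"(0));
	if ((ecx & (1u << 27)) == 0)
		return false;

	/* XGETBV with ECX = 0 reads XCR0 */
	__asm__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	return ((((uint64_t) hi << 32) | lo) & 0xe6) == 0xe6;
#else
	return false;
#endif
}
```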

-#define PGCPUCAP_INIT           (1 << 0)
-#define PGCPUCAP_POPCNT         (1 << 1)
-#define PGCPUCAP_VPOPCNT        (1 << 2)
-#define PGCPUCAP_CRC32C         (1 << 3)
-#define PGCPUCAP_CLMUL          (1 << 4)
+enum pg_cpucap__
+{
+    PG_CPU_FEATURE_INIT              = 0,
+    // X86
+    PG_CPU_FEATURE_SSE42             = 1,
+    PG_CPU_FEATURE_POPCNT            = 2,
+    PG_CPU_FEATURE_PCLMUL            = 3,
[...]
+    PG_CPU_FEATURE_ARMV8_CRC32C      = 100,

I'm not a fan of exposing these architecture-specific details to
places that consult the capabilities. That requires things like

+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
[...]
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)

I'd prefer to have 1 capability <-> one place to check at runtime for
architectures that need that, and to keep architecture details private
to the initialization step. Even for things that test for which
function pointer to use, I think it's a cleaner interface to look at
one thing.

+static void
+pg_cpucap_arch()
+{
+    /* WIP: configure checks */
+#ifdef __x86_64__
+    pg_cpucap_x86();
+#else // ARM:
+    pg_cpucap_arm();
+#endif
+}

If we're going to have a single file for the init step, we don't need
this -- we'd just have a different definition of
pg_cpucap_initialize() in each part, with a default that only adds the
"init" slot:

#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) ||
  defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)

<cpuid checks>

#elif defined(__arm__) || defined(__arm) ||
   defined(__aarch64__) || defined(_M_ARM64)

<for now only pg_crc32c_armv8_available()>

#else
<only init>
#endif

--

John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Attached v10 with minor changes based on the feedback. 

> Thanks, I think this is a good direction. Some comments:
> 
> #if defined(HAVE__GET_CPUID)
> __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID)
> __cpuid(exx, 1); #endif

Oh yeah good catch. Fixed in v10. 

> +    }
> +    /* avx512 os support */
> +    if (zmm_regs_available()) {
> +        return;
> +    }
> 
> // BTW, I do like the gating here that reduces the number of places that have to
> know about zmm and xsave. (Side note: shouldn't that be "if
> !(zmm_regs_available())"?)

Yup, good catch again 😊 

> What do you think?

Yup, those look correct to me. Fixed them in v10. 

> +    PG_CPU_FEATURE_PCLMUL            = 3,
> [...]
> +    PG_CPU_FEATURE_ARMV8_CRC32C      = 100,
> 
> I'm not a fan of exposing these architecture-specific details to places that consult
> the capabilities. That requires things like
> 
> +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> [...]
> +#define PGCPUCAP_CRC32C
> pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)
> 
> I'd prefer to have 1 capability <-> one place to check at runtime for architectures
> that need that, and to keep architecture details private to the initialization step.
> Even for things that test for which function pointer to use, I think it's a cleaner
> interface to look at one thing.

Isn't that one thing currently pg_cpu_have(FEATURE)? The reason we have the additional PGCPUCAP_CRC32C is because the
dispatch for CRC32C is currently defined in the header file common to both ARM and SSE42. We should ideally have
separate dispatch for ARM and x86 in their own files (the way it is in the main branch). This also makes it easier to
add more implementations in the future without having to make the dispatch function work for both ARM and x86.
 

> If we're going to have a single file for the init step, we don't need this -- we'd just
> have a different definition of
> pg_cpucap_initialize() in each part, with a default that only adds the "init" slot:
> 
> #if defined( __i386__ ) || defined(i386) || defined(_M_IX86) ||
>   defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)
> 
> <cpuid checks>
> 
> #elif defined(__arm__) || defined(__arm) ||
>    defined(__aarch64__) || defined(_M_ARM64)
> 
> <for now only pg_crc32c_armv8_available()>
> 
> #else
> <only init>
> #endif

Makes sense. Got rid of it in v10. 

Raghuveer 

Attachment

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:

> > I'm not a fan of exposing these architecture-specific details to places that consult
> > the capabilities. That requires things like
> >
> > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > [...]
> > +#define PGCPUCAP_CRC32C
> > pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)
> >
> > I'd prefer to have 1 capability <-> one place to check at runtime for architectures
> > that need that, and to keep architecture details private to the initialization step.
> > Even for things that test for which function pointer to use, I think it's a cleaner
> > interface to look at one thing.
>
> Isn't that one thing currently pg_cpu_have(FEATURE)?

No, I mean the one thing would be "do we have hardware CRC?", and each
architecture would set (and check if needed) the same slot.

I'm not 100% sure this is the better way, and there may be a
disadvantage I haven't thought of, but this makes the most sense to
me. I'm willing to hear reasons to do it differently, but I'm thinking
those reasons need to be weighed against the loss of abstraction.

> The reason we have the additional PGCPUCAP_CRC32C is because the dispatch for CRC32C is currently defined in the
> header file common to both ARM and SSE42. We should ideally have separate dispatch for ARM and x86 in their own files
> (the way it is in the main branch).

Just so we're on the same page, the current separate files do
initialization, not dispatch. I'm seeing conflicting design goals
here:

1. Have multiple similar dispatch functions with the only difference
(for now) being certain names (not in the patch, just discussed).
2. Put initialization routines in the same file, even though they have
literally nothing in common.

> This also makes it easier to add more implementations in the future without having to make the dispatch function work
> for both ARM and x86.

I don't quite understand, since with 0001 it already works (at least
in theory) for 3 architectures with hardware CRC, plus those without,
and I don't see it changing with more architectures.

+/* Access to pg_cpucap for modules that need runtime CPUID information */
+bool pg_cpu_have(int feature_id)
+{
+    return pg_cpucap[feature_id];
 }

Putting this in a separate translation unit may have to do with the decision
to turn v8's #defines into an enum? In any case, this means doing a
function call to decide which function to call. That makes no sense.
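For contrast, a zero-call-overhead variant (hypothetical names, not the patch's actual layout) simply exposes the array and wraps it in a macro, so each capability check compiles down to a single indexed load:

```c
#include <stdbool.h>

/* Feature slots; a hypothetical subset of the patch's enum */
enum pg_cpucap_feature
{
	PG_CPU_FEATURE_SSE42,
	PG_CPU_FEATURE_PCLMUL,
	PG_CPU_FEATURE_NUM
};

/* The single definition would live in pg_cpucap.c; it is shown here to
 * keep the sketch self-contained. The header only declares it extern. */
bool		pg_cpucap[PG_CPU_FEATURE_NUM];

/* Header-level accessor: no function call, just an array load */
#define pg_cpu_have(f) (pg_cpucap[(f)])
```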

The goal of v9/10 was to centralize the initialization from cpuid etc.,
but it also did a major re-architecture of the runtime checks at the
same time. That made review more difficult.

+#if defined(__x86_64__) || defined ...
+    pg_cpucap_x86();
+#elif defined(__aarch64__) || defined ...
+    pg_cpucap_arm();
+#endif

I view this as another conflict in design goals: After putting everything
in the same file, the routines still have different names and have to
be guarded accordingly. I don't really see the advantage, especially
since these guard symbols don't even match the ones that guard the
actual initialization steps. I tried to imply that in my last review,
but maybe I should have been more explicit.

I think the least painful step is to take the x86 initialization from
v10, which is looking great, but
- keep separate initialization files
- don't whack around the runtime representation, at least not in the same patch

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

Going back to your earlier comment: 

> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)

> > Isn't that one thing currently pg_cpu_have(FEATURE)?
> 
> No, I mean the one thing would be "do we have hardware CRC?", and each
> architecture would set (and check if needed) the same slot.
> 
> I'm not 100% sure this is the better way, and there may be a disadvantage I
> haven't thought of, but this makes the most sense to me. I'm willing to hear
> reasons to do it differently, but I'm thinking those reasons need to be weighed
> against the loss of abstraction.

IMO, CPU SIMD features (SSE4.2, AVX, etc.) are a module of their own, separate from the capabilities/features supported
in postgres (CRC32C, bitcount, etc.). Exposing architecture-specific details to the features that need to check against
them is a way to make the code more modular and reusable. It is reasonable to expect developers writing SIMD-specific
functions to naturally understand the runtime architecture requirements for that function, which is easily accessible
by just checking the value of pg_cpu_have(PG_CPU_FEATURE_..). If another feature in postgres requires SSE4.2, then the
CPU initialization module doesn't need to be modified at all. They just have to worry about their feature and its CPU
requirements.
 

> Just so we're on the same page, the current separate files do initialization, not
> dispatch. I'm seeing conflicting design goals
> here:
> 
> 1. Have multiple similar dispatch functions with the only difference (for now)
> being certain names (not in the patch, just discussed).
> 2. Put initialization routines in the same file, even though they have literally
> nothing in common.

Sorry for not being clear. I will move the initialization routines to separate files. Just haven't gotten to it yet
with v10.
 

> 
> > This also makes it easier to add more implementations in the future without
> having to make the dispatch function work for both ARM and x86.
> 
> I don't quite understand, since with 0001 it already works (at least in theory) for 3
> architectures with hardware CRC, plus those without, and I don't see it changing
> with more architectures.

The reason it's working across all architectures as of now is because there is only one runtime check for the CRC32C
feature across x86, ARM and LoongArch (PCLMUL dispatch happens inside the SSE42 path). If and when we add the AVX-512
implementation, won't the dispatch logic look different for x86 and ARM? Along with that, the header file pg_crc32c.h
will also look a lot messier with a whole bunch of nested #ifdefs. IMO, the header file should be devoid of any
architecture dispatch logic and simply contain declarations of the various implementations (see
https://github.com/r-devulap/postgressql/blob/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/src/include/port/pg_crc32c.h
for example). The dispatch logic should be handled separately in a C file.
 

> 
> +/* Access to pg_cpucap for modules that need runtime CPUID information */
> +bool pg_cpu_have(int feature_id)
> +{
> +    return pg_cpucap[feature_id];
> +}
> 
> Putting this in a separate translation unit may have to do with the decision to turn v8's
> #defines into an enum? In any case, this means doing a function call to decide
> which function to call. That makes no sense.

The reason it is in a separate translation unit is because I intended pg_cpucap to be a read-only variable outside of
pg_cpucap.c. If the function call overhead is not preferred, then I can get rid of it.
 

> 
> The goal of v9/10 was to centralize the initialization from cpuid etc., but it also did
> a major re-architecture of the runtime checks at the same time. That made review
> more difficult.

I assumed we wanted to move the runtime checks to a central place and that needed this re-architecture. 

> 
> +#if defined(__x86_64__) || defined ...
> +    pg_cpucap_x86();
> +#elif defined(__aarch64__) || defined ...
> +    pg_cpucap_arm();
> +#endif
> 
> I view this as another conflict in design goals: After putting everything in the same
> file, the routines still have different names and have to be guarded accordingly. I
> don't really see the advantage, especially since these guard symbols don't even
> match the ones that guard the actual initialization steps. I tried to imply that in
> my last review, but maybe I should have been more explicit.
> 
> I think the least painful step is to take the x86 initialization from v10, which is
> looking great, but
> - keep separate initialization files
> - don't whack around the runtime representation, at least not in the same patch

Sure. I can fix that in v11. 

Raghuveer





Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
Hi Raghuveer,

You raised some interesting points, which deserve a thoughtful
response. After sleeping on it, however I came to the conclusion that
a sweeping change in runtime checks, with either of our approaches,
has downsides and unresolved questions. Perhaps we can come back to it
at a later time. For this release cycle, I took a step back and tried
to think of the least invasive way to solve the immediate problem,
which is: How to allow existing builds with "-msse4.2" to take
advantage of CLMUL while not adding overhead. Here's what I came up
with in v11:

0001: same benchmark module as before
0002: For SSE4.2 builds, arrange so that constant input uses an
inlined path so that the compiler can emit unrolled loops anywhere.
This is particularly important for the WAL insertion lock, so this is
possibly committable on its own just for that.
0003: the PCLMUL path, only for runtime-check builds
0004: the PCLMUL path for SSE4.2 builds. This uses a function pointer
for long-ish input and the same above inlined path for short input
(whether constant or not). So it gets the best of both worlds.

There is also a separate issue:

On Tue, Feb 25, 2025 at 6:05 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> Another thing I found in Agner's manuals: AMD Zen, even as recently as
> Zen 4, don't have as good a microarchitecture for PCLMUL, so if anyone
> with such a machine would like to help test the cutoff

David Rowley shared some results off-list, which are: Zen 4 is very
good with this algorithm even at 64 bytes input length, but Zen 2
regresses up to maybe 256 bytes. Having a large cutoff to cover all
bases makes this less useful, and that was one of my reservations
about AVX-512. However, with the corsix generator I found it's
possible to specify AVX-512 with a single accumulator (rather than 4),
which still gives a minimum input of 64 bytes, so I'll plan to put
something together to demonstrate.

(Older machines could use the 3-way stream algorithm as a fallback on
long inputs, as I've mentioned elsewhere, assuming that's legally
unencumbered.)

--
John Naylor
Amazon Web Services

Attachment

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote:
> 0002: For SSE4.2 builds, arrange so that constant input uses an
> inlined path so that the compiler can emit unrolled loops anywhere.
> This is particularly important for the WAL insertion lock, so this is
> possibly committable on its own just for that.

Nice.

> 0004: the PCLMUL path for SSE4.2 builds. This uses a function pointer
> for long-ish input and the same above inlined path for short input
> (whether constant or not). So it gets the best of both worlds.

I spent some time staring at pg_crc32.h with all these patches applied, and
IIUC it leads to the following behavior:

* For compiled-in SSE 4.2 builds, we branch based on the length.  For
  smaller inputs, we are using an inlined version of the SSE 4.2 code.
  For larger inputs, we call a function pointer so that we can potentially
  use the PCLMUL version.  This could potentially lead to a small
  regression for machines with SSE 4.2 but not PCLMUL, but that may be
  uncommon enough at this point to not worry about.

* For runtime-check SSE 4.2 builds, we choose slicing-by-8, SSE 4.2, or
  SSE 4.2 with PCLMUL, and we always use a function pointer.

The main question I have is whether we can simplify this by always using a
runtime check and by inlining slicing-by-8 for small inputs.  That would be
dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
for small inputs.

Overall, I wish we could avoid splitting things into separate files and
adding more header file gymnastics, but maybe there isn't much better we
can do without overhauling the CPU feature detection code.

-- 
nathan



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hi John, 

> You raised some interesting points, which deserve a thoughtful response. After
> sleeping on it, however I came to the conclusion that a sweeping change in
> runtime checks, with either of our approaches, has downsides and unresolved
> questions. Perhaps we can come back to it at a later time.

Sounds good to me. I did get derailed into something beyond the scope of this patch. I will make a separate proposal
once this is merged.
 

> Here's what I came up with in v11:

Some feedback on v11:

    if ((exx[2] & (1 << 20)) != 0)  /* SSE 4.2 */
    {
        pg_comp_crc32c = pg_comp_crc32c_sse42;
#ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
        if ((exx[2] & (1 << 1)) != 0)   /* PCLMUL */
            pg_comp_crc32c = pg_comp_crc32c_pclmul;
#endif
    }
#ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
    else
        pg_comp_crc32c = pg_comp_crc32c_sb8;
#endif

Is the #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK at the right place? Shouldn’t it guard SSE4.2 function pointer
assignment?
 

/* WIP: configure checks */
#ifdef __x86_64__
#define USE_PCLMUL_WITH_RUNTIME_CHECK
#endif

Minor consideration: gcc 4.3 (released in 2008) is the only compiler that supports -msse4.2 but not -mpclmul; gcc >=
4.4 supports both. If you are okay causing a regression on gcc 4.3, we could combine USE_PCLMUL_WITH_RUNTIME_CHECK and
USE_SSE42_CRC32C_WITH_RUNTIME_CHECK into a single macro to reduce the number of #ifdefs in the codebase and simplify
the configure/meson compiler checks.
 
 
> 0001: same benchmark module as before
> 0002: For SSE4.2 builds, arrange so that constant input uses an inlined path so
> that the compiler can emit unrolled loops anywhere.

When building with meson, it looks like we build with -O2, and that is not enough for the compiler to unroll the SSE42
CRC32C loop. It requires -O3, or -O2 with -funroll-loops (see https://gcc.godbolt.org/z/4Eaq981aT). Perhaps we should
check the disassembly to see if the unroll is really happening on constant input?
 

Also, the reason you have pg_comp_crc32c_sse42_inline defined separately in a header file is because you want to (a)
inline the function and (b) unroll for constant inputs. Couldn't both of these be achieved by adding
__attribute__((always_inline)) on the pg_comp_crc32c_sse42 function with the added -funroll-loops compiler flag?
 

Raghuveer



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> I spent some time staring at pg_crc32.h with all these patches applied, and
> IIUC it leads to the following behavior:
>
> * For compiled-in SSE 4.2 builds, we branch based on the length.  For
>   smaller inputs, we are using an inlined version of the SSE 4.2 code.
>   For larger inputs, we call a function pointer so that we can potentially
>   use the PCLMUL version.

Right. For WAL, my hope is that the inlined path would balance out the
path with the function pointer, particularly since the computation for
the 20-byte header would be both inlined and unrolled, as I see here
in XLogInsertRecord():

crc32  rax,QWORD PTR [rsi]
crc32  rax,rbx                  ; <- newly calculated xl_prev
crc32  eax,DWORD PTR [rsi+0x10]
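In C terms, that inlined constant-length path is equivalent to something like the following sketch (function name made up; it uses the SSE4.2 intrinsics, and the 20-byte header splits into 8 + 8 + 4 bytes):

```c
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>			/* SSE4.2 CRC32 intrinsics */

/* CRC-32C over a constant 20-byte record header. With the length known
 * at compile time, the compiler emits straight-line crc32 instructions
 * like the disassembly above instead of a loop. */
__attribute__((target("sse4.2")))
static uint32_t
crc32c_header20(uint32_t crc, const uint8_t *p)
{
	uint64_t	c = crc;
	uint64_t	v64;
	uint32_t	v32;

	memcpy(&v64, p, 8);			/* memcpy avoids unaligned-access UB */
	c = _mm_crc32_u64(c, v64);
	memcpy(&v64, p + 8, 8);
	c = _mm_crc32_u64(c, v64);
	memcpy(&v32, p + 16, 4);
	return _mm_crc32_u32((uint32_t) c, v32);
}
```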


>  This could potentially lead to a small
>   regression for machines with SSE 4.2 but not PCLMUL, but that may be
>   uncommon enough at this point to not worry about.

Note also upthread I mentioned we may have to go to 512-bit pclmul,
since Zen 2 regresses on 128-bit. :-(

I actually haven't seen any measurable difference with direct calls
versus indirect, but it could very well be that the microbenchmark is
hiding that since it's doing something unnatural by calling things a
bunch of times in a loop. I want to try changing the benchmark to base
the address it's computing on some bits from the crc from the last
loop iteration. I think that would make it more latency-sensitive. We
could also make it do an additional constant 20-byte input every time
to make it resemble WAL more closely.
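That change can be sketched as follows (a hypothetical harness, not the posted test_crc32c module; the slow byte-wise CRC stands in for whichever pg_comp_crc32c implementation is under test). Deriving the next offset from the previous CRC serializes the iterations, so the loop measures latency rather than throughput:

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise CRC-32C (reflected, polynomial 0x82F63B78) */
static uint32_t
crc32c_bytes(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--)
	{
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82f63b78 & (0u - (crc & 1)));
	}
	return crc;
}

/* Latency-sensitive loop: the next chunk's address depends on the
 * previous CRC value, so the CPU cannot overlap iterations. */
static uint32_t
bench_chained(const uint8_t *buf, size_t bufsize, size_t chunk, int iters)
{
	uint32_t	crc = 0xffffffff;
	size_t		off = 0;

	for (int i = 0; i < iters; i++)
	{
		crc = crc32c_bytes(crc, buf + off, chunk);
		off = crc % (bufsize - chunk);	/* data-dependent next address */
	}
	return crc ^ 0xffffffff;
}
```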

> * For runtime-check SSE 4.2 builds, we choose slicing-by-8, SSE 4.2, or
>   SSE 4.2 with PCLMUL, and we always use a function pointer.
>
> The main question I have is whether we can simplify this by always using a
> runtime check and by inlining slicing-by-8 for small inputs.  That would be
> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
> for small inputs.

Slicing-by-8 needs one lookup and one XOR per byte of input, and other
overheads, so I think it would still be very slow.

> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.

Yeah, it seems all ideas so far have something unattractive about them. :-(

--
John Naylor
Amazon Web Services



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> Some feedback on v11:
>
>     if ((exx[2] & (1 << 20)) != 0)  /* SSE 4.2 */
>     {
>         pg_comp_crc32c = pg_comp_crc32c_sse42;
> #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
>         if ((exx[2] & (1 << 1)) != 0)   /* PCLMUL */
>             pg_comp_crc32c = pg_comp_crc32c_pclmul;
> #endif
>     }
> #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
>     else
>         pg_comp_crc32c = pg_comp_crc32c_sb8;
> #endif
>
> Is the #ifdef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK at the right place? Shouldn’t it guard SSE4.2 function pointer
> assignment?

Without this guard, SSE builds will complain during link time of an
undefined reference to pg_comp_crc32c_sb8. We could instead just build
that file everywhere for simplicity, but it'll just take up space and
never get called. (Maybe that's okay because with the
runtime-branching approach we experimented with earlier, it would
always have to be built.)

> When building with meson, it looks like we build with -O2, and that is not enough for the compiler to unroll the SSE42
> CRC32C loop. It requires -O3, or -O2 with -funroll-loops (see https://gcc.godbolt.org/z/4Eaq981aT). Perhaps we should
> check the disassembly to see if the unroll is really happening on constant input?

That example uses 32 bytes -- fiddling with it a bit, for 31 or less
it gets unrolled. The case we care about most is currently 20 bytes.

We don't want to force unrolling loops in the general case -- that's
normally used for longer input and this is for short input.

> Also, the reason you have pg_comp_crc32c_sse42_inline defined separately in a header file is because you want to (a)
> inline the function and (b) unroll for constant inputs. Couldn't both of these be achieved by adding
> __attribute__((always_inline)) on the pg_comp_crc32c_sse42 function with the added -funroll-loops compiler flag?

And (c) prevent it from being inlined in builds that need a runtime check.

I briefly tried the attribute approach and it doesn't work for me. If
you can get it to work, go ahead and share how that's done, but keep
in mind that we're not gcc/clang only -- it also has to work for
MSVC's "__forceinline"...
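For reference, that portability wrinkle is usually papered over with a macro along these lines (a sketch; PostgreSQL's real attribute macros live in c.h and may differ):

```c
/* Compiler-portable force-inline annotation (sketch) */
#if defined(_MSC_VER)
#define pg_force_inline static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define pg_force_inline static inline __attribute__((always_inline))
#else
#define pg_force_inline static inline
#endif

pg_force_inline int
add_small(int a, int b)
{
	return a + b;				/* trivially inlinable demo body */
}
```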

--
John Naylor
Amazon Web Services



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
> 
> Note also upthread I mentioned we may have to go to 512-bit pclmul,
> since Zen 2 regresses on 128-bit. :-(

Ah, okay.  You mean the AVX-512 version [0]?  And are you thinking we'd use
the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
SSE4.2 version for small inputs and use a function pointer for larger ones?

> I actually haven't seen any measurable difference with direct calls
> versus indirect, but it could very well be that the microbenchmark is
> hiding that since it's doing something unnatural by calling things a
> bunch of times in a loop. I want to try changing the benchmark to base
> the address it's computing on some bits from the crc from the last
> loop iteration. I think that would make it more latency-sensitive. We
> could also make it do an additional constant 20-byte input every time
> to make it resemble WAL more closely.

Looking back on some old benchmarks for small-ish inputs [1], the
difference does seem within the noise range.  I suppose these functions
might be expensive enough to make the function pointer overhead negligible.
IME there's a big difference when a function pointer is used for an
instruction or two [2], but even relatively small inputs to the CRC-32C
functions might require several instructions.

>> The main question I have is whether we can simplify this by always using a
>> runtime check and by inlining slicing-by-8 for small inputs.  That would be
>> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
>> for small inputs.
> 
> Slicing-by-8 needs one lookup and one XOR per byte of input, and other
> overheads, so I think it would still be very slow.

That's my suspicion, too.

[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com
[1] https://postgr.es/m/20231031033601.GA68409%40nathanxps13
[2] https://postgr.es/m/CAApHDvqyMNGVgwpaOPtENdq5uEMR%3DvSkRJEgG1S%2BX7Vtk1-EnA%40mail.gmail.com

-- 
nathan



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
> >> worry about.
> >
> > Note also upthread I mentioned we may have to go to 512-bit pclmul,
> > since Zen 2 regresses on 128-bit. :-(
>
> Ah, okay.  You mean the AVX-512 version [0]?

Right, except not that version, rather a more efficient way and with
only one accumulator, so still a minimum length of 64 bytes. I'll
share that once we have agreement on detection/dispatch.

> And are you thinking we'd use
> the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
> SSE4.2 version for small inputs and use a function pointer for larger ones?

Yes, although we may not even have to inline for non-constant input;
see below. Inlining loops does take binary space.

> > I actually haven't seen any measurable difference with direct calls
> > versus indirect, but it could very well be that the microbenchmark is
> > hiding that since it's doing something unnatural by calling things a
> > bunch of times in a loop. I want to try changing the benchmark to base
> > the address it's computing on some bits from the crc from the last
> > loop iteration. I think that would make it more latency-sensitive. We
> > could also make it do an additional constant 20-byte input every time
> > to make it resemble WAL more closely.
>
> Looking back on some old benchmarks for small-ish inputs [0], the
> difference does seem within the noise range.  I suppose these functions
> might be expensive enough to make the function pointer overhead negligible.
> IME there's a big difference when a function pointer is used for an
> instruction or two [2], but even relatively small inputs to the CRC-32C
> functions might require several instructions.

That was my hunch too, but I wanted to be more sure, so I modified the
benchmark so it doesn't know the address of the next calculation until
it finishes the last calculation so we can hopefully see the latency
caused by indirection. It also does an additional calculation on
constant 20 bytes, like the WAL header. I also tweaked the length each
iteration so the branch predictor maybe has a harder time predicting
the constant 20 input. And to make it more challenging, I removed the
part that inlined all small inputs, so it inlines only constant
inputs:

0001+0002 (test only)

func pointer:

32
latency average = 24.021 ms
latency average = 24.020 ms
latency average = 23.733 ms
40
latency average = 25.018 ms
latency average = 24.253 ms
latency average = 24.278 ms
48
latency average = 25.437 ms
latency average = 24.817 ms
latency average = 24.793 ms

SSE4.2 build (direct func):

32
latency average = 22.422 ms
latency average = 22.387 ms
latency average = 22.391 ms
40
latency average = 23.444 ms
latency average = 22.887 ms
latency average = 22.988 ms
48
latency average = 23.432 ms
latency average = 23.380 ms
latency average = 23.384 ms

0001-0006
SSE 4.2 build (inlined constant / otherwise func pointer)

32
latency average = 22.135 ms
latency average = 21.874 ms
latency average = 21.910 ms
40
latency average = 22.916 ms
latency average = 23.086 ms
latency average = 22.422 ms
48
latency average = 23.255 ms
latency average = 22.780 ms
latency average = 22.804 ms

These are still a bit noisy, and close, but it seems there is no
penalty in using the function pointer as long as the header
calculation is inlined.

--
John Naylor
Amazon Web Services



Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> That was my hunch too, but I wanted to be more sure, so I modified the
> benchmark so it doesn't know the address of the next calculation until
> it finishes the last calculation so we can hopefully see the latency
> caused by indirection. It also does an additional calculation on
> constant 20 bytes, like the WAL header. I also tweaked the length each
> iteration so the branch predictor maybe has a harder time predicting
> the constant 20 input. And to make it more challenging, I removed the
> part that inlined all small inputs, so it inlines only constant
> inputs:

Would you mind sharing this test?  It sounds like you are running a
workload with a mix of constant/inlined calls and function pointer calls to
simulate typical usage for WAL, but I'm not 100% sure I'm understanding you
correctly.

> These are still a bit noisy, and close, but it seems there is no
> penalty in using the function pointer as long as the header
> calculation is inlined.

These results look promising.

-- 
nathan



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> > That was my hunch too, but I wanted to be more sure, so I modified the
> > benchmark so it doesn't know the address of the next calculation until
> > it finishes the last calculation so we can hopefully see the latency
> > caused by indirection. It also does an additional calculation on
> > constant 20 bytes, like the WAL header. I also tweaked the length each
> > iteration so the branch predictor maybe has a harder time predicting
> > the constant 20 input. And to make it more challenging, I removed the
> > part that inlined all small inputs, so it inlines only constant
> > inputs:
>
> Would you mind sharing this test?

The test script is the same as here, except I only ran small lengths:

https://www.postgresql.org/message-id/CANWCAZahvhE-%2BhtZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ%40mail.gmail.com

...but I must have forgotten to attach the slightly tweaked patch set,
which I've done now. 0002 modifies the 0001 test module and 0006
reverts inlining non-constant input from 0005, just to see if I could
find a regression from indirection, which I didn't. If we don't need
it, it'd be better to avoid inlining loops to keep from bloating the
binary.

> It sounds like you are running a
> workload with a mix of constant/inlined calls and function pointer calls to
> simulate typical usage for WAL, but I'm not 100% sure I'm understanding you
> correctly.

Exactly.

--
John Naylor
Amazon Web Services

Attachment

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.

I wanted to make an attempt to make this aspect nicer. v13-0002
incorporates deliberately compact and simple loops for inlined
constant input into the dispatch function, and leaves the existing
code alone. This avoids code churn and saves vertical space in the
copied code. It needs a bit more commentary, but I hope this is a more
digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
be simpler if we can always assume non-constant input can go through a
function pointer.

I've re-attached the modified perf test from v12 just in case anyone
wants to play with it (v13-0003), but named so that the CF bot can't
find it, since it breaks the tests in the original perf test (it's
not for commit anyway).

Adding back AVX-512 should be fairly mechanical, since Raghuveer and
Nathan have already done the work needed for that.

--
John Naylor
Amazon Web Services

Attachment

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Overall, I wish we could avoid splitting things into separate files and
>> adding more header file gymnastics, but maybe there isn't much better we
>> can do without overhauling the CPU feature detection code.
> 
> I wanted to make an attempt to make this aspect nicer. v13-0002
> incorporates deliberately compact and simple loops for inlined
> constant input into the dispatch function, and leaves the existing
> code alone. This avoids code churn and saves vertical space in the
> copied code. It needs a bit more commentary, but I hope this is a more
> digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
> be simpler if we can always assume non-constant input can go through a
> function pointer.

That is certainly more readable.  FWIW I think it would be entirely
reasonable to replace the pg_crc32c_sse42.c implementation with a call to
this new pg_comp_crc32c_dispatch() function.  Of course, you'd have to
split things up like:

    pg_attribute_no_sanitize_alignment()
    static inline
    pg_crc32c
    pg_comp_crc32c_sse42_inline(pg_crc32c crc, const void *data, size_t len)
    {
        const unsigned char *p = data;

    #ifdef __x86_64__
        for (; len >= 8; p += 8, len -= 8)
            crc = _mm_crc32_u64(crc, *(const uint64 *) p);
    #endif
        for (; len >= 4; p += 4, len -= 4)
            crc = _mm_crc32_u32(crc, *(const uint32 *) p);
        for (; len > 0; len--)
            crc = _mm_crc32_u8(crc, *p++);
        return crc;
    }

    pg_attribute_no_sanitize_alignment()
    static inline
    pg_crc32c
    pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
    {
        if (__builtin_constant_p(len))
            return pg_comp_crc32c_sse42_inline(crc, data, len);
        return pg_comp_crc32c_sse42(crc, data, len);
    }

And then in pg_crc32c_sse42.c:

    pg_attribute_no_sanitize_alignment()
    pg_attribute_target("sse4.2")
    pg_crc32c
    pg_comp_crc32c_sse42(pg_crc32c crc, const void *data, size_t len)
    {
        return pg_comp_crc32c_sse42_inline(crc, data, len);
    }

Granted, that adds back some of the complexity you were trying to avoid
(and is very similar to your v12 patches), but it's pretty compact and
avoids some code duplication.  FTR I don't feel too strongly about this,
but on balance I guess I'd be okay with a tad more complexity than your
v13 patches if it served a useful purpose.

-- 
nathan



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> Overall, I wish we could avoid splitting things into separate files and
> >> adding more header file gymnastics, but maybe there isn't much better we
> >> can do without overhauling the CPU feature detection code.
> >
> > I wanted to make an attempt to make this aspect nicer. v13-0002
> > incorporates deliberately compact and simple loops for inlined
> > constant input into the dispatch function, and leaves the existing
> > code alone. This avoids code churn and saves vertical space in the
> > copied code. It needs a bit more commentary, but I hope this is a more
> > digestible prerequisite to the CLMUL algorithm -- as a reminder, it'll
> > be simpler if we can always assume non-constant input can go through a
> > function pointer.
>
> That is certainly more readable.  FWIW I think it would be entirely
> reasonable to replace the pg_crc32c_sse42.c implementation with a call to
> this new pg_comp_crc32c_dispatch() function.  Of course, you'd have to
> split things up like:
> [snip]

That could work as well. I'm thinking that if we do PMULL on Arm, it
might be advantageous to keep the inline path and the function path
coded differently -- because of the pickier alignment on that
platform, it might not be worth pre-aligning the pointer to 8 bytes
for a 20-byte constant input.
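The pre-alignment in question is the usual prologue before the wide loads, something like the sketch below (a software byte step stands in for the hardware instruction, and the name is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* bitwise CRC-32C byte step, standing in for a hardware instruction */
static uint32_t
crc32c_byte(uint32_t crc, unsigned char b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    return crc;
}

static uint32_t
crc32c_aligned(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /*
     * Alignment prologue: consume bytes one at a time until p is 8-byte
     * aligned.  For a 20-byte constant input this can eat up to 7 of the
     * 20 bytes before the wide loop even starts, which is the cost being
     * weighed for Arm.
     */
    while (len > 0 && ((uintptr_t) p & 7) != 0)
    {
        crc = crc32c_byte(crc, *p++);
        len--;
    }

    /* an aligned 8-byte-at-a-time loop would go here; finish bytewise */
    while (len-- > 0)
        crc = crc32c_byte(crc, *p++);
    return crc;
}
```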

I've gone ahead and added the generated AVX-512 algorithm in v14-0005,
and added the build support and some of the runtime support from Paul
and Raghuveer's earlier patches in 0006-7. It passes CI, but I'll have
to arrange access to other hardware to verify the runtime behavior. I
think the Meson support is most of the way there, but it looks like
configure.ac got whacked around cosmetically quite a bit. If we feel
it's time to refactor things there, we'll want to split that out. In
any case, for autoconf I've pretty much kept the earlier work for now.

--
John Naylor
Amazon Web Services

Attachment

RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005

+ pg_attribute_target("avx512vl,vpclmulqdq")
+ pg_crc32c
+ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)

I'm a little confused. Why is v14 missing the earlier version of the pclmul implementation that works without AVX-512?

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > I've gone ahead and added the generated AVX-512 algorithm in v14-0005

> I'm a little confused. Why is v14 missing the earlier version of the pclmul implementation that works without AVX-512?

As I mentioned upthread, the 128-bit implementation regresses on Zen 2
up to at least 256 bytes.

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
 > > I've gone ahead and added the generated AVX-512 algorithm in
 > > v14-0005

Here are my benchmark numbers (master vs. v14) on TGL (time in ms):

Buf-size-bytes  Master  v14
64      9.547   6.095
80      10.999  6.248
96      12.443  7.756
112     14.129  9.62
128     15.295  6.534
144     16.756  8.226
160     18.18   9.656
176     19.836  11.556
192     22.038  8.307
208     24.303  9.692
224     26.656  11.115
240     28.981  13.119
256     31.399  10.035

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
>  > > I've gone ahead and added the generated AVX-512 algorithm in
>  > > v14-0005
>
> Here are my benchmark numbers (master vs. v14) on TGL (time in ms):
>
> Buf-size-bytes  Master  v14
> 64      9.547   6.095
...
> 256     31.399  10.035

Thanks for testing! Looks good. I'll take a look at the configure
checks soon, since I had some questions there.

--
John Naylor
Amazon Web Services



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Mon, Mar 24, 2025 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
> I'll take a look at the configure
> checks soon, since I had some questions there.

I'm leaning towards a length limit for v15-0001 so that inlined
instructions are likely to be unrolled. Aside from lack of commit
message, I think that one is ready for commit soon-ish.

I'm feeling pretty good about 0002, but since there is still room for
cosmetic fiddling, I want to let it sit for a bit longer.

I felt the previous proposals for configure.ac were unnecessarily
invasive, and the message looked out of place, so I made configure.ac
more similar to master, using the AVX popcount stuff as a model. I
also went the extra step and added a separate AC_MSG_CHECKING for
vectorized CRC. I'm not sure we really need that, but this algorithm
is trivially adaptable to Arm, so it might be welcome for visibility.

For Meson, I just made the CRC checking comment a bit more general,
since keeping up this level of detail would result in a loss of
readability.

0003 is just to demonstrate on CI that we are in fact computing the
same answer as master. An earlier patch had some additional tests in
strings.sql but I have yet to dig those out.

--
John Naylor
Amazon Web Services

Attachment

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Mon, Mar 24, 2025 at 6:37 PM John Naylor <johncnaylorls@gmail.com> wrote:
>
> I'll take a look at the configure
> checks soon, since I had some questions there.

One other thing I forgot to mention: The previous test function had
local constants that the compiler was able to fold, resulting in no
actual vector instructions being emitted:

movabs  rdx, 12884901891
xor     eax, eax
crc32   rax, rdx
crc32   rax, rdx
ret

That may be okay for practical purposes, but in the spirit of commit
fdb5dd6331e30 I changed it in v15 to use global variables and made
sure it emits what the function attributes are intended for:

vmovdqu64       zmm3, ZMMWORD PTR x[rip]
xor     eax, eax
vpclmulqdq      zmm0, zmm3, ZMMWORD PTR y[rip], 0
vextracti32x4   xmm2, zmm0, 1
vmovdqa64       xmm1, xmm0
vmovdqu64       ZMMWORD PTR y[rip], zmm0
vextracti32x4   xmm0, zmm0, 2
vpternlogq      xmm1, xmm2, xmm0, 150
vmovq   rdx, xmm1
crc32   rax, rdx
vzeroupper
ret
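A configure-style probe along these lines (a sketch, not the exact committed test; the intrinsics mirror the instructions shown above) would be:

```c
#include <immintrin.h>
#include <stdint.h>

/* globals, so the compiler cannot constant-fold the intrinsics away */
static __m512i x, y;

__attribute__((target("sse4.2,avx512vl,vpclmulqdq")))
static uint64_t
avx512_crc_probe(void)
{
    __m512i a = _mm512_clmulepi64_epi128(x, y, 0);
    __m128i b;

    y = a;                      /* store the wide result to keep it live */
    /* fold three 128-bit lanes with a three-way XOR (vpternlogq, imm 0x96) */
    b = _mm_ternarylogic_epi64(_mm512_castsi512_si128(a),
                               _mm512_extracti32x4_epi32(a, 1),
                               _mm512_extracti32x4_epi32(a, 2),
                               0x96);
    return _mm_crc32_u64(0, (uint64_t) _mm_cvtsi128_si64(b));
}

int
main(void)
{
    /* a probe like this only needs to compile and link, not run */
    (void) avx512_crc_probe;
    return 0;
}
```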

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
Hello John, 

v15 LGTM. Couple of minor comments: 
 
> I'm leaning towards a length limit for v15-0001 so that inlined instructions are
> likely to be unrolled. Aside from lack of commit message, I think that one is ready
> for commit soon-ish.

+1 

> I'm feeling pretty good about 0002, but since there is still room for cosmetic
> fiddling, I want to let it sit for a bit longer.

(1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of pg_popcount_avx512.c, but perhaps that’s fine
for now. I will propose a consolidated SIMD runtime check in a follow-up patch.
 
(2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c  since they contain both sse42 and avx512 versions.

> I felt the previous proposals for configure.ac were unnecessarily invasive, and the
> message looked out of place, so I made configure.ac more similar to master, using
> the AVX popcount stuff as a model. 

Looks good to me. 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> Hello John,
>
> v15 LGTM. Couple of minor comments:
>
> > I'm leaning towards a length limit for v15-0001 so that inlined instructions are
> > likely to be unrolled. Aside from lack of commit message, I think that one is ready
> > for commit soon-ish.
>
> +1

Thanks for looking! This has been committed.

> > I'm feeling pretty good about 0002, but since there is still room for cosmetic
> > fiddling, I want to let it sit for a bit longer.
>
> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of pg_popcount_avx512.c but perhaps that’s fine
> for now. I will propose a consolidated SIMD runtime check in a follow-up patch.

Yeah, I was thinking a small amount of duplication is tolerable.

As for a possible better abstraction for the next cycle, I went
looking and found this, which has some features we've talked about:

https://zolk3ri.name/cgit/cpudetect/tree/cpudetect.h

> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c  since they contain both sse42 and avx512 versions.

The name is now not quite accurate, but it's not exactly misleading
either. I'm leaning towards keeping it the same, so for now I've just
updated the header comment.

For v16, I made another pass through and made some more, mostly
superficial, adjustments:
- copied rewritten Meson comment to configure.ac
- added some more #include guards out of paranoia
- added tests with longer lengths that exercise the new paths
- adjusted configure / meson messages for consistency
- changed not-quite-accurate wording about "AVX-512 CRC instructions"
- used "PCLMUL" only when talking about specific intrinsics and
preferred "AVX-512" elsewhere, to head off potential future confusion
with Arm PMULL.


--
John Naylor
Amazon Web Services

Attachment

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:
>> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
>> pg_popcount_avx512.c but perhaps that’s fine for now. I will propose a
>> consolidated SIMD runtime check in a follow up patch.
> 
> Yeah, I was thinking a small amount of duplication is tolerable.

+1.  This is easy enough to change in the future if/when we add some sort
of centralized CPU feature detection.

>> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c  since
>> they contain both sse42 and avx512 versions.
> 
> The name is now not quite accurate, but it's not exactly misleading
> either. I'm leaning towards keeping it the same, so for now I've just
> updated the header comment.

I'm not too worried about this one either.  FWIW I'm likely going to look
into moving all the x86_64 popcount stuff into pg_popcount_avx512.c and
renaming it to pg_popcount_x86_64.c for v19.  This would parallel
pg_popcount_aarch64.c a bit better, and a file per architecture seems like
a logical way to neatly organize things.

> For v16, I made another pass through and made some more, mostly
> superficial, adjustments:
> - copied rewritten Meson comment to configure.ac
> - added some more #include guards out of paranoia
> - added tests with longer lengths that exercise the new paths
> - adjusted configure / meson messages for consistency
> - changed not-quite-accurate wording about "AVX-512 CRC instructions"
> - used "PCLMUL" only when talking about specific intrinsics and
> preferred "AVX-512" elsewhere, to head off potential future confusion
> with Arm PMULL.

I read through the code a couple of times and nothing stood out to me.

-- 
nathan



Re: Improve CRC32C performance on SSE4.2

From
John Naylor
Date:
On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:

> >> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c  since
> >> they contain both sse42 and avx512 versions.
> >
> > The name is now not quite accurate, but it's not exactly misleading
> > either. I'm leaning towards keeping it the same, so for now I've just
> > updated the header comment.
>
> I'm not too worried about this one either.  FWIW I'm likely going to look
> into moving all the x86_64 popcount stuff into pg_popcount_avx512.c and
> renaming it to pg_popcount_x86_64.c for v19.  This would parallel
> pg_popcount_aarch64.c a bit better, and a file per architecture seems like
> a logical way to neatly organize things.

Seems like a good idea.

> I read through the code a couple of times and nothing stood out to me.

Thanks for looking, I plan to commit this over the weekend unless
there are objections.

--
John Naylor
Amazon Web Services



RE: Improve CRC32C performance on SSE4.2

From
"Devulapalli, Raghuveer"
Date:
> Thanks for looking, I plan to commit this over the weekend unless there are
> objections.

LGTM. 

Raghuveer

Re: Improve CRC32C performance on SSE4.2

From
Nathan Bossart
Date:
On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote:
> Thanks for looking, I plan to commit this over the weekend unless
> there are objections.

I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but
everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the
"32C" included).  I tested the v16 patches (with the macro fixed and
assertions enabled) on a machine with AVX-512 (verified with some extra
debug logging), and everything passed.

-- 
nathan