RE: Improve CRC32C performance on SSE4.2 - Mailing list pgsql-hackers

From Devulapalli, Raghuveer
Subject RE: Improve CRC32C performance on SSE4.2
Date
Msg-id PH8PR11MB8286F844321BA1DEEC518348FBFD2@PH8PR11MB8286.namprd11.prod.outlook.com
Whole thread Raw
Responses Re: Improve CRC32C performance on SSE4.2
List pgsql-hackers

Hi John,

 

> I'm highly suspicious of these numbers because they show this version

> is about 20x faster than "scalar", so relatively speaking 3x faster

> than the AVX-512 proposal?

 

Apologies for this. I was suspicious of this too and looks like I had unintentionally set the scalar version I wrote  for testing CRC32C correctness (which computes crc one byte at time). The code for microbenchmarking is all here: https://github.com/r-devulap/crc32c and this commit https://github.com/r-devulap/crc32c/commit/ca4d1b24fd8af87aab544fb1634523b6657325a0 fixes that. Rerunning the benchmarks gives me more sensible numbers:

 

Comparing scalar_crc32c to sse42_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.0972         -0.0971             5             4             5             4

[scalar_crc32c vs. sse42_crc32c]/128                 -0.3048         -0.3048             8             6             8             6

[scalar_crc32c vs. sse42_crc32c]/256                 -0.4610         -0.4610            19            10            19            10

[scalar_crc32c vs. sse42_crc32c]/512                 -0.6432         -0.6432            50            18            50            18

[scalar_crc32c vs. sse42_crc32c]/1024                -0.7192         -0.7192           121            34           121            34

[scalar_crc32c vs. sse42_crc32c]/2048                -0.7275         -0.7276           259            70           259            70

 

> Luckily, that's easily fixable: It turns out the implementation

> in the paper (and chromium) has a very inefficient finalization step,

>  using more carryless multiplications and plenty of other operations.

> After the main loop, and at the end, it's much more efficient to

> convert the 128-bit intermediate result directly into a CRC in the

> usual way.

 

Thank you for pointing this out and also fixing it! This improves over the chromium version by 10-25% especially for with smaller byte size 64 – 512 bytes:

 

Comparing sse42_crc32c to corsix_crc32c (from ./bench)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[sse42_crc32c vs. corsix_crc32c]/64                  -0.2696         -0.2696             4             3             4             3

[sse42_crc32c vs. corsix_crc32c]/128                 -0.1551         -0.1552             6             5             6             5

[sse42_crc32c vs. corsix_crc32c]/256                 -0.1787         -0.1787            10             8            10             8

[sse42_crc32c vs. corsix_crc32c]/512                 -0.1351         -0.1351            18            15            18            15

[sse42_crc32c vs. corsix_crc32c]/1024                -0.0972         -0.0972            34            31            34            31

[sse42_crc32c vs. corsix_crc32c]/2048                -0.0763         -0.0763            69            64            69            64

OVERALL_GEOMEAN                                      -0.1544         -0.1544             0             0             0             0

 

> I generated a similar function for v2-0004 and this benchmark shows it's faster than master on 128 bytes and above.

 

I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. I think the reason is that postgres is not using -O3 flag build the crc32c source files and the compiler generates less than optimal code. Adding that flag fixes the regression for buffers with 64 bytes – 128 bytes. Could you confirm that behavior on your end too?

 

---

src/port/pg_crc32c_sse42.c | 1 +

1 file changed, 1 insertion(+)

 

diff --git a/src/port/pg_crc32c_sse42.c b/src/port/pg_crc32c_sse42.c

index a8c1e5609b..a350b1b93a 100644

--- a/src/port/pg_crc32c_sse42.c

+++ b/src/port/pg_crc32c_sse42.c

@@ -81,6 +81,7 @@ pg_comp_crc32c_sse42_tail(pg_crc32c crc, const void *data, size_t len)

#define clmul_lo(a, b) (_mm_clmulepi64_si128((a), (b), 0))

#define clmul_hi(a, b) (_mm_clmulepi64_si128((a), (b), 17))

pg_attribute_no_sanitize_alignment()

+__attribute__((optimize("-O3")))

pg_attribute_target("sse4.2,pclmul")

pg_crc32c

pg_comp_crc32c_sse42(pg_crc32c crc0, const void *data, size_t len)

--

 

You could also just build with export CFLAGS="-O3" instead of adding the function attribute.

 

> I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.

 

May I ask which processor does you older machine have? I am benchmarking on a Tigerlake processor.

 

> It's probably okay to fold these together in the same compile-time

> check, since both are fairly old by now, but for those following

> along, pclmul is not in SSE 4.2 and is a bit newer. So this would

> cause machines building on Nehalem (2008) to fail the check and go

> back to slicing-by-8 with it written this way.

 

Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to crash with segill on Nehalam (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pcmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly.  Let me know which one you would prefer.

 

Raghuveer

 

From: Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash <akash.shankaran@intel.com>; Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com>
Subject: Improve CRC32C performance on SSE4.2

 

This patch improves the performance of SSE42 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in  this paper uses the pclmulqdq instruction and processing 64 bytes at time. The algorithm is based on sse42 version of crc32 computation from Chromium’s copy of zlib with modified constants for crc32c computation. See:

 

https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/zlib/crc32_simd.c

 

Microbenchmarks (generated with google benchmark using a standalone version of the same algorithms):

 

Comparing scalar_crc32c to sse42_crc32c (for various buffer sizes: 64, 128, 256, 512, 1024, 2048 bytes)

Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New

------------------------------------------------------------------------------------------------------------------------------------

[scalar_crc32c vs. sse42_crc32c]/64                  -0.8147         -0.8148            33             6            33             6

[scalar_crc32c vs. sse42_crc32c]/128                -0.8962         -0.8962            88             9            88             9

[scalar_crc32c vs. sse42_crc32c]/256                -0.9200         -0.9200           211            17           211            17

[scalar_crc32c vs. sse42_crc32c]/512                -0.9389         -0.9389           486            30           486            30

[scalar_crc32c vs. sse42_crc32c]/1024              -0.9452         -0.9452          1037            57          1037            57

[scalar_crc32c vs. sse42_crc32c]/2048             -0.9456         -0.9456          2140           116          2140           116

 

Raghuveer

pgsql-hackers by date:

Previous
From: Sami Imseih
Date:
Subject: Re: Proposal to CREATE FOREIGN TABLE LIKE
Next
From: "David G. Johnston"
Date:
Subject: Re: describe special values in GUC descriptions more consistently