Re: Improve CRC32C performance on SSE4.2 - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Improve CRC32C performance on SSE4.2
Date
Msg-id Z8c6Cfp-XIiJtGB5@nathan
In response to Re: Improve CRC32C performance on SSE4.2  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: Improve CRC32C performance on SSE4.2
List pgsql-hackers
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
> 
> Note also upthread I mentioned we may have to go to 512-bit pclmul,
> since Zen 2 regresses on 128-bit. :-(

Ah, okay.  You mean the AVX-512 version [0]?  And are you thinking we'd use
the same strategy for the compiled-in-SSE4.2 builds, i.e., inline the
SSE4.2 version for small inputs and use a function pointer for larger ones?
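To make the question concrete, here is a rough sketch of that dispatch shape. It is not taken from any posted patch: the names (crc32c_update, crc32c_large, CRC_INLINE_THRESHOLD) and the cutoff of 64 bytes are made up for illustration, and a portable bitwise loop stands in for both the inlined SSE4.2 path and the PCLMUL/AVX-512 routine so that it compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff; a real value would have to come from benchmarking. */
#define CRC_INLINE_THRESHOLD 64

/*
 * Portable bitwise CRC-32C update (reflected polynomial 0x82F63B78).
 * Stand-in for the SSE4.2 crc32 instruction loop that would be inlined.
 */
static inline uint32_t
crc32c_bitwise(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Stand-in for the routine used on large inputs (e.g., a PCLMUL variant). */
static uint32_t
crc32c_large_standin(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Function pointer; a real build would assign it after a runtime CPU check. */
static uint32_t (*crc32c_large) (uint32_t, const void *, size_t) =
    crc32c_large_standin;

/* Small inputs take the inlined path, larger ones pay for an indirect call. */
static inline uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    if (len < CRC_INLINE_THRESHOLD)
        return crc32c_bitwise(crc, data, len);
    return crc32c_large(crc, data, len);
}

/* Convenience wrapper: initialize, update, and finalize in one call. */
static inline uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

With this shape, only callers passing at least CRC_INLINE_THRESHOLD bytes ever go through the pointer, so the indirect-call cost is amortized over a longer computation.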

> I actually haven't seen any measurable difference with direct calls
> versus indirect, but it could very well be that the microbenchmark is
> hiding that since it's doing something unnatural by calling things a
> bunch of times in a loop. I want to try changing the benchmark to base
> the address it's computing on some bits from the crc from the last
> loop iteration. I think that would make it more latency-sensitive. We
> could also make it do an additional constant 20-byte input every time
> to make it resemble WAL more closely.

Looking back on some old benchmarks for small-ish inputs [1], the
difference does seem within the noise range.  I suppose these functions
might be expensive enough to make the function pointer overhead negligible.
IME there's a big difference when a function pointer is used for an
instruction or two [2], but even relatively small inputs to the CRC-32C
functions might require several instructions.
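For reference, the benchmark change you describe might look roughly like the sketch below. Everything in it is an assumption for illustration (bench_dependent, the window size, the all-zero 20-byte header, the static buffer), and a bitwise CRC-32C stands in for whichever implementation is under test. The point is only that each iteration's input address depends on the previous result, so iterations cannot overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BENCH_WINDOW   1024     /* power of two, so CRC bits can index it */
#define BENCH_MAX_LEN  256      /* largest payload length supported here */
#define WAL_HEADER_LEN 20       /* the constant 20-byte input */

/* Hypothetical input buffer; a real benchmark would fill it with random data. */
static unsigned char bench_buf[BENCH_WINDOW + BENCH_MAX_LEN];
static const unsigned char bench_header[WAL_HEADER_LEN] = {0};

/* Bitwise CRC-32C update, standing in for the function being measured. */
static uint32_t
crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len-- > 0)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Each iteration picks its start offset from the low bits of the previous
 * CRC, creating a data dependency between iterations.  The final CRC is
 * returned so the compiler cannot discard the loop.
 */
static uint32_t
bench_dependent(size_t len, int iterations)
{
    uint32_t prev = 0;

    assert(len <= BENCH_MAX_LEN);
    for (int i = 0; i < iterations; i++)
    {
        size_t   offset = prev & (BENCH_WINDOW - 1);
        uint32_t crc = 0xFFFFFFFFu;

        crc = crc32c_update(crc, bench_header, WAL_HEADER_LEN);
        crc = crc32c_update(crc, bench_buf + offset, len);
        prev = crc ^ 0xFFFFFFFFu;
    }
    return prev;
}
```

Timing a call such as bench_dependent(64, 1000000) with the direct and indirect variants swapped in for crc32c_update might show a difference that the plain loop hides.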

>> The main question I have is whether we can simplify this by always using a
>> runtime check and by inlining slicing-by-8 for small inputs.  That would be
>> dependent on the performance of slicing-by-8 and SSE 4.2 being comparable
>> for small inputs.
> 
> Slicing-by-8 needs one lookup and one XOR per byte of input, and other
> overheads, so I think it would still be very slow.

That's my suspicion, too.
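To illustrate the per-byte cost, here is a minimal, self-contained sketch of slicing-by-8 for CRC-32C. It is not the code in the PostgreSQL tree: the names sb8_init and sb8_crc32c are made up, and the tables are generated at runtime rather than compiled in. The main loop does eight table lookups and eight XORs per eight input bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t sb8_table[8][256];

/* Build the eight lookup tables for the reflected polynomial 0x82F63B78. */
static void
sb8_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t crc = i;

        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        sb8_table[0][i] = crc;
    }
    for (uint32_t i = 0; i < 256; i++)
    {
        for (int t = 1; t < 8; t++)
        {
            uint32_t prev = sb8_table[t - 1][i];

            sb8_table[t][i] = (prev >> 8) ^ sb8_table[0][prev & 0xFF];
        }
    }
}

/* Compute CRC-32C over len bytes; sb8_init() must have been called first. */
static uint32_t
sb8_crc32c(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(p != NULL || len == 0);
    while (len >= 8)
    {
        /* Fold the current CRC into the first four bytes (little-endian). */
        uint32_t lo = crc ^ ((uint32_t) p[0] | (uint32_t) p[1] << 8 |
                             (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24);

        /* One lookup and one XOR for each of the eight bytes. */
        crc = sb8_table[7][lo & 0xFF] ^
              sb8_table[6][(lo >> 8) & 0xFF] ^
              sb8_table[5][(lo >> 16) & 0xFF] ^
              sb8_table[4][lo >> 24] ^
              sb8_table[3][p[4]] ^
              sb8_table[2][p[5]] ^
              sb8_table[1][p[6]] ^
              sb8_table[0][p[7]];
        p += 8;
        len -= 8;
    }
    /* Remaining bytes fall back to the one-table, byte-at-a-time loop. */
    while (len-- > 0)
        crc = (crc >> 8) ^ sb8_table[0][(crc ^ *p++) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}
```

By comparison, the SSE4.2 crc32 instruction consumes up to eight bytes per instruction, so it seems unlikely that the table-driven loop would be competitive even for short inputs.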

[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com
[1] https://postgr.es/m/20231031033601.GA68409%40nathanxps13
[2] https://postgr.es/m/CAApHDvqyMNGVgwpaOPtENdq5uEMR%3DvSkRJEgG1S%2BX7Vtk1-EnA%40mail.gmail.com

-- 
nathan


