Re: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. |
Date | |
Msg-id | CANWCAZZSs2YANJENsn0TDm14-Q7vVrieMyKrSO3AyzGG=c7SUw@mail.gmail.com Whole thread Raw |
In response to | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. (Andres Freund <andres@anarazel.de>) |
Responses |
RE: Proposal for Updating CRC32C with AVX-512 Algorithm.
|
List | pgsql-hackers |
On Thu, Jun 13, 2024 at 3:11 AM Andres Freund <andres@anarazel.de> wrote: > > On 2024-05-01 15:56:08 +0000, Amonson, Paul D wrote: > > Workload call size distribution details (write heavy): > > * Average was approximately around 1,010 bytes per call > > * ~80% of the calls were under 256 bytes > > * ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192 > > This is extremely workload dependent, it's not hard to find workloads with > lots of very small record and very few big ones... What you observed might > have "just" been the warmup behaviour where more full page writes have to be > written. Sorry for going back so far, but this thread was pointed out to me, and this aspect of the design could use some more discussion: + * pg_crc32c_avx512(): compute the crc32c of the buffer, where the + * buffer length must be at least 256, and a multiple of 64. Based There is another technique that computes CRC on 3 separate chunks and combines them at the end, so about 3x faster on large-enough chunks. That's the way used for the Arm proposal [0], coincidentally also citing a white paper from Intel, but as Dimitry pointed out in that thread, its link has apparently disappeared. Raghuveer, do you know about this, and is there another link available? http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf The cut off point in one implementation is only 144 bytes [1] , which is maybe not as small as we'd like, but is quite a bit smaller than 256. That seems better suited to our workloads, and more portable. I have a *brand-new* laptop with an Intel chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture. I also understand that Sierra Forrest (a server product line) will be all little cores with no AVX-512 support, so I'm not sure why the proposal here requires AVX-512. > There a very frequent call computing COMP_CRC32C over just 20 bytes, while > holding a crucial lock. If we were to do introduce something like this > AVX-512 algorithm, it'd probably be worth to dispatch differently in case of > compile-time known small lengths. I know you've read an earlier version of the patch and realized that it wouldn't help here, but we could probably dispatch differently regardless, although it may only be worth it if we can inline the instructions. Since we technically only need to wait for xl_prev, I believe we could push the computation of the other 12 bytes to before acquiring the lock, then only execute a single instruction on xl_prev to complete the CRC computation. Is there any reason why we couldn't do that, assuming we have a clean way to make that portable? That would mean that the CRCs between major versions would be different, but I think we don't guarantee that anyway. [0] https://commitfest.postgresql.org/50/4620/ [1] https://github.com/komrad36/CRC/blob/master/CRC/golden_intel.cpp#L138C27-L138C42 -- John Naylor Amazon Web Services
pgsql-hackers by date: