Re: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. |
Date | |
Msg-id | CANWCAZbr4sO1bPoS+E=iRWnrBZp7zUKZEJk39KYt_Pu9+X1-SQ@mail.gmail.com Whole thread Raw |
In response to | RE: Proposal for Updating CRC32C with AVX-512 Algorithm. ("Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com>) |
List | pgsql-hackers |
On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote: > > There is another technique that computes CRC on 3 separate chunks and > > combines them at the end, so about 3x faster on large-enough chunks. > > That's the way used for the Arm proposal [0], coincidentally also citing a white > > paper from Intel, but as Dimitry pointed out in that thread, its link has apparently > > disappeared. Raghuveer, do you know about this, and is there another link > > available? > > > > http://www.intel.com/content/dam/www/public/us/en/documents/white- > > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf > > I am not aware of this paper. Let me poke a few people internally and get back to you on this. Thanks! I have a portable PoC of how this works, but I'll save that for another thread, since it's not Intel (or Arm) specific. > > The cut off point in one implementation is only 144 bytes [1] , which is maybe not > > as small as we'd like, but is quite a bit smaller than 256. That seems better suited > > to our workloads, and more portable. I have a *brand-new* laptop with an Intel > > chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture. > > I also understand that Sierra Forrest (a server product line) will be all little cores > > with no AVX-512 support, so I'm not sure why the proposal here requires AVX- > > 512. > > AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. Sierra Forest contains the SSE and AVX/AVX2family ISA but AFAIK AVX/AVX2 does not contain any CRC32C specific instructions. See: CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics used in the patch are not CRC-specific, if I understand correctly. My point was, it seems Intel still considers AVX-512 as optional, so we can't count on it being present even in future chips. That's why I'm interested in alternatives, at least as a first step. If we can get 3x throughput, the calculation might bend up low enough in the profile that going to 6x might not be noticeable (not sure). > > > There a very frequent call computing COMP_CRC32C over just 20 bytes, > > > while holding a crucial lock. If we were to do introduce something > > > like this > > > AVX-512 algorithm, it'd probably be worth to dispatch differently in > > > case of compile-time known small lengths. > > > > I know you've read an earlier version of the patch and realized that it wouldn't > > help here, but we could probably dispatch differently regardless, although it may > > only be worth it if we can inline the instructions. Since we technically only need to > > wait for xl_prev, I believe we could push the computation of the other 12 bytes to > > before acquiring the lock, then only execute a single instruction on xl_prev to > > complete the CRC computation. Is there any reason why we couldn't do that, > > assuming we have a clean way to make that portable? That would mean that the > > CRCs between major versions would be different, but I think we don't guarantee > > that anyway. > > Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am ontravel with limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is thisa blocker for the patch or is this something we can address as a revision? This is orthogonal and is not related to the patch, since it doesn't affect 8 and 20-byte paths, only 256 and greater. -- John Naylor Amazon Web Services
pgsql-hackers by date: