RE: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers
From | Devulapalli, Raghuveer |
---|---|
Subject | RE: Proposal for Updating CRC32C with AVX-512 Algorithm. |
Date | |
Msg-id | PH8PR11MB8286507D21FBF21736BE6680FB322@PH8PR11MB8286.namprd11.prod.outlook.com Whole thread Raw |
In response to | Re: Proposal for Updating CRC32C with AVX-512 Algorithm. (John Naylor <johncnaylorls@gmail.com>) |
Responses |
Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
|
List | pgsql-hackers |
> Sorry for going back so far, but this thread was pointed out to me, and this aspect > of the design could use some more discussion: > > + * pg_crc32c_avx512(): compute the crc32c of the buffer, where the > + * buffer length must be at least 256, and a multiple of 64. Based > > There is another technique that computes CRC on 3 separate chunks and > combines them at the end, so about 3x faster on large-enough chunks. > That's the way used for the Arm proposal [0], coincidentally also citing a white > paper from Intel, but as Dimitry pointed out in that thread, its link has apparently > disappeared. Raghuveer, do you know about this, and is there another link > available? > > http://www.intel.com/content/dam/www/public/us/en/documents/white- > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf I am not aware of this paper. Let me poke a few people internally and get back to you on this. > The cut off point in one implementation is only 144 bytes [1] , which is maybe not > as small as we'd like, but is quite a bit smaller than 256. That seems better suited > to our workloads, and more portable. I have a *brand-new* laptop with an Intel > chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture. > I also understand that Sierra Forrest (a server product line) will be all little cores > with no AVX-512 support, so I'm not sure why the proposal here requires AVX- > 512. AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. Sierra Forest contains the SSE and AVX/AVX2family ISA but AFAIK AVX/AVX2 does not contain any CRC32C specific instructions. See: 1) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pclmul&ig_expand=754&techs=AVX_ALL 2) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=754&techs=AVX_ALL&text=crc32 > > > There a very frequent call computing COMP_CRC32C over just 20 bytes, > > while holding a crucial lock. If we were to do introduce something > > like this > > AVX-512 algorithm, it'd probably be worth to dispatch differently in > > case of compile-time known small lengths. > > I know you've read an earlier version of the patch and realized that it wouldn't > help here, but we could probably dispatch differently regardless, although it may > only be worth it if we can inline the instructions. Since we technically only need to > wait for xl_prev, I believe we could push the computation of the other 12 bytes to > before acquiring the lock, then only execute a single instruction on xl_prev to > complete the CRC computation. Is there any reason why we couldn't do that, > assuming we have a clean way to make that portable? That would mean that the > CRCs between major versions would be different, but I think we don't guarantee > that anyway. Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am on travelwith limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is thisa blocker for the patch or is this something we can address as a revision? Raghuveer
pgsql-hackers by date: