RE: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers

From Devulapalli, Raghuveer
Subject RE: Proposal for Updating CRC32C with AVX-512 Algorithm.
Date
Msg-id PH8PR11MB8286507D21FBF21736BE6680FB322@PH8PR11MB8286.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: Proposal for Updating CRC32C with AVX-512 Algorithm.  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
List pgsql-hackers
> Sorry for going back so far, but this thread was pointed out to me, and this aspect
> of the design could use some more discussion:
> 
> + * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
> + * buffer length must be at least 256, and a multiple of 64. Based
> 
> There is another technique that computes CRC on 3 separate chunks and
> combines them at the end, so about 3x faster on large-enough chunks.
> That's the way used for the Arm proposal [0], coincidentally also citing a white
> paper from Intel, but as Dimitry pointed out in that thread, its link has apparently
> disappeared. Raghuveer, do you know about this, and is there another link
> available?
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/white-
> papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

 I am not aware of this paper. Let me poke a few people internally and get back to you on this. 

> The cut off point in one implementation is only 144 bytes [1] , which is maybe not
> as small as we'd like, but is quite a bit smaller than 256. That seems better suited
> to our workloads, and more portable. I have a *brand-new* laptop with an Intel
> chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture.
> I also understand that Sierra Forrest (a server product line) will be all little cores
> with no AVX-512 support, so I'm not sure why the proposal here requires AVX-
> 512.

AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. Sierra Forest contains the SSE and
AVX/AVX2family ISA but AFAIK AVX/AVX2 does not contain any CRC32C specific instructions. See:
 

1) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pclmul&ig_expand=754&techs=AVX_ALL
2) https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=754&techs=AVX_ALL&text=crc32

> 
> > There a very frequent call computing COMP_CRC32C over just 20 bytes,
> > while holding a crucial lock.  If we were to do introduce something
> > like this
> > AVX-512 algorithm, it'd probably be worth to dispatch differently in
> > case of compile-time known small lengths.
> 
> I know you've read an earlier version of the patch and realized that it wouldn't
> help here, but we could probably dispatch differently regardless, although it may
> only be worth it if we can inline the instructions. Since we technically only need to
> wait for xl_prev, I believe we could push the computation of the other 12 bytes to
> before acquiring the lock, then only execute a single instruction on xl_prev to
> complete the CRC computation. Is there any reason why we couldn't do that,
> assuming we have a clean way to make that portable? That would mean that the
> CRCs between major versions would be different, but I think we don't guarantee
> that anyway.

Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am on
travelwith limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is
thisa blocker for the patch or is this something we can address as a revision? 
 
 
Raghuveer
 


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: doc: Remove LC_COLLATE and LC_CTYPE from SHOW command
Next
From: Tom Lane
Date:
Subject: Re: Support for unsigned integer types