Re: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers

From John Naylor
Subject Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
Date
Msg-id CANWCAZbr4sO1bPoS+E=iRWnrBZp7zUKZEJk39KYt_Pu9+X1-SQ@mail.gmail.com
Whole thread Raw
In response to RE: Proposal for Updating CRC32C with AVX-512 Algorithm.  ("Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com>)
List pgsql-hackers
On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:

> > There is another technique that computes CRC on 3 separate chunks and
> > combines them at the end, so about 3x faster on large-enough chunks.
> > That's the way used for the Arm proposal [0], coincidentally also citing a white
> > paper from Intel, but as Dimitry pointed out in that thread, its link has apparently
> > disappeared. Raghuveer, do you know about this, and is there another link
> > available?
> >
> > http://www.intel.com/content/dam/www/public/us/en/documents/white-
> > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
>
>  I am not aware of this paper. Let me poke a few people internally and get back to you on this.

Thanks! I have a portable PoC of how this works, but I'll save that
for another thread, since it's not Intel (or Arm) specific.

> > The cut off point in one implementation is only 144 bytes [1] , which is maybe not
> > as small as we'd like, but is quite a bit smaller than 256. That seems better suited
> > to our workloads, and more portable. I have a *brand-new* laptop with an Intel
> > chip, and IIUC it doesn't support AVX-512 because it uses a big-little architecture.
> > I also understand that Sierra Forrest (a server product line) will be all little cores
> > with no AVX-512 support, so I'm not sure why the proposal here requires AVX-
> > 512.
>
> AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. Sierra Forest contains the SSE and
AVX/AVX2family ISA but AFAIK AVX/AVX2 does not contain any CRC32C specific instructions. See: 

CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics
used in the patch are not CRC-specific, if I understand correctly.

My point was, it seems Intel still considers AVX-512 as optional, so
we can't count on it being present even in future chips. That's why
I'm interested in alternatives, at least as a first step. If we can
get 3x throughput, the calculation might bend up low enough in the
profile that going to 6x might not be noticeable (not sure).

> > > There a very frequent call computing COMP_CRC32C over just 20 bytes,
> > > while holding a crucial lock.  If we were to do introduce something
> > > like this
> > > AVX-512 algorithm, it'd probably be worth to dispatch differently in
> > > case of compile-time known small lengths.
> >
> > I know you've read an earlier version of the patch and realized that it wouldn't
> > help here, but we could probably dispatch differently regardless, although it may
> > only be worth it if we can inline the instructions. Since we technically only need to
> > wait for xl_prev, I believe we could push the computation of the other 12 bytes to
> > before acquiring the lock, then only execute a single instruction on xl_prev to
> > complete the CRC computation. Is there any reason why we couldn't do that,
> > assuming we have a clean way to make that portable? That would mean that the
> > CRCs between major versions would be different, but I think we don't guarantee
> > that anyway.
>
> Not sure about that. This is not my expertise and I might need a little time to figure this out. Unfortunately, I am
ontravel with limited internet connection for the next 6 weeks. I will only be able to address this when I get back. Is
thisa blocker for the patch or is this something we can address as a revision? 

This is orthogonal and is not related to the patch, since it doesn't
affect 8 and 20-byte paths, only 256 and greater.

--
John Naylor
Amazon Web Services



pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: doc: Mention clock synchronization recommendation for hot_standby_feedback
Next
From: Dilip Kumar
Date:
Subject: Re: Track the amount of time waiting due to cost_delay