Re: Proposal for Updating CRC32C with AVX-512 Algorithm. - Mailing list pgsql-hackers

From John Naylor
Subject Re: Proposal for Updating CRC32C with AVX-512 Algorithm.
Date
Msg-id CANWCAZZSs2YANJENsn0TDm14-Q7vVrieMyKrSO3AyzGG=c7SUw@mail.gmail.com
Whole thread Raw
In response to Re: Proposal for Updating CRC32C with AVX-512 Algorithm.  (Andres Freund <andres@anarazel.de>)
Responses RE: Proposal for Updating CRC32C with AVX-512 Algorithm.
List pgsql-hackers
On Thu, Jun 13, 2024 at 3:11 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2024-05-01 15:56:08 +0000, Amonson, Paul D wrote:

> > Workload call size distribution details (write heavy):
> >    * Average was approximately around 1,010 bytes per call
> >    * ~80% of the calls were under 256 bytes
> >    * ~20% of the calls were greater than or equal to 256 bytes up to the max buffer size of 8192
>
> This is extremely workload dependent, it's not hard to find workloads with
> lots of very small record and very few big ones...  What you observed might
> have "just" been the warmup behaviour where more full page writes have to be
> written.

Sorry for going back so far, but this thread was pointed out to me,
and this aspect of the design could use some more discussion:

+ * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
+ * buffer length must be at least 256, and a multiple of 64. Based

There is another technique that computes CRC on 3 separate chunks and
combines them at the end, so about 3x faster on large-enough chunks.
That's the way used for the Arm proposal [0], coincidentally also
citing a white paper from Intel, but as Dimitry pointed out in that
thread, its link has apparently disappeared. Raghuveer, do you know
about this, and is there another link available?


http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

The cut off point in one implementation is only 144 bytes [1] , which
is maybe not as small as we'd like, but is quite a bit smaller than
256. That seems better suited to our workloads, and more portable. I
have a *brand-new* laptop with an Intel chip, and IIUC it doesn't
support AVX-512 because it uses a big-little architecture. I also
understand that Sierra Forrest (a server product line) will be all
little cores with no AVX-512 support, so I'm not sure why the proposal
here requires AVX-512.

> There a very frequent call computing COMP_CRC32C over just 20 bytes, while
> holding a crucial lock.  If we were to do introduce something like this
> AVX-512 algorithm, it'd probably be worth to dispatch differently in case of
> compile-time known small lengths.

I know you've read an earlier version of the patch and realized that
it wouldn't help here, but we could probably dispatch differently
regardless, although it may only be worth it if we can inline the
instructions. Since we technically only need to wait for xl_prev, I
believe we could push the computation of the other 12 bytes to before
acquiring the lock, then only execute a single instruction on xl_prev
to complete the CRC computation. Is there any reason why we couldn't
do that, assuming we have a clean way to make that portable? That
would mean that the CRCs between major versions would be different,
but I think we don't guarantee that anyway.

[0] https://commitfest.postgresql.org/50/4620/
[1] https://github.com/komrad36/CRC/blob/master/CRC/golden_intel.cpp#L138C27-L138C42


--
John Naylor
Amazon Web Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Cannot find a working 64-bit integer type on Illumos
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Memory leak in WAL sender with pgoutput (v10~)