Re: CRC32C Parallel Computation Optimization on ARM - Mailing list pgsql-hackers

From John Naylor
Subject Re: CRC32C Parallel Computation Optimization on ARM
Date
Msg-id CANWCAZbO46fMgK1K5Tk24HLh9dc8cwFnK1v1Q=dxqLkfweO9ig@mail.gmail.com
In response to Re: CRC32C Parallel Computation Optimization on ARM  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:

> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice.  I'm curious how this compares to both the existing implementations
> and the proposed ones that require new intrinsics.  I like the idea of
> avoiding new runtime and config checks, especially if the performance is
> somewhat comparable for the most popular cases (i.e., dozens of bytes to a
> few thousand bytes).

With 8k inputs on x86, it's fairly close to 3x faster than master.

I wasn't very clear: v9 still has a cutoff of 1008 bytes, simply
copied from 0008, but on a slightly old machine the crossover point is
about 400-600 bytes. Microbenchmarks that hammer on single
instructions are very finicky, so I don't trust these numbers much.

With hardware CLMUL, I'm guessing the cutoff would be between 120 and
192 bytes (must be a multiple of 24 -- 3 words), and would depend on
architecture. Arm has an advantage that vmull_p64() operates on
scalars, but on x86 the corresponding operation is
_mm_clmulepi64_si128(), and there's a bit of shuffling in and out of
vector registers.
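
For readers unfamiliar with the operation: vmull_p64() and
_mm_clmulepi64_si128() both compute a carryless (polynomial)
multiplication over GF(2), i.e. a multiply where the partial products
are combined with XOR instead of addition. A minimal portable sketch of
that operation might look like the following. It uses 32-bit operands
for brevity (the hardware instructions take 64-bit operands), and the
name clmul32 is hypothetical, not something from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: carryless multiply of two
 * 32-bit polynomials over GF(2), giving a result of up to 63 bits.
 */
static uint64_t
clmul32(uint32_t a, uint32_t b)
{
	uint64_t	result = 0;

	for (int i = 0; i < 32; i++)
	{
		/* For each set bit of b, XOR in a shifted copy of a (no carries). */
		if ((b >> i) & 1)
			result ^= (uint64_t) a << i;
	}
	return result;
}
```

For example, clmul32(3, 3) is 5 rather than 9, since (x + 1)^2 is
x^2 + 1 when coefficients are reduced mod 2.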

> If we still want to add new intrinsics, would it be easy enough to add them
> on top of this patch?  Or would it require further restructuring?

I'm still trying to wrap my head around how function selection works
after commit 4b03a27fafc, but it could be something like this on x86:

#if defined(__has_attribute) && __has_attribute (target)

pg_attribute_target("sse4.2,pclmul")
pg_comp_crc32c_sse42
{
  <big loop with special case for end>
  <hardware carryless multiply>
  <tail>
}

#endif

pg_attribute_target("sse4.2")
pg_comp_crc32c_sse42
{
  <big loop>
  <software carryless multiply>
  <tail>
}

...where we have the tail part in a separate function for readability.
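
To make the "big loop / carryless multiply / tail" shape above a bit
more concrete, here is a rough, portable sketch of the general
three-lane idea. It is not the patch's code: all function names are
hypothetical, the per-lane CRC is a plain bitwise loop rather than the
hardware CRC instruction, and the combine step advances the register
by feeding zero bits, which only stands in for the multiply by a
precomputed constant that a real implementation would do with a
carryless multiply.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY 0x82F63B78u

/* One bit step of the raw CRC register; this step is linear over GF(2). */
static uint32_t
crc32c_step(uint32_t crc)
{
	return (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1)));
}

/* Advance the raw register over len bytes (no init or final inversion). */
static uint32_t
crc32c_raw(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = crc32c_step(crc);
	}
	return crc;
}

/*
 * Advance the register as if len zero bytes had been fed. This is the
 * slow stand-in for multiplying by a precomputed constant.
 */
static uint32_t
crc32c_shift(uint32_t crc, size_t len)
{
	for (size_t i = 0; i < len * 8; i++)
		crc = crc32c_step(crc);
	return crc;
}

/* Hypothetical name: three independent lanes, a combine step, then a tail. */
static uint32_t
crc32c_three_lane(uint32_t crc, const unsigned char *p, size_t len)
{
	/* Bytes per lane; the three lanes together cover a multiple of 24. */
	size_t		lane = len / 24 * 8;

	if (lane > 0)
	{
		/* Only the first lane carries the incoming register state. */
		uint32_t	c0 = crc32c_raw(crc, p, lane);
		uint32_t	c1 = crc32c_raw(0, p + lane, lane);
		uint32_t	c2 = crc32c_raw(0, p + 2 * lane, lane);

		/* Combine: shift each earlier lane past the data that follows it. */
		crc = crc32c_shift(crc32c_shift(c0, lane) ^ c1, lane) ^ c2;
		p += 3 * lane;
		len -= 3 * lane;
	}

	/* Tail: whatever is left after the lanes (fewer than 24 bytes). */
	return crc32c_raw(crc, p, len);
}

/* Standard CRC-32C framing: all-ones initial value, final inversion. */
static uint32_t
crc32c(const void *data, size_t len)
{
	return ~crc32c_three_lane(0xFFFFFFFFu, (const unsigned char *) data, len);
}
```

The combine step relies on the raw CRC update being linear over GF(2),
so lanes started from zero can be XORed in after shifting. In a real
implementation the three lane loops would be interleaved so the CPU
can overlap the CRC instructions.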

On Arm it might have to be as complex as in 0008, since as you've
mentioned, compiler support for the needed attributes is still pretty
new.

--
John Naylor
Amazon Web Services


