RE: Improve CRC32C performance on SSE4.2 - Mailing list pgsql-hackers

From Devulapalli, Raghuveer
Subject RE: Improve CRC32C performance on SSE4.2
Date
Msg-id PH8PR11MB82862B464609340FB0A4BA37FBCD2@PH8PR11MB8286.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: Improve CRC32C performance on SSE4.2  (John Naylor <johncnaylorls@gmail.com>)
Responses Re: Improve CRC32C performance on SSE4.2
List pgsql-hackers
Hi John, 

Going back to your earlier comment: 

> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)

> > Isn't that one thing currently pg_cpu_have(FEATURE)?
> 
> No, I mean the one thing would be "do we have hardware CRC?", and each
> architecture would set (and check if needed) the same slot.
> 
> I'm not 100% sure this is the better way, and there may be a disadvantage I
> haven't thought of, but this makes the most sense to me. I'm willing to hear
> reasons to do it differently, but I'm thinking those reasons need to be weighed
> against the loss of abstraction.

IMO, CPU SIMD (SSE4.2, AVX, etc.) features are a module of their own separate from capabilities/features supported in
postgres(CRC32C, bitcount, etc.). Exposing architecture-specific details to the features that need to check against
themis a way to make code more modular and reusable. It is reasonable to expect developers writing SIMD specific
functionsto naturally understand the runtime architecture requirements for that function which is easily accessible by
justchecking the value of pg_cpu_have(PG_CPU_FEATURE_..). If another feature in postgres requires SSE4.2, then the CPU
initializationmodule doesn't need to be modified at all. They just have to worry about their feature and its CPU
requirements.
 

> Just so we're on the same page, the current separate files do initialization, not
> dispatch. I'm seeing conflicting design goals
> here:
> 
> 1. Have multiple similar dispatch functions with the only difference (for now)
> being certain names (not in the patch, just discussed).
> 2. Put initialization routines in the same file, even though they have literally
> nothing in common.

Sorry for not being clear. I will move the initialization routines to separate files. Just haven’t gotten to it yet
withv10. 
 

> 
> > This also makes it easier to add more implementations in the future without
> having to make the dispatch function work for both ARM and x86.
> 
> I don't quite understand, since with 0001 it already works (at least in theory) for 3
> architectures with hardware CRC, plus those without, and I don't see it changing
> with more architectures.

The reason its working across all architectures as of now is because there is only one runtime check for CRC32C feature
acrossx86, ARM and LongArch. (PCLMUL dispatch happens inside the SSE42). If and when we add the AVX-512 implementation,
won'tthe dispatch logic will look different for x86 and ARM? Along with that, the header file pg_crc32c.h will also
looka lot messier with a whole bunch of nested #ifdefs.  IMO, the header file should be devoid of any architecture
dispatchlogic and simply contain declarations of various implementations (see
https://github.com/r-devulap/postgressql/blob/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/src/include/port/pg_crc32c.hfor
example).The dispatch logic should be handled separately in a C file. 
 

> 
> +/* Access to pg_cpucap for modules that need runtime CPUID information
> +*/ bool pg_cpu_have(int feature_id) {
> +    return pg_cpucap[feature_id];
>  }
> 
> Putting this in a separate translation may have to do the decision to turn v8's
> #defines into an enum? In any case, this means doing a function call to decide
> which function to call. That makes no sense.

The reason it is a separate translational unit is because I intended pg_cpucap to be a read only variable outside of
pg_cpucap.c.If the function overhead is not preferred, then I can get rid of it.
 

> 
> The goal of v9/10 was to centralize the initialization from cpuid etc, but it also did
> a major re-architecture of the runtime-checks at the same. That made review
> more difficult.

I assumed we wanted to move the runtime checks to a central place and that needed this re-architecture. 

> 
> +#if defined(__x86_64__) || defined ...
> +    pg_cpucap_x86();
> +#elif defined(__aarch64__) || defined ...
> +    pg_cpucap_arm();
> +#endif
> 
> I view this another conflict in design goals: After putting everything in the same
> file, the routines still have different names and have to be guarded accordingly. I
> don't really see the advantage, especially since these guard symbols don't even
> match the ones that guard the actual initialization steps. I tried to imply that in
> my last review, but maybe I should have been more explicit.
> 
> I think the least painful step is to take the x86 initialization from v10, which is
> looking great, but
> - keep separate initialization files
> - don't whack around the runtime representation, at least not in the same patch

Sure. I can fix that in v11. 

Raghuveer





pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Update docs for UUID data type
Next
From: Tom Lane
Date:
Subject: Re: Should work_mem be stable for a prepared statement?