Re: Improve CRC32C performance on SSE4.2 - Mailing list pgsql-hackers

From John Naylor
Subject Re: Improve CRC32C performance on SSE4.2
Date
Msg-id CANWCAZbG_pPFuLRaczYApBykaHCXQ-STAXO1=J6_4TRvy23Byg@mail.gmail.com
Whole thread Raw
In response to RE: Improve CRC32C performance on SSE4.2  ("Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com>)
List pgsql-hackers
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid checks into one single place including the
AVX512popcount code. Any new feature that requires CPUID information can access that information with
pg_cpu_have[FEATURE]defined in pg_cpucap.h and initialized in pg_cpucap.c. v8 also had a typo in configure files which
causeda build failure. Its fixed in v9. 
>
> Pretty sure the ARM code paths need some correction. Let me know what you think.

Thanks, I think this is a good direction. Some comments:

+static void pg_cpuid(int leaf, int subleaf, unsigned int* exx)
+{
+#if defined(HAVE__GET_CPUID_COUNT)
+ __get_cpuid_count(leaf, subleaf, &exx[0], &exx[1], &exx[2], &exx[3]);
+#elif defined(HAVE__CPUIDEX)
+ __cpuidex(exx, leaf, subleaf);
+#else
+#error cpuid instruction not available
+#endif
+}

Our current configure still looks for __get_cpuid and __cpuid. We
committed checking these new ones fairly recently, and they were
further gated by USE_AVX512_POPCNT_WITH_RUNTIME_CHECK. It seems like
here we should do something like the following, where "+" lines are
from the patch and other lines are mine:

+static void
+pg_cpucap_x86(void)
+{
+ unsigned int exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID)
__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUID)
__cpuid(exx, 1);
#endif

+    pg_cpucap[PG_CPU_FEATURE_SSE42] = (exx[2] & (1 << 20)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_PCLMUL] = (exx[2] & (1 << 1)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_POPCNT] = (exx[2] & (1 << 23)) != 0;
+    /* osxsave */
+    if ((exx[2] & (1 << 27)) == 0) {
+        return;
+    }
+    /* avx512 os support */
+    if (zmm_regs_available()) {
+        return;
+    }

// BTW, I do like the gating here that reduces the number of places
that have to know about zmm and xsave. (Side note: shouldn't that be
"if !(zmm_regs_available())"?)

+    /* reset for second cpuid call on leaf 7 to check extended avx512
support */

exx[4] = {0, 0, 0, 0};

#if defined(HAVE__GET_CPUID_COUNT)
    __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
#elif defined(HAVE__CPUIDEX)
    __cpuidex(exx, 7, 0);
#endif

+    pg_cpucap[PG_CPU_FEATURE_AVX512F]  = (exx[1] & (1 << 16)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_AVX512BW]        = (exx[1] & (1 << 30)) != 0;
+    pg_cpucap[PG_CPU_FEATURE_AVX512VPOPCNTDQ] = (exx[2] & (1 << 14)) != 0;
+
+}

What do you think?

-#define PGCPUCAP_INIT           (1 << 0)
-#define PGCPUCAP_POPCNT         (1 << 1)
-#define PGCPUCAP_VPOPCNT        (1 << 2)
-#define PGCPUCAP_CRC32C         (1 << 3)
-#define PGCPUCAP_CLMUL          (1 << 4)
+enum pg_cpucap__
+{
+    PG_CPU_FEATURE_INIT              = 0,
+    // X86
+    PG_CPU_FEATURE_SSE42             = 1,
+    PG_CPU_FEATURE_POPCNT            = 2,
+    PG_CPU_FEATURE_PCLMUL            = 3,
[...]
+    PG_CPU_FEATURE_ARMV8_CRC32C      = 100,

I'm not a fan of exposing these architecture-specific details to
places that consult the capabilities. That requires things like

+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
[...]
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_ARMV8_CRC32C)

I'd prefer to have 1 capability <-> one place to check at runtime for
architectures that need that, and to keep architecture details private
to the initialization step. . Even for things that test for which
function pointer to use, I think it's a cleaner interface to look at
one thing.

+static void
+pg_cpucap_arch()
+{
+    /* WIP: configure checks */
+#ifdef __x86_64__
+    pg_cpucap_x86();
+#else // ARM:
+    pg_cpucap_arm();
+#endif
+}

If we're going to have a single file for the init step, we don't need
this -- we'd just have a different definition of
pg_cpucap_initialize() in each part, with a default that only adds the
"init" slot:

#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) ||
  defined(__x86_64__) || defined(__x86_64) || defined(_M_AMD64)

<cpuid checks>

#elif defined(__arm__) || defined(__arm) ||
   defined(__aarch64__) || defined(_M_ARM64)

<for now only pg_crc32c_armv8_available()>

#else
<only init>
#endif

--

John Naylor
Amazon Web Services



pgsql-hackers by date:

Previous
From: Paul Jungwirth
Date:
Subject: Re: SQL:2011 application time
Next
From: Michael Paquier
Date:
Subject: Re: Fix logging for invalid recovery timeline