Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers
From | Nathan Bossart |
---|---|
Subject | Re: Popcount optimization using AVX512 |
Date | |
Msg-id | 20240307213600.GA563369@nathanxps13 Whole thread Raw |
In response to | Re: Popcount optimization using AVX512 (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
RE: Popcount optimization using AVX512
|
List | pgsql-hackers |
As promised... > +# Check for Intel AVX512 intrinsics to do POPCNT calculations. > +# > +PGAC_AVX512_POPCNT_INTRINSICS([]) > +if test x"$pgac_avx512_popcnt_intrinsics" != x"yes"; then > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f]) > +fi > +AC_SUBST(CFLAGS_AVX512_POPCNT) I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are other instructions required that need -mavx512f, then we might need to expand the test. > 13 files changed, 657 insertions(+), 119 deletions(-) I still think it's worth breaking this change into at least 2 patches. In particular, I think there's an opportunity to do the refactoring into pg_popcnt_choose.c and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These changes are likely straightforward, and getting them out of the way early would make it easier to focus on the more interesting changes. IMHO there are a lot of moving parts in this patch. > +#undef HAVE__GET_CPUID_COUNT > + > +/* Define to 1 if you have immintrin. */ > +#undef HAVE__IMMINTRIN Is this missing HAVE__CPUIDEX? > uint64 > -pg_popcount(const char *buf, int bytes) > +pg_popcount_slow(const char *buf, int bytes) > { > uint64 popcnt = 0; > > -#if SIZEOF_VOID_P >= 8 > +#if SIZEOF_VOID_P == 8 > /* Process in 64-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(8, buf)) > { > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes) > > buf = (const char *) words; > } > -#else > +#elif SIZEOF_VOID_P == 4 > /* Process in 32-bit chunks if the buffer is aligned. */ > if (buf == (const char *) TYPEALIGN(4, buf)) > { Apologies for harping on this, but I'm still not seeing the need for these SIZEOF_VOID_P changes. While it's unlikely that this makes any practical difference, I see no reason to more strictly check SIZEOF_VOID_P here. > + /* Process any remaining bytes */ > + while (bytes--) > + popcnt += pg_number_of_ones[(unsigned char) *buf++]; > + return popcnt; > +#else > + return pg_popcount_slow(buf, bytes); > +#endif /* USE_AVX512_CODE */ nitpick: Could we call pg_popcount_slow() in a common section for these "remaining bytes?" > +#if defined(_MSC_VER) > + pg_popcount_indirect = pg_popcount512_fast; > +#else > + pg_popcount = pg_popcount512_fast; > +#endif These _MSC_VER sections are interesting. I'm assuming this is the workaround for the MSVC linking issue you mentioned above. I haven't looked too closely, but I wonder if the CRC32C code (see src/include/port/pg_crc32c.h) is doing something different to avoid this issue. Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned through this thread and didn't see any recent benchmark results for the latest form of the patch. I think it's worth verifying that we are still seeing the expected improvements. [0] https://postgr.es/m/202402071953.5c4z7t6kl7ts%40alvherre.pgsql -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: