RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers
From | Amonson, Paul D |
---|---|
Subject | RE: Popcount optimization using AVX512 |
Date | |
Msg-id | BL1PR11MB5304165E7A81E0107E70B069DC242@BL1PR11MB5304.namprd11.prod.outlook.com Whole thread Raw |
In response to | Re: Popcount optimization using AVX512 (Nathan Bossart <nathandbossart@gmail.com>) |
Responses |
Re: Popcount optimization using AVX512
|
List | pgsql-hackers |
> -----Original Message----- > From: Nathan Bossart <nathandbossart@gmail.com> > Sent: Thursday, March 7, 2024 1:36 PM > Subject: Re: Popcount optimization using AVX512 I will be splitting the request into 2 patches. I am attaching the first patch (refactoring only) and I updated the commitfestentry to match this patch. I have a question however: Do I need to wait for the refactor patch to be merged before I post the AVX portion of this feature in this thread? > > + PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f]) > > I'm curious why we need both -mavx512vpopcntdq and -mavx512f. On my > machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are > other instructions required that need -mavx512f, then we might need to > expand the test. First, nice catch on the required flags to build! When I changed my algorithm, dependence on the -mavx512f flag was no longerneeded, In the second patch (AVX specific) I will fix this. > I still think it's worth breaking this change into at least 2 patches. In particular, > I think there's an opportunity to do the refactoring into pg_popcnt_choose.c > and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff. These > changes are likely straightforward, and getting them out of the way early > would make it easier to focus on the more interesting changes. IMHO there > are a lot of moving parts in this patch. As stated above I am doing this in 2 patches. :) > > +#undef HAVE__GET_CPUID_COUNT > > + > > +/* Define to 1 if you have immintrin. */ #undef HAVE__IMMINTRIN > > Is this missing HAVE__CPUIDEX? Yes I missed it, I will include in the second patch (AVX specific) of the 2 patches. > > uint64 > > -pg_popcount(const char *buf, int bytes) > > +pg_popcount_slow(const char *buf, int bytes) > > { > > uint64 popcnt = 0; > > > > -#if SIZEOF_VOID_P >= 8 > > +#if SIZEOF_VOID_P == 8 > > /* Process in 64-bit chunks if the buffer is aligned. */ > > if (buf == (const char *) TYPEALIGN(8, buf)) > > { > > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes) > > > > buf = (const char *) words; > > } > > -#else > > +#elif SIZEOF_VOID_P == 4 > > /* Process in 32-bit chunks if the buffer is aligned. */ > > if (buf == (const char *) TYPEALIGN(4, buf)) > > { > > Apologies for harping on this, but I'm still not seeing the need for these > SIZEOF_VOID_P changes. While it's unlikely that this makes any practical > difference, I see no reason to more strictly check SIZEOF_VOID_P here. I got rid of the second occurrence as I agree it is not needed but unless you see something I don't how to know which functionto call between a 32-bit and 64-bit architecture? Maybe I am missing something obvious? What exactly do you suggesthere? I am happy to always call either pg_popcount32() or pg_popcount64() with the understanding that it may not beoptimal, but I do need to know which to use. > > + /* Process any remaining bytes */ > > + while (bytes--) > > + popcnt += pg_number_of_ones[(unsigned char) *buf++]; > > + return popcnt; > > +#else > > + return pg_popcount_slow(buf, bytes); > > +#endif /* USE_AVX512_CODE */ > > nitpick: Could we call pg_popcount_slow() in a common section for these > "remaining bytes?" Agreed, will fix in the second patch as well. > > +#if defined(_MSC_VER) > > + pg_popcount_indirect = pg_popcount512_fast; #else > > + pg_popcount = pg_popcount512_fast; #endif > These _MSC_VER sections are interesting. I'm assuming this is the > workaround for the MSVC linking issue you mentioned above. I haven't > looked too closely, but I wonder if the CRC32C code (see > src/include/port/pg_crc32c.h) is doing something different to avoid this issue. Using the latest master branch, I see what the needed changes are, I will implement using PGDLLIMPORT macro in the secondpatch. > Upthread, Alvaro suggested a benchmark [0] that might be useful. I scanned > through this thread and didn't see any recent benchmark results for the latest > form of the patch. I think it's worth verifying that we are still seeing the > expected improvements. I will get new benchmarks using the same process I used before (from Akash) so I get apples to apples. These are pendingcompletion of the second patch which is still in progress. Just a reminder, I asked questions above about 1) multi-part dependent patches and, 2) What specifically to do about theSIZE_VOID_P checks. :) Thanks, Paul
Attachment
pgsql-hackers by date: