RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Amonson, Paul D
Subject RE: Popcount optimization using AVX512
Date
Msg-id BL1PR11MB5304165E7A81E0107E70B069DC242@BL1PR11MB5304.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Popcount optimization using AVX512
List pgsql-hackers
> -----Original Message-----
> From: Nathan Bossart <nathandbossart@gmail.com>
> Sent: Thursday, March 7, 2024 1:36 PM
> Subject: Re: Popcount optimization using AVX512

I will be splitting the request into 2 patches. I am attaching the first patch (refactoring only) and I updated the
commitfestentry to match this patch. I have a question however:
 
Do I need to wait for the refactor patch to be merged before I post the AVX portion of this feature in this thread?

> > +  PGAC_AVX512_POPCNT_INTRINSICS([-mavx512vpopcntdq -mavx512f])
> 
> I'm curious why we need both -mavx512vpopcntdq and -mavx512f.  On my
> machine, -mavx512vpopcntdq alone is enough to pass this test, so if there are
> other instructions required that need -mavx512f, then we might need to
> expand the test.

First, nice catch on the required flags to build! When I changed my algorithm, dependence on the -mavx512f flag was no
longerneeded, In the second patch (AVX specific) I will fix this.
 

> I still think it's worth breaking this change into at least 2 patches.  In particular,
> I think there's an opportunity to do the refactoring into pg_popcnt_choose.c
> and pg_popcnt_x86_64_accel.c prior to adding the AVX512 stuff.  These
> changes are likely straightforward, and getting them out of the way early
> would make it easier to focus on the more interesting changes.  IMHO there
> are a lot of moving parts in this patch.

As stated above I am doing this in 2 patches. :)

> > +#undef HAVE__GET_CPUID_COUNT
> > +
> > +/* Define to 1 if you have  immintrin. */ #undef HAVE__IMMINTRIN
> 
> Is this missing HAVE__CPUIDEX?

Yes I missed it, I will include in the second patch (AVX specific) of the 2 patches.

> >  uint64
> > -pg_popcount(const char *buf, int bytes)
> > +pg_popcount_slow(const char *buf, int bytes)
> >  {
> >      uint64      popcnt = 0;
> >
> > -#if SIZEOF_VOID_P >= 8
> > +#if SIZEOF_VOID_P == 8
> >      /* Process in 64-bit chunks if the buffer is aligned. */
> >      if (buf == (const char *) TYPEALIGN(8, buf))
> >      {
> > @@ -311,7 +224,7 @@ pg_popcount(const char *buf, int bytes)
> >
> >          buf = (const char *) words;
> >      }
> > -#else
> > +#elif SIZEOF_VOID_P == 4
> >      /* Process in 32-bit chunks if the buffer is aligned. */
> >      if (buf == (const char *) TYPEALIGN(4, buf))
> >      {
> 
> Apologies for harping on this, but I'm still not seeing the need for these
> SIZEOF_VOID_P changes.  While it's unlikely that this makes any practical
> difference, I see no reason to more strictly check SIZEOF_VOID_P here.

I got rid of the second occurrence as I agree it is not needed but unless you see something I don't how to know which
functionto call between a 32-bit and 64-bit architecture? Maybe I am missing something obvious? What exactly do you
suggesthere? I am happy to always call either pg_popcount32() or pg_popcount64() with the understanding that it may not
beoptimal, but I do need to know which to use.
 

> > +    /* Process any remaining bytes */
> > +    while (bytes--)
> > +        popcnt += pg_number_of_ones[(unsigned char) *buf++];
> > +    return popcnt;
> > +#else
> > +    return pg_popcount_slow(buf, bytes);
> > +#endif                          /* USE_AVX512_CODE */
> 
> nitpick: Could we call pg_popcount_slow() in a common section for these
> "remaining bytes?"

Agreed, will fix in the second patch as well.

> > +#if defined(_MSC_VER)
> > +        pg_popcount_indirect = pg_popcount512_fast; #else
> > +        pg_popcount = pg_popcount512_fast; #endif

> These _MSC_VER sections are interesting.  I'm assuming this is the
> workaround for the MSVC linking issue you mentioned above.  I haven't
> looked too closely, but I wonder if the CRC32C code (see
> src/include/port/pg_crc32c.h) is doing something different to avoid this issue.

Using the latest master branch, I see what the needed changes are, I will implement using PGDLLIMPORT macro in the
secondpatch.
 

> Upthread, Alvaro suggested a benchmark [0] that might be useful.  I scanned
> through this thread and didn't see any recent benchmark results for the latest
> form of the patch.  I think it's worth verifying that we are still seeing the
> expected improvements.

I will get new benchmarks using the same process I used before (from Akash) so I get apples to apples. These are
pendingcompletion of the second patch which is still in progress.
 

Just a reminder, I asked questions above about 1) multi-part dependent patches and, 2) What specifically to do about
theSIZE_VOID_P checks. :)
 

Thanks,
Paul


Attachment

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Reports on obsolete Postgres versions
Next
From: Heikki Linnakangas
Date:
Subject: Re: Combine Prune and Freeze records emitted by vacuum