RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Amonson, Paul D
Subject RE: Popcount optimization using AVX512
Date
Msg-id BL1PR11MB5304F65FD8D19B608F9E8624DC482@BL1PR11MB5304.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: Popcount optimization using AVX512  (Andres Freund <andres@anarazel.de>)
Responses Re: Popcount optimization using AVX512
List pgsql-hackers
My responses with questions,

> > +# XXX: The configure.ac check for __cpuidex() is broken, we don't 
> > +copy that # here. To prevent problems due to two detection methods 
> > +working, stop # checking after one.
>
> This seems like a bogus copy-paste.

My bad. Will remove the offending comment.  :)

> > +# Check for header immintrin.h
> ...
> Do these all actually have to link?  Invoking the linker is slow.
> I think you might be able to just use cc.has_header_symbol().

I took this to mean the last of the 3 new blocks. I changed this one to the cc_has_header method. I think I do want the
first2 checking the link as well. If the don't link here they won't link in the actual build.
 

> Does this work with msvc?

I think it will work but I have no way to validate it. I propose we remove the AVX-512 popcount feature from MSVC
builds.Sound ok?
 

> That's a very long line in the output, how about using the avx feature name or something?

Agree, will fix.

> This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject
automaticuses of avx512 in places, which would cause problems, no?
 

This will take me some time to learn how to do this in meson. Any pointers here would be helpful. 

> While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly
thatreson? That's why the existing code builds the files for various crc variants as their own file.
 

I don't think its broken, nothing else in pg_bitutils.c will make use of AVX-512, so I am not sure what dividing this
upinto multiple files will yield benefits beyond code readability as they will all be needed during compile time. I
preferto not split if the community agrees to it.
 
 
If splitting still makes sense, I propose splitting into 3 files:  pg_bitutils.c (entry point +sw popcnt
implementation),pg_popcnt_choose.c (CPUID and xgetbv check) and pg_popcnt_x86_64_accel.c (64/512bit x86
implementations).
 
I'm not an expert in meson, but splitting might add complexity to meson.build. 

Could you elaborate if there are other benefits to the split file approach?

Paul


-----Original Message-----
From: Andres Freund <andres@anarazel.de> 
Sent: Friday, February 9, 2024 10:35 AM
To: Amonson, Paul D <paul.d.amonson@intel.com>
Cc: Alvaro Herrera <alvherre@alvh.no-ip.org>; Shankaran, Akash <akash.shankaran@intel.com>; Nathan Bossart
<nathandbossart@gmail.com>;Noah Misch <noah@leadboat.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent
<boekewurm+postgres@gmail.com>;pgsql-hackers@lists.postgresql.org
 
Subject: Re: Popcount optimization using AVX512

Hi,

On 2024-02-09 17:39:46 +0000, Amonson, Paul D wrote:

> diff --git a/meson.build b/meson.build index 8ed51b6aae..1e7a4dc942 
> 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1773,6 +1773,45 @@ elif cc.links('''
>  endif
>  
>  
> +# XXX: The configure.ac check for __cpuidex() is broken, we don't 
> +copy that # here. To prevent problems due to two detection methods 
> +working, stop # checking after one.

This seems like a bogus copy-paste.


> +if cc.links('''
> +    #include <cpuid.h>
> +    int main(int arg, char **argv)
> +    {
> +        unsigned int exx[4] = {0, 0, 0, 0};
> +        __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]);
> +    }
> +    ''', name: '__get_cpuid_count',
> +    args: test_c_args)
> +  cdata.set('HAVE__GET_CPUID_COUNT', 1) elif cc.links('''
> +    #include <intrin.h>
> +    int main(int arg, char **argv)
> +    {
> +        unsigned int exx[4] = {0, 0, 0, 0};
> +        __cpuidex(exx, 7, 0);
> +    }
> +    ''', name: '__cpuidex',
> +    args: test_c_args)
> +  cdata.set('HAVE__CPUIDEX', 1)
> +endif
> +
> +
> +# Check for header immintrin.h
> +if cc.links('''
> +    #include <immintrin.h>
> +    int main(int arg, char **argv)
> +    {
> +      return 1701;
> +    }
> +    ''', name: '__immintrin',
> +    args: test_c_args)
> +  cdata.set('HAVE__IMMINTRIN', 1)
> +endif

Do these all actually have to link?  Invoking the linker is slow.

I think you might be able to just use cc.has_header_symbol().



> +###############################################################
> +# AVX 512 POPCNT Intrinsic check
> +###############################################################
> +have_avx512_popcnt = false
> +cflags_avx512_popcnt = []
> +if host_cpu == 'x86_64'
> +  prog = '''
> +      #include <immintrin.h>
> +      #include <stdint.h>
> +      void main(void)
> +      {
> +        __m512i tmp __attribute__((aligned(64)));
> +        __m512i input = _mm512_setzero_si512();
> +        __m512i output = _mm512_popcnt_epi64(input);
> +        uint64_t cnt = 999;
> +        _mm512_store_si512(&tmp, output);
> +        cnt = _mm512_reduce_add_epi64(tmp);
> +        /* return computed value, to prevent the above being optimized away */
> +        return cnt == 0;
> +      }'''

Does this work with msvc?


> +    if cc.links(prog, name: '_mm512_setzero_si512, 
> + _mm512_popcnt_epi64, _mm512_store_si512, and _mm512_reduce_add_epi64 
> + with -mavx512vpopcntdq -mavx512f',

That's a very long line in the output, how about using the avx feature name or something?



> diff --git a/src/port/Makefile b/src/port/Makefile index 
> dcc8737e68..6a01a7d89a 100644
> --- a/src/port/Makefile
> +++ b/src/port/Makefile
> @@ -87,6 +87,11 @@ pg_crc32c_sse42.o: CFLAGS+=$(CFLAGS_CRC)
>  pg_crc32c_sse42_shlib.o: CFLAGS+=$(CFLAGS_CRC)
>  pg_crc32c_sse42_srv.o: CFLAGS+=$(CFLAGS_CRC)
>  
> +# Newer Intel processors can use AVX-512 POPCNT Capabilities 
> +(01/30/2024)
> +pg_bitutils.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +pg_bitutils_shlib.o: CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +pg_bitutils_srv.o:CFLAGS+=$(CFLAGS_AVX512_POPCNT)
> +
>  # all versions of pg_crc32c_armv8.o need CFLAGS_CRC
>  pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
>  pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC) diff --git 
> a/src/port/meson.build b/src/port/meson.build index 
> 69b30ab21b..1c48a3b07e 100644
> --- a/src/port/meson.build
> +++ b/src/port/meson.build
> @@ -184,6 +184,7 @@ foreach name, opts : pgport_variants
>        link_with: cflag_libs,
>        c_pch: pch_c_h,
>        kwargs: opts + {
> +        'c_args': opts.get('c_args', []) + cflags_avx512_popcnt,
>          'dependencies': opts['dependencies'] + [ssl],
>        }
>      )

This will build all of pgport with the avx flags, which wouldn't be correct, I think? The compiler might inject
automaticuses of avx512 in places, which would cause problems, no?
 

While you don't do the same for make, isn't even just using the avx512 for all of pg_bitutils.c broken for exactly that
reson?That's why the existing code builds the files for various crc variants as their own file.
 


Greetings,

Andres Freund

pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: backend *.c #include cleanup (IWYU)
Next
From: Melanie Plageman
Date:
Subject: Why does BitmapPrefetch() skip fetch based on current block recheck flag