Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Popcount optimization using AVX512
Date
Msg-id Zqlb1_-BWlzVMbZv@nathan
Whole thread Raw
In response to Re: Popcount optimization using AVX512  (Andres Freund <andres@anarazel.de>)
Responses Re: Popcount optimization using AVX512
Re: Popcount optimization using AVX512
List pgsql-hackers
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote:
> I've noticed that the configure probes for this are quite slow - pretty much
> the slowest step in a meson setup (and autoconf is similar).  While looking
> into this, I also noticed that afaict the tests don't do the right thing for
> msvc.
> 
> ...
> [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES
> [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES
> [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES
> [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO
> [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES
> [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO
> [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES
> [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO
> [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES
> [9.367] Checking if "x86_64: popcntq instruction" compiles: YES
> [9.382] Has header "atomic.h" : NO
> ...
> 
> (the times here are a bit exaggerated, enabling them in meson also turns on
> python profiling, which makes everything a bit slower)
> 
> 
> Looks like this is largely the fault of including immintrin.h:
> 
> echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc  - -o /dev/null
> 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k
> 
> echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc  - -o /dev/null
> 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k

Interesting.  Thanks for bringing this to my attention.

> Do we really need to link the generated programs? If we instead were able to
> just rely on the preprocessor, it'd be vastly faster.
> 
> The __sync* and __atomic* checks actually need to link, as the compiler ends
> up generating calls to unimplemented functions if the compilation target
> doesn't support some operation natively - but I don't think that's true for
> the xsave/avx512 stuff
> 
> Afaict we could just check for predefined preprocessor macros:
> 
> echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E  - -o -|grep -E
'__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
> #define __AVX512BW__ 1
> #define __AVX512VPOPCNTDQ__ 1
> #define __XSAVE__ 1
> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k
> 
> echo|time gcc -c -march=nehalem -xc -dM -E  - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__'
> 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k

Seems promising.  I can't think of a reason that wouldn't work.

> Now, a reasonable counter-argument would be that only some of these macros are
> defined for msvc ([1]).  However, as it turns out, the test is broken
> today, as msvc doesn't error out when using an intrinsic that's not
> "available" by the target architecture, it seems to assume that the caller did
> a cpuid check ahead of time.
> 
> 
> Check out [2], it shows the various predefined macros for gcc, clang and msvc.
> 
> 
> ISTM that the msvc checks for xsave/avx512 being broken should be an open
> item?

I'm not following this one.  At the moment, we always do a runtime check
for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and
set the function pointers appropriately, right?  We could, of course, still
fix it, though.

-- 
nathan



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Popcount optimization using AVX512
Next
From: Andrew Dunstan
Date:
Subject: can we mark upper/lower/textlike functions leakproof?