Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers
From | Nathan Bossart |
---|---|
Subject | Re: Popcount optimization using AVX512 |
Date | |
Msg-id | Zqlb1_-BWlzVMbZv@nathan Whole thread Raw |
In response to | Re: Popcount optimization using AVX512 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Popcount optimization using AVX512
Re: Popcount optimization using AVX512 |
List | pgsql-hackers |
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > I've noticed that the configure probes for this are quite slow - pretty much > the slowest step in a meson setup (and autoconf is similar). While looking > into this, I also noticed that afaict the tests don't do the right thing for > msvc. > > ... > [6.825] Checking if "__sync_val_compare_and_swap(int64)" : links: YES > [6.883] Checking if " __atomic_compare_exchange_n(int32)" : links: YES > [6.940] Checking if " __atomic_compare_exchange_n(int64)" : links: YES > [7.481] Checking if "XSAVE intrinsics without -mxsave" : links: NO > [8.097] Checking if "XSAVE intrinsics with -mxsave" : links: YES > [8.641] Checking if "AVX-512 popcount without -mavx512vpopcntdq -mavx512bw" : links: NO > [9.183] Checking if "AVX-512 popcount with -mavx512vpopcntdq -mavx512bw" : links: YES > [9.242] Checking if "_mm_crc32_u8 and _mm_crc32_u32 without -msse4.2" : links: NO > [9.333] Checking if "_mm_crc32_u8 and _mm_crc32_u32 with -msse4.2" : links: YES > [9.367] Checking if "x86_64: popcntq instruction" compiles: YES > [9.382] Has header "atomic.h" : NO > ... > > (the times here are a bit exaggerated, enabling them in meson also turns on > python profiling, which makes everything a bit slower) > > > Looks like this is largely the fault of including immintrin.h: > > echo -e '#include <immintrin.h>\nint main(){return _xgetbv(0) & 0xe0;}'|time gcc -mxsave -xc - -o /dev/null > 0.45user 0.04system 0:00.50elapsed 99%CPU (0avgtext+0avgdata 94184maxresident)k > > echo -e '#include <immintrin.h>\n'|time gcc -c -mxsave -xc - -o /dev/null > 0.43user 0.03system 0:00.46elapsed 99%CPU (0avgtext+0avgdata 86004maxresident)k Interesting. Thanks for bringing this to my attention. > Do we really need to link the generated programs? If we instead were able to > just rely on the preprocessor, it'd be vastly faster. > > The __sync* and __atomic* checks actually need to link, as the compiler ends > up generating calls to unimplemented functions if the compilation target > doesn't support some operation natively - but I don't think that's true for > the xsave/avx512 stuff > > Afaict we could just check for predefined preprocessor macros: > > echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > #define __AVX512BW__ 1 > #define __AVX512VPOPCNTDQ__ 1 > #define __XSAVE__ 1 > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 13292maxresident)k > > echo|time gcc -c -march=nehalem -xc -dM -E - -o -|grep -E '__XSAVE__|__AVX512BW__|__AVX512VPOPCNTDQ__' > 0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 10972maxresident)k Seems promising. I can't think of a reason that wouldn't work. > Now, a reasonable counter-argument would be that only some of these macros are > defined for msvc ([1]). However, as it turns out, the test is broken > today, as msvc doesn't error out when using an intrinsic that's not > "available" by the target architecture, it seems to assume that the caller did > a cpuid check ahead of time. > > > Check out [2], it shows the various predefined macros for gcc, clang and msvc. > > > ISTM that the msvc checks for xsave/avx512 being broken should be an open > item? I'm not following this one. At the moment, we always do a runtime check for the AVX-512 stuff, so in the worst case we'd check CPUID at startup and set the function pointers appropriately, right? We could, of course, still fix it, though. -- nathan
pgsql-hackers by date: