RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers
From | Shankaran, Akash |
---|---|
Subject | RE: Popcount optimization using AVX512 |
Date | |
Msg-id | PH0PR11MB5000EFC19DD2C07F09871161F2B1A@PH0PR11MB5000.namprd11.prod.outlook.com |
In response to | Re: Popcount optimization using AVX512 (Nathan Bossart <nathandbossart@gmail.com>) |
List | pgsql-hackers |
Sorry for the late response here. We spent some time researching and measuring the frequency impact of the AVX512 instructions used here.

> How does this compare to older CPUs, and more mixed workloads? IIRC, the use of AVX512 (which I believe this instruction to be included in) has significant implications for core clock frequency when those instructions are being executed, reducing overall performance if they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the light instructions not so much. See [0] for more details. We captured EMON data for the benchmark used in this work, and see that the instructions are using the licensing level not meant for heavy AVX512 operations. This means the instructions for popcount, _mm512_popcnt_epi64() and _mm512_reduce_add_epi64(), are not going to have any significant impact on CPU clock frequency.

Clock frequency impact aside, we measured the same benchmark for gains on older Intel hardware and observe up to 18% better performance on Intel Icelake. On older Intel hardware, the popcntdq 512 instruction is not present so it won’t work. If clock frequency is not affected, the rest of the workload should not be impacted in the case of mixed workloads.

> Apart from the two type functions bytea_bit_count and bit_bit_count (which are not accessed in postgres' own systems, but which could want to cover bytestreams of >BLCKSZ) the only popcount usages I could find were on objects that fit on a page, i.e. <8KiB in size. How does performance compare for bitstreams of such sizes, especially after any CPU clock implications are taken into account?

Testing this on smaller block sizes < 8KiB shows that AVX512 compared to the current 64bit behavior shows slightly lower performance, but with a large variance. We cannot conclude much from it. The testing with ANALYZE benchmark by Nathan also points to no visible impact as a result of using AVX512.
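For readers following along, the kernel under discussion can be sketched roughly as below. This is only an illustration, not the code from the attached patch: popcount_buf() and popcount64_portable() are hypothetical names, and the AVX512 path is gated on the compiler-defined macros __AVX512F__ and __AVX512VPOPCNTDQ__ (an assumption about how one might guard it), so the file still builds and runs with the scalar fallback when those flags are absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

/* Portable 64-bit popcount (SWAR bit-twiddling), used as the fallback. */
static inline uint64_t
popcount64_portable(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Hypothetical helper (not from the patch): count set bits in buf[0..len). */
uint64_t
popcount_buf(const unsigned char *buf, size_t len)
{
    uint64_t    count = 0;

    assert(buf != NULL || len == 0);

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    {
        /* 64 bytes per iteration: per-lane popcount, accumulated per lane. */
        __m512i     acc = _mm512_setzero_si512();

        while (len >= sizeof(__m512i))
        {
            __m512i     v = _mm512_loadu_si512((const void *) buf);

            acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
            buf += sizeof(__m512i);
            len -= sizeof(__m512i);
        }
        /* Horizontal sum of the eight 64-bit lanes. */
        count += (uint64_t) _mm512_reduce_add_epi64(acc);
    }
#endif

    /* Scalar path: whole 64-bit words, loaded via memcpy so that
     * unaligned input is safe. */
    while (len >= sizeof(uint64_t))
    {
        uint64_t    word;

        memcpy(&word, buf, sizeof(word));
        count += popcount64_portable(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }

    /* Trailing bytes that do not fill a 64-bit word. */
    for (; len > 0; len--)
        count += popcount64_portable(*buf++);

    return count;
}
```

Note that for inputs shorter than 64 bytes the vector loop never runs, so everything falls through to the scalar path. Small buffers therefore get little or nothing out of a kernel shaped like this one.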
The gains on larger datasets are easily evident, with less variance. What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then test it more thoroughly over time on this particular use case?

Regarding enablement, following the other responses related to function inlining, using ifunc and enabling future intrinsic support, it seems a concrete solution would require further discussion. We’re attaching a patch to enable AVX512, which can use AVX512 flags during build. For example:

>make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature, i.e. AVX512. Open to additional ideas if this doesn’t seem like the right approach here.

[0] https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch <noah@leadboat.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; Amonson, Paul D <paul.d.amonson@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
>> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
>> > The glibc/gcc "ifunc" mechanism was designed to solve this problem
>> > of choosing a function implementation based on the runtime CPU,
>> > without incurring function pointer overhead.
>> > I would not attempt to use AVX512 on non-glibc systems, and I would
>> > use ifunc to select the desired popcount implementation on glibc:
>> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
>>
>> Thanks, that seems promising for the function pointer cases. I'll
>> plan on trying to convert one of the existing ones to use it. BTW it
>> looks like LLVM has something similar [0].
>>
>> IIUC this unfortunately wouldn't help for cases where we wanted to
>> keep stuff inlined, such as is_valid_ascii() and the functions in
>> pg_lfind.h, unless we applied it to the calling functions, but that
>> doesn't sound particularly maintainable.
>
> Agreed, it doesn't solve inline cases. If the gains are big enough,
> we should move toward packages containing N CPU-specialized copies of
> the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records. I've attached the patch used for testing.

The multiple-copies-of-the-postgres-binary idea seems interesting. That's probably not something that could be enabled by default, but perhaps we could add support for a build option.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
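To make the "function pointer cases" in the quoted discussion concrete, here is a minimal sketch of runtime dispatch through a self-replacing function pointer, which is the general style that ifunc is meant to replace. It is not taken from either attached patch. All names (my_popcount64, popcount64_choose, popcount64_fast, popcount64_slow, HAVE_X86_POPCNT_DISPATCH) are made up for illustration, and the CPU check assumes a GCC-compatible compiler on x86; elsewhere the portable version is always selected.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: SWAR bit-twiddling popcount. */
static int
popcount64_slow(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_POPCNT_DISPATCH 1

/* POPCNT is enabled for this one function only, so the rest of the
 * file can still be built for a baseline CPU. */
__attribute__((target("popcnt")))
static int
popcount64_fast(uint64_t x)
{
    return __builtin_popcountll(x);
}
#endif

static int  popcount64_choose(uint64_t x);

/* Callers go through this pointer; it starts out aimed at the chooser. */
int         (*my_popcount64) (uint64_t) = popcount64_choose;

/* First call lands here: pick an implementation, repoint, then forward. */
static int
popcount64_choose(uint64_t x)
{
#ifdef HAVE_X86_POPCNT_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        my_popcount64 = popcount64_fast;
    else
#endif
        my_popcount64 = popcount64_slow;

    /* Guard against accidentally recursing into the chooser. */
    assert(my_popcount64 != popcount64_choose);
    return my_popcount64(x);
}
```

Every call pays for an indirect jump through my_popcount64, and the callee cannot be inlined into its callers. That is the limitation raised above for is_valid_ascii() and the functions in pg_lfind.h.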