RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Shankaran, Akash
Subject RE: Popcount optimization using AVX512
Date
Msg-id PH0PR11MB5000EFC19DD2C07F09871161F2B1A@PH0PR11MB5000.namprd11.prod.outlook.com
In response to Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
List pgsql-hackers
Sorry for the late response here. We spent some time researching and measuring the frequency impact of the AVX512
instructions used here.
 

> How does this compare to older CPUs, and more mixed workloads? IIRC,
> the use of AVX512 (which I believe this instruction to be included in)
> has significant implications for core clock frequency when those
> instructions are being executed, reducing overall performance if
> they're not a large part of the workload.

AVX512 has light and heavy instructions. While the heavy AVX512 instructions have clock frequency implications, the
light instructions not so much. See [0] for more details. We captured EMON data for the benchmark used in this work, and
see that the instructions run at a license level that is not the one meant for heavy AVX512 operations. This means the
popcount instructions, _mm512_popcnt_epi64() and _mm512_reduce_add_epi64(), are not going to have any significant
impact on CPU clock frequency.
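
For reference, here is a minimal sketch of the kind of loop these two intrinsics are typically used in. It is not the attached patch: the names popcount_portable, popcount_avx512 and popcount_sketch are hypothetical, and the sketch assumes GCC or Clang on x86-64 (for the target attribute and __builtin_cpu_supports). On other platforms only the portable path is compiled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX512_POPCNT_SKETCH 1	/* hypothetical macro, sketch only */
#endif

/* Fallback: count set bits one 64-bit word at a time, then byte by byte. */
static uint64_t
popcount_portable(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	while (bytes >= sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* safe for unaligned input */
		count += (uint64_t) __builtin_popcountll(word);
		buf += sizeof(word);
		bytes -= sizeof(word);
	}
	while (bytes-- > 0)
		count += (uint64_t) __builtin_popcount(*buf++);
	return count;
}

#ifdef HAVE_AVX512_POPCNT_SKETCH
/* AVX512 path: 64 bytes per iteration, one horizontal reduction at the end. */
static uint64_t __attribute__((target("avx512f,avx512vpopcntdq")))
popcount_avx512(const unsigned char *buf, size_t bytes)
{
	__m512i		accum = _mm512_setzero_si512();

	while (bytes >= sizeof(__m512i))
	{
		__m512i		chunk = _mm512_loadu_si512((const void *) buf);

		/* per-lane popcount, accumulated in eight 64-bit lanes */
		accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(chunk));
		buf += sizeof(__m512i);
		bytes -= sizeof(__m512i);
	}
	/* sum the eight lanes, then handle the tail with the fallback */
	return (uint64_t) _mm512_reduce_add_epi64(accum) +
		popcount_portable(buf, bytes);
}
#endif

/* Use the AVX512 path only when the running CPU reports support for it. */
uint64_t
popcount_sketch(const unsigned char *buf, size_t bytes)
{
#ifdef HAVE_AVX512_POPCNT_SKETCH
	if (__builtin_cpu_supports("avx512vpopcntdq"))
		return popcount_avx512(buf, bytes);
#endif
	return popcount_portable(buf, bytes);
}
```

The runtime check in popcount_sketch is there only so the sketch can run on any machine; how the patch itself selects the implementation is a separate question.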
 
Clock frequency impact aside, we measured the same benchmark for gains on older Intel hardware and observe up to 18%
better performance on Intel Icelake. On still older Intel hardware, the AVX512 popcntdq instruction is not present, so it won’t
work. If clock frequency is not affected, the rest of the workload should not be impacted in the case of mixed workloads.
 

> Apart from the two type functions bytea_bit_count and bit_bit_count
> (which are not accessed in postgres' own systems, but which could want
> to cover bytestreams of >BLCKSZ) the only popcount usages I could find
> were on objects that fit on a page, i.e. <8KiB in size. How does
> performance compare for bitstreams of such sizes, especially after any
> CPU clock implications are taken into account?

Testing this on smaller sizes (< 8KiB) shows slightly lower performance for AVX512 compared to the current 64-bit
behavior, but with a large variance, so we cannot conclude much from it. The testing with the ANALYZE benchmark by
Nathan also points to no visible impact as a result of using AVX512. The gains on larger datasets are easily evident, with
less variance.
 
What are your thoughts if we introduce AVX512 popcount for smaller sizes as an optional feature initially, and then
test it more thoroughly over time on this particular use case?
 

Regarding enablement, following the other responses related to function inlining, using ifunc and enabling future
intrinsic support, it seems a concrete solution would require further discussion. We’re attaching a patch to enable
AVX512, which can use AVX512 flags during the build. For example:
 
  >make -E CFLAGS_AVX512="-mavx -mavx512dq -mavx512vpopcntdq -mavx512vl -march=icelake-server -DAVX512_POPCNT=1"

Thoughts or feedback on the approach in the patch? This solution should not impact anyone who doesn’t use the feature
(i.e., AVX512). Open to additional ideas if this doesn’t seem like the right approach here.
 

[0] https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com> 
Sent: Tuesday, November 7, 2023 12:15 PM
To: Noah Misch <noah@leadboat.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>; Amonson, Paul D
<paul.d.amonson@intel.com>; pgsql-hackers@lists.postgresql.org; Shankaran, Akash <akash.shankaran@intel.com>
 
Subject: Re: Popcount optimization using AVX512

On Mon, Nov 06, 2023 at 09:53:15PM -0800, Noah Misch wrote:
> On Mon, Nov 06, 2023 at 09:59:26PM -0600, Nathan Bossart wrote:
>> On Mon, Nov 06, 2023 at 07:15:01PM -0800, Noah Misch wrote:
>> > The glibc/gcc "ifunc" mechanism was designed to solve this problem 
>> > of choosing a function implementation based on the runtime CPU, 
>> > without incurring function pointer overhead.  I would not attempt 
>> > to use AVX512 on non-glibc systems, and I would use ifunc to select the desired popcount implementation on glibc:
>> > https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html
>> 
>> Thanks, that seems promising for the function pointer cases.  I'll 
>> plan on trying to convert one of the existing ones to use it.  BTW it 
>> looks like LLVM has something similar [0].
>> 
>> IIUC this unfortunately wouldn't help for cases where we wanted to 
>> keep stuff inlined, such as is_valid_ascii() and the functions in 
>> pg_lfind.h, unless we applied it to the calling functions, but that 
>> doesn't sound particularly maintainable.
> 
> Agreed, it doesn't solve inline cases.  If the gains are big enough, 
> we should move toward packages containing N CPU-specialized copies of 
> the postgres binary, with bin/postgres just exec'ing the right one.

I performed a quick test with ifunc on my x86 machine that ordinarily uses the runtime checks for the CRC32C code, and
I actually see a consistent 3.5% regression for pg_waldump -z on 100M 65-byte records.  I've attached the patch used for
testing.
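
To make the ifunc mechanism discussed above concrete, here is a minimal sketch of a resolver that picks an implementation once, at load time. It is not the attached test patch and it uses popcount rather than CRC32C. The names popcount_generic, popcount_hw, resolve_popcount and popcount_dispatch are hypothetical, and the ifunc branch is only compiled on x86-64 ELF systems with glibc and GCC or Clang; everywhere else the sketch falls back to a plain function.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const unsigned char *, size_t);

/* Generic implementation: clear the lowest set bit until none remain. */
static uint64_t
popcount_generic(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < bytes; i++)
	{
		unsigned char b = buf[i];

		while (b)
		{
			b &= (unsigned char) (b - 1);
			count++;
		}
	}
	return count;
}

#if defined(__x86_64__) && defined(__GNUC__) && defined(__ELF__) && defined(__GLIBC__)

/* Variant compiled with the POPCNT instruction enabled for this function only. */
static uint64_t __attribute__((target("popcnt")))
popcount_hw(const unsigned char *buf, size_t bytes)
{
	uint64_t	count = 0;

	for (size_t i = 0; i < bytes; i++)
		count += (uint64_t) __builtin_popcount(buf[i]);
	return count;
}

/*
 * Resolver: run by the dynamic loader when the symbol is bound.  It may run
 * before constructors, hence the explicit __builtin_cpu_init() call.
 */
static popcount_fn
resolve_popcount(void)
{
	__builtin_cpu_init();
	return __builtin_cpu_supports("popcnt") ? popcount_hw : popcount_generic;
}

/* Callers use an ordinary symbol; no function pointer appears in the source. */
uint64_t	popcount_dispatch(const unsigned char *buf, size_t bytes)
			__attribute__((ifunc("resolve_popcount")));

#else

/* Fallback for platforms without ifunc support. */
uint64_t
popcount_dispatch(const unsigned char *buf, size_t bytes)
{
	return popcount_generic(buf, bytes);
}

#endif
```

Calls to popcount_dispatch still go through an indirect jump (via the PLT), so this sketch says nothing about whether ifunc is faster or slower than an explicit function pointer for a given workload.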

The multiple-copies-of-the-postgres-binary idea seems interesting.  That's probably not something that could be enabled
by default, but perhaps we could add support for a build option.
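
As a rough illustration of the multiple-binaries idea, a launcher could probe the CPU once and exec a matching build. This is only a sketch under assumptions: the "postgres-avx512", "postgres-popcnt" and "postgres-generic" file names, the bindir layout, and the function names are all hypothetical, and the CPU probe assumes GCC or Clang on x86-64.

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Pick a suffix for the specialized binary (hypothetical naming scheme). */
static const char *
choose_binary_suffix(void)
{
#if defined(__x86_64__) && defined(__GNUC__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512vpopcntdq"))
		return "avx512";
	if (__builtin_cpu_supports("popcnt"))
		return "popcnt";
#endif
	return "generic";
}

/* Build "<bindir>/postgres-<suffix>"; returns 0 on success, -1 if it does not fit. */
int
build_specialized_path(char *out, size_t outlen, const char *bindir)
{
	int			n = snprintf(out, outlen, "%s/postgres-%s",
							 bindir, choose_binary_suffix());

	return (n > 0 && (size_t) n < outlen) ? 0 : -1;
}

/* Replace the current process with the specialized binary; returns only on failure. */
int
exec_specialized(const char *bindir, char *const argv[])
{
	char		path[1024];

	if (build_specialized_path(path, sizeof(path), bindir) != 0)
		return -1;
	execv(path, argv);
	return -1;
}
```

A thin bin/postgres built this way would call exec_specialized() from main() with its own argv, so the CPU probe happens once per process start rather than on every call.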
 

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
