On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share
> the preliminary results with the community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the
> pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this
> implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit
> scenarios relying on popcount.

How does this compare on older CPUs, and in more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction is part of)
has significant implications for core clock frequency while those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
> a. Software only and
> b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
> a. _mm512_popcnt_epi64()
> b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed
> [std::mt19937_64])

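For the sake of discussion, here is a minimal sketch of what I assume a loop built on those two intrinsics looks like. This is my own guess, not the code from the proposal: the function names (popcount_portable, popcount_avx512, popcount_sketch), the runtime dispatch through __builtin_cpu_supports() and the per-function target attribute are all my assumptions, and it needs GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX512_POPCNT_SKETCH 1
#endif

/* Portable fallback: 64 bits at a time, then the leftover bytes. */
static uint64_t
popcount_portable(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    while (len >= 8)
    {
        uint64_t word;

        memcpy(&word, buf, 8);      /* avoids unaligned access issues */
        count += (uint64_t) __builtin_popcountll(word);
        buf += 8;
        len -= 8;
    }
    while (len-- > 0)
        count += (uint64_t) __builtin_popcount(*buf++);
    return count;
}

#ifdef HAVE_AVX512_POPCNT_SKETCH
/* Assumed AVX512 variant: 64 bytes per iteration, one reduction at the end. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static uint64_t
popcount_avx512(const unsigned char *buf, size_t len)
{
    __m512i acc = _mm512_setzero_si512();

    while (len >= 64)
    {
        __m512i v = _mm512_loadu_si512((const void *) buf);

        /* per-lane popcount, accumulated in eight 64-bit lanes */
        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
        buf += 64;
        len -= 64;
    }
    /* horizontal sum of the lanes, plus the tail of fewer than 64 bytes */
    return (uint64_t) _mm512_reduce_add_epi64(acc) +
        popcount_portable(buf, len);
}
#endif

/* Hypothetical entry point: use AVX512 only if the CPU reports support. */
uint64_t
popcount_sketch(const unsigned char *buf, size_t len)
{
#ifdef HAVE_AVX512_POPCNT_SKETCH
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512vpopcntdq"))
        return popcount_avx512(buf, len);
#endif
    return popcount_portable(buf, len);
}
```

Note that in this shape a page-sized input (<8KiB) runs fewer than 128 iterations of the vector loop, so the dispatch check and the final reduction make up a larger share of each call than they do for the 64 MiB+ streams tested above.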
Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not used in postgres' own systems, but which could be
called on bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?

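To make that question concrete, something along these lines is what I have in mind for the small-input case. It is only a sketch under my own assumptions: popcount_words() is a hypothetical stand-in for whichever implementation is under test, bench_small_popcount() is a name I made up, and the empty asm statement is a GCC/Clang-specific trick to keep the compiler from hoisting the call out of the loop. A tight loop like this also won't show any frequency effects on surrounding non-AVX512 code, so it would have to be complemented by a mixed workload.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test; swap in any variant. */
uint64_t
popcount_words(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    while (len >= 8)
    {
        uint64_t word;

        memcpy(&word, buf, 8);
        count += (uint64_t) __builtin_popcountll(word);
        buf += 8;
        len -= 8;
    }
    while (len-- > 0)
        count += (uint64_t) __builtin_popcount(*buf++);
    return count;
}

/*
 * Time `iterations` calls over a single buffer of `len` bytes.
 * Returns the average CPU time per call in nanoseconds, or -1.0 on bad
 * input or allocation failure.  *bits_out receives the buffer's bit count.
 */
double
bench_small_popcount(size_t len, long iterations, uint64_t *bits_out)
{
    unsigned char *buf;
    volatile uint64_t sink = 0;     /* keeps the results "used" */
    clock_t     start, stop;
    size_t      i;
    long        n;

    if (iterations <= 0)
        return -1.0;
    buf = malloc(len > 0 ? len : 1);
    if (buf == NULL)
        return -1.0;

    srand(12345);                   /* fixed seed, repeatable input */
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char) (rand() & 0xFF);

    start = clock();
    for (n = 0; n < iterations; n++)
    {
        /* compiler barrier: forces buf to be re-read every iteration */
        __asm__ volatile("" : : "r"(buf) : "memory");
        sink += popcount_words(buf, len);
    }
    stop = clock();

    *bits_out = popcount_words(buf, len);
    free(buf);
    return (double) (stop - start) / CLOCKS_PER_SEC * 1e9 /
        (double) iterations;
}
```

Calling e.g. bench_small_popcount(8192, 10000000, &bits) for each implementation, and again for a few smaller sizes such as 64 or 512 bytes, would show whether the gain seen on 64 MiB+ streams carries over to page-sized inputs.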
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)