Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Popcount optimization using AVX512
Date
Msg-id CAEze2WjaFLhp7=Eo-mbSaHsoeq7ZEs00yZ2+FpSpijH+KN_hbA@mail.gmail.com
In response to Popcount optimization using AVX512  ("Amonson, Paul D" <paul.d.amonson@intel.com>)
Responses Re: Popcount optimization using AVX512
List pgsql-hackers
On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>
> This proposal showcases the speed-up provided to popcount feature when using AVX512 registers. The intent is to share
> the preliminary results with the community and get feedback for adding avx512 support for popcount.
>
> Revisiting the previous discussion/improvements around this feature, I have created a micro-benchmark based on the
> pg_popcount() in PostgreSQL's current implementations for x86_64 using the newer AVX512 intrinsics. Playing with this
> implementation has improved performance up to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will benefit
> scenarios relying on popcount.

How does this compare on older CPUs and in more mixed workloads? IIRC,
the use of AVX512 (which I believe this instruction to be part of)
has significant implications for core clock frequency while those
instructions are being executed, reducing overall performance if
they're not a large part of the workload.

> My setup:
>
> Machine: AWS EC2 m7i - 16vcpu, 64gb RAM
> OS : Ubuntu 22.04
> GCC: 11.4 and 12.3 with flags "-mavx -mavx512vpopcntdq -mavx512vl -march=native -O2".
>
> 1. I copied the pg_popcount() implementation into a new C/C++ project using cmake/make.
>         a. Software only and
>         b. SSE 64 bit version
> 2. I created an implementation using the following AVX512 intrinsics:
>         a. _mm512_popcnt_epi64()
>         b. _mm512_reduce_add_epi64()
> 3. I tested random bit streams from 64 MiB to 1024 MiB in length (5 sizes; repeatable with RNG seed
> [std::mt19937_64])
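
For reference, the kernel described in step 2 might look roughly like
the sketch below. This is not the code from the benchmark above: the
name popcount_sketch is hypothetical, and the scalar fallback (the
GCC/Clang builtin __builtin_popcountll) is an assumption added so the
example also builds without -mavx512vpopcntdq.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
#include <immintrin.h>
#endif

/* Hypothetical helper for illustration; not a PostgreSQL function. */
static uint64_t
popcount_sketch(const unsigned char *buf, size_t bytes)
{
    uint64_t    total = 0;

#if defined(__AVX512F__) && defined(__AVX512VPOPCNTDQ__)
    /* Eight 64-bit lanes, each accumulating its own bit count. */
    __m512i     acc = _mm512_setzero_si512();

    while (bytes >= 64)
    {
        __m512i     v = _mm512_loadu_si512((const void *) buf);

        acc = _mm512_add_epi64(acc, _mm512_popcnt_epi64(v));
        buf += 64;
        bytes -= 64;
    }
    /* Horizontal sum of the eight lane counters. */
    total = (uint64_t) _mm512_reduce_add_epi64(acc);
#endif

    /* Scalar path: whole input without AVX512, else only the tail. */
    while (bytes >= 8)
    {
        uint64_t    w;

        memcpy(&w, buf, sizeof(w));     /* avoids unaligned access */
        total += (uint64_t) __builtin_popcountll(w);
        buf += 8;
        bytes -= 8;
    }
    while (bytes-- > 0)
        total += (uint64_t) __builtin_popcount(*buf++);

    return total;
}
```

Built without the AVX512 flags, only the scalar loops are compiled, so
the same source can serve as a baseline for comparison.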

Apart from the two type functions bytea_bit_count and bit_bit_count
(which are not accessed in postgres' own systems, but which could want
to cover bytestreams of >BLCKSZ) the only popcount usages I could find
were on objects that fit on a page, i.e. <8KiB in size. How does
performance compare for bitstreams of such sizes, especially after any
CPU clock implications are taken into account?
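
To make the question concrete, a page-sized measurement could be
sketched as below. Everything here is an assumption for illustration:
the names popcount_words and time_popcount are hypothetical, PAGE_BYTES
assumes the default 8 KiB BLCKSZ, and the scalar builtin stands in for
whichever implementation is under test. A tight loop like this does not
capture clock-frequency effects on the code surrounding the call, so it
can only answer the first half of the question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 8192         /* assumes the default BLCKSZ of 8 KiB */

/* Scalar stand-in for the implementation under test (hypothetical name). */
static uint64_t
popcount_words(const unsigned char *buf, size_t bytes)
{
    uint64_t    total = 0;

    while (bytes >= 8)
    {
        uint64_t    w;

        memcpy(&w, buf, sizeof(w));
        total += (uint64_t) __builtin_popcountll(w);
        buf += 8;
        bytes -= 8;
    }
    while (bytes-- > 0)
        total += (uint64_t) __builtin_popcount(*buf++);
    return total;
}

/*
 * Returns the average CPU seconds per call over iters calls.  The summed
 * result is written to *sink so the work cannot be optimized away.
 */
static double
time_popcount(const unsigned char *buf, size_t bytes, int iters,
              uint64_t *sink)
{
    uint64_t    acc = 0;
    clock_t     start = clock();

    for (int i = 0; i < iters; i++)
    {
        /* Compiler barrier: keeps the call from being hoisted out. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        acc += popcount_words(buf, bytes);
    }
    *sink = acc;
    return (double) (clock() - start) / CLOCKS_PER_SEC / iters;
}
```

Calling time_popcount() with PAGE_BYTES and with a much larger buffer,
for each implementation, would show whether the reported gain survives
at sizes below one page.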

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
