Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Popcount optimization using AVX512
Date
Msg-id 20231107022240.GA729644@nathanxps13
In response to Re: Popcount optimization using AVX512  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses Re: Popcount optimization using AVX512
List pgsql-hackers
On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:
> On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>> This proposal showcases the speed-up provided to popcount feature when
>> using AVX512 registers. The intent is to share the preliminary results
>> with the community and get feedback for adding avx512 support for
>> popcount.
>>
>> Revisiting the previous discussion/improvements around this feature, I
>> have created a micro-benchmark based on the pg_popcount() in
>> PostgreSQL's current implementations for x86_64 using the newer AVX512
>> intrinsics. Playing with this implementation has improved performance up
>> to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
>> benefit scenarios relying on popcount.

Nice.  I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too.  I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics.  So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86).  But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway.  And the latter limits us to stuff that has been around for a
decade or two.
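
To make the two detection strategies concrete, here is a minimal sketch in C, assuming GCC or Clang. The names (popcount64_ptr, popcount64_choose, popcount64_inline, and so on) are hypothetical and chosen for illustration; this is not the code in the PostgreSQL tree. The first approach resolves a function pointer at run time on first use, and the second relies on a predefined compiler macro so the call can be inlined.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t word)
{
    int         count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* Built with POPCNT enabled for this one function only. */
__attribute__((target("popcnt")))
static int
popcount64_fast(uint64_t word)
{
    return __builtin_popcountll(word);
}
#endif

static int  popcount64_choose(uint64_t word);

/*
 * Strategy 1: run-time dispatch.  The pointer starts out aimed at the
 * chooser, which probes the CPU once and then redirects the pointer.
 * Every later call pays for an indirect call and cannot be inlined.
 */
static int  (*popcount64_ptr) (uint64_t) = popcount64_choose;

static int
popcount64_choose(uint64_t word)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("popcnt"))
        popcount64_ptr = popcount64_fast;
    else
#endif
        popcount64_ptr = popcount64_slow;

    return popcount64_ptr(word);
}

/*
 * Strategy 2: compile-time detection.  __POPCNT__ is only defined when
 * the compiler was told the target supports the instruction (e.g. via
 * -mpopcnt or -march=...), so this is inlinable but fixed at build time.
 */
static inline int
popcount64_inline(uint64_t word)
{
#ifdef __POPCNT__
    return __builtin_popcountll(word);
#else
    return popcount64_slow(word);
#endif
}
```

A caller would use popcount64_ptr(x) where portability across CPUs matters and popcount64_inline(x) where the build flags already guarantee the instruction.  The trade-off is the one described above: the pointer costs an indirect call, while the macro only helps when the build targets a sufficiently new CPU.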

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

> Apart from the two type functions bytea_bit_count and bit_bit_count
> (which are not accessed in postgres' own systems, but which could want
> to cover bytestreams of >BLCKSZ) the only popcount usages I could find
> were on objects that fit on a page, i.e. <8KiB in size. How does
> performance compare for bitstreams of such sizes, especially after any
> CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count().  I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far.  One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.
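
As a rough illustration of the masking, here is a sketch in C.  It assumes the visibility map stores two bits per heap page (all-visible in the even bit positions, all-frozen in the odd ones); the mask values, count_masked_bits(), and popcount64_portable() are written for this example and are not copied from the tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: even bits are "all-visible", odd bits are "all-frozen". */
#define VISIBLE_MASK64 UINT64_C(0x5555555555555555)
#define FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)

/* Portable stand-in for a 64-bit popcount (hypothetical helper). */
static int
popcount64_portable(uint64_t word)
{
    int         count = 0;

    while (word != 0)
    {
        word &= word - 1;       /* clear the lowest set bit */
        count++;
    }
    return count;
}

/*
 * Count the two kinds of bits separately.  Each word has to be ANDed
 * with a mask before it is counted, so the buffer cannot simply be
 * handed to a bulk popcount routine as-is.
 */
static void
count_masked_bits(const uint64_t *map, size_t nwords,
                  uint64_t *nvisible, uint64_t *nfrozen)
{
    uint64_t    visible = 0;
    uint64_t    frozen = 0;

    for (size_t i = 0; i < nwords; i++)
    {
        visible += popcount64_portable(map[i] & VISIBLE_MASK64);
        frozen += popcount64_portable(map[i] & FROZEN_MASK64);
    }

    *nvisible = visible;
    *nfrozen = frozen;
}
```

A vectorized version would presumably need the same AND applied per vector before the popcount step, which is where the extra instructions come from.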

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


