Re: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Nathan Bossart
Subject Re: Popcount optimization using AVX512
Date
Msg-id 20240403225029.GA3851411@nathanxps13
Whole thread Raw
In response to Re: Popcount optimization using AVX512  (Ants Aasma <ants.aasma@cybertec.at>)
Responses Re: Popcount optimization using AVX512
Re: Popcount optimization using AVX512
List pgsql-hackers
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote:
> On Tue, 2 Apr 2024 at 00:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
>> > What about using the masking capabilities of AVX-512 to handle the
>> > tail in the same code path? Masked out portions of a load instruction
>> > will not generate an exception. To allow byte level granularity
>> > masking, -mavx512bw is needed. Based on wikipedia this will only
>> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
>> > VPOPCNTQ implies availability of BW.
>>
>> Sounds promising.  IMHO we should really be sure that these kinds of loads
>> won't generate segfaults and the like due to the masked-out portions.  I
>> searched around a little bit but haven't found anything that seemed
>> definitive.
> 
> After sleeping on the problem, I think we can avoid this question
> altogether while making the code faster by using aligned accesses.
> Loads that straddle cache line boundaries run internally as 2 load
> operations. Gut feel says that there are enough out-of-order resources
> available to make it not matter in most cases. But even so, not doing
> the extra work is surely better. Attached is another approach that
> does aligned accesses, and thereby avoids going outside bounds.
> 
> Would be interesting to see how well that fares in the small use case.
> Anything that fits into one aligned cache line should be constant
> speed, and there is only one branch, but the mask setup and folding
> the separate popcounts together should add up to about 20-ish cycles
> of overhead.

I tested your patch in comparison to v25 and saw the following:

  bytes  v25       v25+ants
    2    1108.205  1033.132
    4    1311.227  1289.373
    8    1927.954  2360.113
   16    2281.091  2365.408
   32    3856.992  2390.688
   64    3648.72   3242.498
  128    4108.549  3607.148
  256    4910.076  4496.852

For 2 bytes and 4 bytes, the inlining should take effect, so any difference
there is likely just noise.  At 8 bytes, we are calling the function
pointer, and there is a small regression with the masking approach.
However, by 16 bytes, the masking approach is on par with v25, and it wins
for all larger buffers, although the gains seem to taper off a bit.

If we can verify this approach won't cause segfaults and can stomach the
regression between 8 and 16 bytes, I'd happily pivot to this approach so
that we can avoid the function call dance that I have in v25.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?
Next
From: Peter Eisentraut
Date:
Subject: Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?