RE: Popcount optimization using AVX512 - Mailing list pgsql-hackers

From Amonson, Paul D
Subject RE: Popcount optimization using AVX512
Date
Msg-id BL1PR11MB5304E4A5AA87D9F0126B4502DC462@BL1PR11MB5304.namprd11.prod.outlook.com
Whole thread Raw
In response to Re: Popcount optimization using AVX512  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Popcount optimization using AVX512
List pgsql-hackers
Hi All,

> However, it would be much better if the improved code were available for
> all relevant builds and activated if a CPUID test determines that the
> relevant instructions are available, instead of requiring a compile-time
> flag -- which most builds are not going to use, thus wasting the
> opportunity for running the optimized code.
 
This makes sense. I addressed the feedback, and am attaching an updated patch. Patch also addresses your feedback of
autconfconfigurations by adding CFLAG support. I tested the runtime check for AVX512 on multiple processors with and
withoutAVX512 and it detected or failed to detect the feature as expected.
 
 
> Looking at the existing code, I would also consider renaming
> the "_fast" variants to something like pg_popcount32_asml/
> pg_popcount64_asmq so that you can name the new one pg_popcount64_asmdq
> or such.
 
I left out the renaming, as it made sense to keep the fast/slow naming for readability.
 
> Finally, the matter of using ifunc as proposed by Noah seems to be still
> in the air, with no patches offered for the popcount family. Given that
> Nathan reports [1] a performance decrease, maybe we should set that
> thought aside for now and continue to use function pointers.
 
Since there are improvements without it (results below), I agree with you to continue using function pointers.
 
I collected data on machines with, and without AVX512 support, using a table with 1M rows and performing SQL
bit_count()on a char column containing (84bytes, 4KiB, 8KiB, 16KiB).
 
    * On non-AVX 512 hardware: no regression or impact at runtime with code built with AVX 512 support in the binary
betweenthe patched and unpatched servers.
 
    * On AVX512 hardware: the max improvement I saw was 17% but was averaged closer to 6.5% on a bare-metal machine.
Thebenefit is lower on smaller cloud VMs on AWS (1 - 3%)
 
 
If the patch looks good, please suggest next steps on committing it.
 
Paul

-----Original Message-----
From: Alvaro Herrera <alvherre@alvh.no-ip.org> 
Sent: Thursday, January 25, 2024 1:49 AM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Nathan Bossart <nathandbossart@gmail.com>; Noah Misch <noah@leadboat.com>; Amonson, Paul D
<paul.d.amonson@intel.com>;Tom Lane <tgl@sss.pgh.pa.us>; Matthias van de Meent <boekewurm+postgres@gmail.com>;
pgsql-hackers@lists.postgresql.org
Subject: Re: Popcount optimization using AVX512

On 2024-Jan-25, Shankaran, Akash wrote:

> With the updated patch, we observed significant improvements and 
> handily beat the previous popcount algorithm performance. No 
> regressions in any scenario are observed:
> Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
> Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same 
> microbenchmark described initially in this thread.

These are great results.

However, it would be much better if the improved code were available for all relevant builds and activated if a CPUID
testdetermines that the relevant instructions are available, instead of requiring a compile-time flag -- which most
buildsare not going to use, thus wasting the opportunity for running the optimized code.
 

I suppose this would require patching pg_popcount64_choose() to be more specific.  Looking at the existing code, I
wouldalso consider renaming the "_fast" variants to something like pg_popcount32_asml/ pg_popcount64_asmq so that you
canname the new one pg_popcount64_asmdq or such.  (Or maybe leave the 32-bit version alone as "fast/slow", since
there'sno third option for that one -- or do I misread?)
 

I also think this needs to move the CFLAGS-decision-making elsewhere; asking the user to get it right is too much of a
burden. Is it workable to simply verify compiler support for the additional flags needed, and if so add them to a new
CFLAGS_BITUTILSvariable or such?  We already have the CFLAGS_CRC model that should be easy to follow.  Should be easy
enoughto mostly copy what's in configure.ac and meson.build, right?
 

Finally, the matter of using ifunc as proposed by Noah seems to be still in the air, with no patches offered for the
popcountfamily.  Given that Nathan reports [1] a performance decrease, maybe we should set that thought aside for now
andcontinue to use function pointers.  It's worth keeping in mind that popcount is already using function pointers (at
leastin the case where we try to use POPCNT directly), so patching to select between three options instead of between
twowouldn't be a regression.
 

[1] https://postgr.es/m/20231107201441.GA898662@nathanxps13

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre)

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: confusing / inefficient "need_transcoding" handling in copy
Next
From: Nathan Bossart
Date:
Subject: Re: Psql meta-command conninfo+