Thread: autovectorize page checksum code included elsewhere

autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
(Unfortunately, I'm posting this too late for the November commitfest, but
I'm hoping this will be the first in a series of proposed improvements
involving SIMD instructions for v17.)

Presently, we ask compilers to autovectorize checksum.c and numeric.c.  The
page checksum code actually lives in checksum_impl.h, and checksum.c just
includes it.  But checksum_impl.h is also used in pg_upgrade/file.c and
pg_checksums.c, and since we don't ask compilers to autovectorize those
files, the page checksum code may remain un-vectorized.

The attached patch is a quick attempt at adding CFLAGS_UNROLL_LOOPS and
CFLAGS_VECTORIZE to the CFLAGS for the aforementioned objects.  The gains
are modest (i.e., some system CPU and/or a few percentage points on the
total time), but it seemed like a no-brainer.

Separately, I'm wondering whether we should consider using CFLAGS_VECTORIZE
on the whole tree.  Commit fdea253 seems to be responsible for introducing
this targeted autovectorization strategy, and AFAICT this was just done to
minimize the impact elsewhere while optimizing page checksums.  Are there
fundamental problems with adding CFLAGS_VECTORIZE everywhere?  Or is it
just waiting on someone to do the analysis/benchmarking?

[0] https://postgr.es/m/1367013190.11576.249.camel%40sussancws0025

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Separately, I'm wondering whether we should consider using CFLAGS_VECTORIZE
> on the whole tree.  Commit fdea253 seems to be responsible for introducing
> this targeted autovectorization strategy, and AFAICT this was just done to
> minimize the impact elsewhere while optimizing page checksums.  Are there
> fundamental problems with adding CFLAGS_VECTORIZE everywhere?  Or is it
> just waiting on someone to do the analysis/benchmarking?

It's already the default for gcc 12 with -O2 (looking further in the
docs, it uses the "very-cheap" vectorization cost model), so it may be
worth investigating what the effect of that was. I can't quickly find
the equivalent info for clang.

That being the case, if the difference you found was real, it must
have been due to unrolling loops. What changed in the binary?

https://gcc.gnu.org/gcc-12/changes.html
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
On Sat, Nov 11, 2023 at 07:38:59PM +0700, John Naylor wrote:
> On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Separately, I'm wondering whether we should consider using CFLAGS_VECTORIZE
>> on the whole tree.  Commit fdea253 seems to be responsible for introducing
>> this targeted autovectorization strategy, and AFAICT this was just done to
>> minimize the impact elsewhere while optimizing page checksums.  Are there
>> fundamental problems with adding CFLAGS_VECTORIZE everywhere?  Or is it
>> just waiting on someone to do the analysis/benchmarking?
> 
> It's already the default for gcc 12 with -O2 (looking further in the
> docs, it uses the "very-cheap" vectorization cost model), so it may be
> worth investigating what the effect of that was. I can't quickly find
> the equivalent info for clang.

My x86 machine is using gcc 9.4.0, which isn't even aware of "very-cheap".
I don't see any difference with any of the cost models, though.  It isn't
until I add -O3 that I see things like inlining pg_checksum_block into
pg_checksum_page.  -O3 is generating far more SSE2 instructions, too.

I'll have to check whether gcc 12 is finding anything else within Postgres
to autovectorize with it's "very-cheap" cost model...

> That being the case, if the difference you found was real, it must
> have been due to unrolling loops. What changed in the binary?

For gcc 9.4.0 on x86, the autovectorization flag alone indeed makes no
difference, while the loop unrolling one does.  For Apple clang 14.0.0 on
an M2, both flags seem to generate very different machine code.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: autovectorize page checksum code included elsewhere

From
Andres Freund
Date:
Hi,

On 2023-11-06 20:47:34 -0600, Nathan Bossart wrote:
> Separately, I'm wondering whether we should consider using CFLAGS_VECTORIZE
> on the whole tree.  Commit fdea253 seems to be responsible for introducing
> this targeted autovectorization strategy, and AFAICT this was just done to
> minimize the impact elsewhere while optimizing page checksums.  Are there
> fundamental problems with adding CFLAGS_VECTORIZE everywhere?  Or is it
> just waiting on someone to do the analysis/benchmarking?

Historically sometimes vectorization ended up hurting in a bunch of
places. But I think that was in the gcc 4 era, which long has
passed.

IME these days using -O3 yields decent improvements over -O2 when used tree
wide - even if there are perhaps a few isolated cases where the code is a bit
worse, they're far outweighed by the improved code.

Compile time wise it's noticeably slower, but not catastrophically so. On an
older but decent laptop, while on battery:

O2:
800.29user 41.99system 0:59.17elapsed 1423%CPU (0avgtext+0avgdata 282324maxresident)k
152inputs+4408176outputs (95major+13359282minor)pagefaults 0swaps

O3:
911.80user 44.71system 1:06.79elapsed 1431%CPU (0avgtext+0avgdata 278660maxresident)k
82624inputs+4571480outputs (571major+14004898minor)pagefaults 0swaps

Greetings,

Andres Freund



Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Presently, we ask compilers to autovectorize checksum.c and numeric.c.  The
> page checksum code actually lives in checksum_impl.h, and checksum.c just
> includes it.  But checksum_impl.h is also used in pg_upgrade/file.c and
> pg_checksums.c, and since we don't ask compilers to autovectorize those
> files, the page checksum code may remain un-vectorized.

Poking in those files a bit, I also see references to building with
SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
function call is surely worth it for page-sized data)



Re: autovectorize page checksum code included elsewhere

From
Ants Aasma
Date:
On Wed, 22 Nov 2023 at 11:44, John Naylor <johncnaylorls@gmail.com> wrote:
>
> On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >
> > Presently, we ask compilers to autovectorize checksum.c and numeric.c.  The
> > page checksum code actually lives in checksum_impl.h, and checksum.c just
> > includes it.  But checksum_impl.h is also used in pg_upgrade/file.c and
> > pg_checksums.c, and since we don't ask compilers to autovectorize those
> > files, the page checksum code may remain un-vectorized.
>
> Poking in those files a bit, I also see references to building with
> SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
> function call is surely worth it for page-sized data)

For reference, executing the page checksum 10M times on a AMD 3900X CPU:

clang-14 -O2                 4.292s (17.8 GiB/s)
clang-14 -O2 -msse4.1        2.859s (26.7 GiB/s)
clang-14 -O2 -msse4.1 -mavx2 1.378s (55.4 GiB/s)

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
> On Wed, 22 Nov 2023 at 11:44, John Naylor <johncnaylorls@gmail.com> wrote:
>> Poking in those files a bit, I also see references to building with
>> SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
>> function call is surely worth it for page-sized data)

Yes, I think we should, but we also need to be careful not to hurt
performance on platforms that aren't able to benefit [0] [1].

There are a couple of other threads about adding support for newer
instructions [2] [3], and properly detecting the availability of these
instructions seems to be a common obstacle.  We have a path forward for
stuff that's already using a runtime check (e.g., CRC32C), but I think
we're still trying to figure out what to do for things that must be inlined
(e.g., simd.h).

One half-formed idea I have is to introduce some sort of ./configure flag
that enables all the newer instructions that your CPU understands.  It
would also remove any existing runtime checks.  This option would make it
easy to take advantage of the newer instructions if you are building
Postgres for only your machine (or others just like it).

> For reference, executing the page checksum 10M times on a AMD 3900X CPU:
> 
> clang-14 -O2                 4.292s (17.8 GiB/s)
> clang-14 -O2 -msse4.1        2.859s (26.7 GiB/s)
> clang-14 -O2 -msse4.1 -mavx2 1.378s (55.4 GiB/s)

Nice.  I've noticed similar improvements with AVX2 intrinsics in simd.h.

[0] https://postgr.es/m/2613682.1698779776%40sss.pgh.pa.us
[1] https://postgr.es/m/36329.1699325578%40sss.pgh.pa.us
[2] https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
[3] https://postgr.es/m/DB9PR08MB6991329A73923BF8ED4B3422F5DBA@DB9PR08MB6991.eurprd08.prod.outlook.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Thu, Nov 23, 2023 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
> > On Wed, 22 Nov 2023 at 11:44, John Naylor <johncnaylorls@gmail.com> wrote:
> >> Poking in those files a bit, I also see references to building with
> >> SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
> >> function call is surely worth it for page-sized data)
>
> Yes, I think we should, but we also need to be careful not to hurt
> performance on platforms that aren't able to benefit [0] [1].

Well, yes (see popcount using a direct function call on non-x86), but
I don't think it's as important for page-sized data. Also, sse4.1 is
~10 years old, I think.

> There are a couple of other threads about adding support for newer
> instructions [2] [3], and properly detecting the availability of these
> instructions seems to be a common obstacle.  We have a path forward for
> stuff that's already using a runtime check (e.g., CRC32C), but I think
> we're still trying to figure out what to do for things that must be inlined
> (e.g., simd.h).
>
> One half-formed idea I have is to introduce some sort of ./configure flag
> that enables all the newer instructions that your CPU understands.

That's not doable, but we should consider taking advantage of
x86-64-v2, which RedHat 9 builds with. That would allow inlining CRC
and popcount there. Not sure how to detect that easily.



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
On Thu, Nov 23, 2023 at 05:50:48PM +0700, John Naylor wrote:
> On Thu, Nov 23, 2023 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> One half-formed idea I have is to introduce some sort of ./configure flag
>> that enables all the newer instructions that your CPU understands.
> 
> That's not doable,

It's not?

> but we should consider taking advantage of
> x86-64-v2, which RedHat 9 builds with. That would allow inlining CRC
> and popcount there.

Maybe we have something like --with-isa-level where you can specify
x86-64-v[1-4] or "auto" to mean "build whatever the current machine can
handle."  I can imagine packagers requiring v2 these days (it's probably
worth asking them), and that would not only allow compiling in SSE 4.2 on
many more machines, but it would also open up a path to supporting
AVX2/AVX512 and beyond.

> Not sure how to detect that easily.

I briefly looked around and didn't see a portable way to do so.  We might
have to exhaustively check the features, which doesn't seem like it'd be
too bad for x86_64, but I haven't looked too closely at other
architectures.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Thu, Nov 23, 2023 at 11:51 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Nov 23, 2023 at 05:50:48PM +0700, John Naylor wrote:
> > On Thu, Nov 23, 2023 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> One half-formed idea I have is to introduce some sort of ./configure flag
> >> that enables all the newer instructions that your CPU understands.
> >
> > That's not doable,
>
> It's not?

What exactly would our build system do differently with e.g.
"-march=native" (which is already possible for people who compile
their own binaries, via CFLAGS), if I understand you correctly?

> > but we should consider taking advantage of
> > x86-64-v2, which RedHat 9 builds with. That would allow inlining CRC
> > and popcount there.
>
> Maybe we have something like --with-isa-level where you can specify
> x86-64-v[1-4] or "auto" to mean "build whatever the current machine can
> handle."

That could work, but with the below OS's, it should work
automatically. Also, we may not be many years off from the day we can
move our baseline as well, such that older buildfarm members (if we
have any) would need to pass --disable-isa-extensions, but that may be
pushing things too much for too little benefit. *

> I can imagine packagers requiring v2 these days (it's probably
> worth asking them), and that would not only allow compiling in SSE 4.2 on
> many more machines, but it would also open up a path to supporting
> AVX2/AVX512 and beyond.

A brief look found these OS's are moving / have moved to x86-64-v2:

Redhat 9 [1][2]
OpenSuse ALP [3]
OpenSuse Tumbleweed [4]

Debian considers it a bug if the package fails to build with
x86-64-v2, but they haven't changed their baseline requirement. [5]

> > Not sure how to detect that easily.
>
> I briefly looked around and didn't see a portable way to do so.  We might
> have to exhaustively check the features, which doesn't seem like it'd be
> too bad for x86_64, but I haven't looked too closely at other
> architectures.

Sorry, I wasn't clear, I mean: detect that a packager passed
"-march=x86_64-v2" in the CFLAGS, so that a symbol in header files
would cause inlining of functions containing certain intrinsics.
Exhaustively checking features defeats the purpose of having an
industry-standard shorthand, and says nothing about what the package
does or does not require of the target machine.

* Note: I have seen the threads with the idea of compiling multiple
entire binaries, and switching at postmaster start. I think it's a
good idea, but I also suspect detecting flags from the packager is an
easier intermediate step.

[1]
https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level
[2] https://access.redhat.com/solutions/6833751
[3] https://news.opensuse.org/2022/09/26/alp-architecture-baselevel-x86_64-v2/
[4] https://news.opensuse.org/2022/11/28/tw-to-roll-out-mitigation-plan-advance-microarchitecture/
[5] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=983926



Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Thu, Nov 23, 2023 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
> > For reference, executing the page checksum 10M times on a AMD 3900X CPU:
> >
> > clang-14 -O2                 4.292s (17.8 GiB/s)
> > clang-14 -O2 -msse4.1        2.859s (26.7 GiB/s)
> > clang-14 -O2 -msse4.1 -mavx2 1.378s (55.4 GiB/s)
>
> Nice.  I've noticed similar improvements with AVX2 intrinsics in simd.h.

If you're thinking to support AVX2 anywhere, I'd start with checksum
first. Much less code to review, and less risk.



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
On Sat, Nov 25, 2023 at 02:09:14PM +0700, John Naylor wrote:
> On Thu, Nov 23, 2023 at 11:51 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>>
>> On Thu, Nov 23, 2023 at 05:50:48PM +0700, John Naylor wrote:
>> > On Thu, Nov 23, 2023 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> >> One half-formed idea I have is to introduce some sort of ./configure flag
>> >> that enables all the newer instructions that your CPU understands.
>> >
>> > That's not doable,
>>
>> It's not?
> 
> What exactly would our build system do differently with e.g.
> "-march=native" (which is already possible for people who compile
> their own binaries, via CFLAGS), if I understand you correctly?

It would probably just be an easier way of doing that than adjusting COPT
in src/Makefile.custom.

>> Maybe we have something like --with-isa-level where you can specify
>> x86-64-v[1-4] or "auto" to mean "build whatever the current machine can
>> handle."
> 
> That could work, but with the below OS's, it should work
> automatically. Also, we may not be many years off from the day we can
> move our baseline as well, such that older buildfarm members (if we
> have any) would need to pass --disable-isa-extensions, but that may be
> pushing things too much for too little benefit. *

You are probably right.  I guess I'm wondering whether we need to make all
this configurable.  Maybe we could get away with moving our baseline to v2
soon, but if we'd also like to start adding AVX2 enhancements (and I think
we will), I'm assuming we'll want to provide an easy way for users to
declare that they are building for v3+ CPUs.

>> > Not sure how to detect that easily.
>>
>> I briefly looked around and didn't see a portable way to do so.  We might
>> have to exhaustively check the features, which doesn't seem like it'd be
>> too bad for x86_64, but I haven't looked too closely at other
>> architectures.
> 
> Sorry, I wasn't clear, I mean: detect that a packager passed
> "-march=x86_64-v2" in the CFLAGS, so that a symbol in header files
> would cause inlining of functions containing certain intrinsics.
> Exhaustively checking features defeats the purpose of having an
> industry-standard shorthand, and says nothing about what the package
> does or does not require of the target machine.

I'm not sure why I thought checking each feature might be necessary.
--with-isa-level could essentially just be an alias for adding all the
CFLAGS for the extensions provided at that level, and --with-isa-level=auto
would just mean -march=native.  With those flags set, the ./configure
checks would succeed with the base set of compiler flags passed in, which
could be used as a heuristic for inlining (like CRC32C does today).

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
On Mon, Nov 27, 2023 at 03:21:53PM -0600, Nathan Bossart wrote:
> On Sat, Nov 25, 2023 at 02:09:14PM +0700, John Naylor wrote:
>> Sorry, I wasn't clear, I mean: detect that a packager passed
>> "-march=x86_64-v2" in the CFLAGS, so that a symbol in header files
>> would cause inlining of functions containing certain intrinsics.
>> Exhaustively checking features defeats the purpose of having an
>> industry-standard shorthand, and says nothing about what the package
>> does or does not require of the target machine.
> 
> I'm not sure why I thought checking each feature might be necessary.
> --with-isa-level could essentially just be an alias for adding all the
> CFLAGS for the extensions provided at that level, and --with-isa-level=auto
> would just mean -march=native.  With those flags set, the ./configure
> checks would succeed with the base set of compiler flags passed in, which
> could be used as a heuristic for inlining (like CRC32C does today).

Or, perhaps you mean removing those ./configure checks completely and
assuming that the compiler knows about the intrinsics required for the
specified ISA level...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: autovectorize page checksum code included elsewhere

From
Andres Freund
Date:
Hi,

On 2023-11-25 14:09:14 +0700, John Naylor wrote:
> * Note: I have seen the threads with the idea of compiling multiple
> entire binaries, and switching at postmaster start. I think it's a
> good idea, but I also suspect detecting flags from the packager is an
> easier intermediate step.

It's certainly an easier incremental step - but will it get us all that far?
Other architectures have similar issues, e.g. ARM with ARMv8.1-A having much
more scalable atomic instructions than plain ARMv8.0. And even on x86-64,
relying on distros to roll out new minimum x86_64-v2 doesn't get us that far -
we'd be up to a uarch from ~2010. And I doubt that they'll raise the bar to v4
in the next few years, so we'll leave years of improvements on the table. And
even PG specific repos can't just roll out a hard requirement for a new, more
modern, uarch that will crash some time after startup on some hardware.

Greetings,

Andres Freund



Re: autovectorize page checksum code included elsewhere

From
John Naylor
Date:
On Tue, Nov 28, 2023 at 7:51 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2023-11-25 14:09:14 +0700, John Naylor wrote:
> > * Note: I have seen the threads with the idea of compiling multiple
> > entire binaries, and switching at postmaster start. I think it's a
> > good idea, but I also suspect detecting flags from the packager is an
> > easier intermediate step.
>
> It's certainly an easier incremental step - but will it get us all that far?

Granted, not much.

(TBH, I'm not sure how to design the multiple-binaries idea, but
surely we're not the first project to think of this...)

On Tue, Nov 28, 2023 at 4:21 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> soon, but if we'd also like to start adding AVX2 enhancements (and I think
> we will), I'm assuming we'll want to provide an easy way for users to
> declare that they are building for v3+ CPUs.

Yeah, I remember now I saw instruction selection change a single shift
with -v3 which made the UTF-8 DFA significantly faster:

https://www.postgresql.org/message-id/CAFBsxsHR08mHEf06PvrMRstfcyPJLwF69g0r1pvRrxWD4GEVoQ%40mail.gmail.com

I imagine a number of places would get automatic improvements, and I
think others have said the same thing.

On Tue, Nov 28, 2023 at 5:26 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > I'm not sure why I thought checking each feature might be necessary.
> > --with-isa-level could essentially just be an alias for adding all the
> > CFLAGS for the extensions provided at that level, and --with-isa-level=auto
> > would just mean -march=native.  With those flags set, the ./configure
> > checks would succeed with the base set of compiler flags passed in, which
> > could be used as a heuristic for inlining (like CRC32C does today).
>
> Or, perhaps you mean removing those ./configure checks completely and
> assuming that the compiler knows about the intrinsics required for the
> specified ISA level...

With the multiple-binaries, we might be able to assume, since it'll be
opt-in (default is just use the baseline), but I'm not sure. And to
avoid needing a newish compiler, than we could probably just use the
equivalent, e.g. for -v2:

-march=x86-64 -mmmx -msse -msse2 -mfxsr -msahf -mcx16 -mpopcnt -msse3
-msse4.1 -msse4.2 -mssse3

...and it seems we'll want to make something up for Arm.



Re: autovectorize page checksum code included elsewhere

From
Nathan Bossart
Date:
I don't think anything discussed in this thread is ready for v17, so I am
going to punt it to v18.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com