Thread: Changing default -march landscape

Changing default -march landscape

From
Thomas Munro
Date:
Hi,

David R and I were discussing vectorisation and microarchitectures and
what you can expect the target microarchitecture to be these days, and
it seemed like some of our choices are not really very
forward-looking.

Distros targeting x86-64 traditionally assumed the original AMD64 K8
instruction set, so if we want to use newer instructions we use
various configure or runtime checks to see if that's safe.

Recent GCC and Clang versions understand -march=x86-64-v{2,3,4}[1].
RHEL9 and similar and SUSE Tumbleweed now require x86-64-v2, and IIUC
they changed the -march default to -v2 in their build of GCC, and I
think Ubuntu has something in the works perhaps for -v3[2].

Some of our current tricks won't take proper advantage of that:
we'll still access POPCNT through a function pointer!  I was wondering
how to do it.  One idea that seems kinda weird is to run $(CC) -S
test_builtin_popcount.c, and then grep for POPCNT in
test_builtin_popcount.s!  I assume we don't want to use
__builtin_popcount() if it doesn't generate the instruction (using the
compiler flags provided or default otherwise), because on a more
conservative distro we'll use GCC/Clang's fallback code, instead of
our own runtime-checked POPCNT-instruction-through-a-function-pointer.
(Presumably we believed that to be better.)  Being able to use
__builtin_popcount() directly without any function pointer nonsense is
not only faster, but also automatically vectorisable.

That's not like the CRC32 instruction checks we have, because those
either compile or fail to compile with default compiler flags, but
__builtin_popcount() always compiles and might generate fallback code
instead of the desired instruction, so you have to inspect what it
actually emits.

FWIW Windows 11 on x86 requires the POPCNT instruction to boot.
Windows 10 EOL is October next year so we can use MSVC's intrinsic
without a function pointer if we just wait :-)

All 64-bit ARM systems have CNT, but we don't use it!  Likewise for all
modern POWER (8+) and SPARC chips that any OS can actually run on
these days.  For RISCV it's part of the bit manipulation option, but
we're already relying on that by detecting and using other
pg_bitutils.h builtins.

So I think we should probably just use the builtin directly
everywhere, except on x86 where we should either check if it generates
the instruction we want, OR, if we can determine that the modern
GCC/Clang fallback code is actually faster than our function pointer
hop, then maybe we should just always use it even there, after
checking that it exists.

[1] https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
[2] https://ubuntu.com/blog/optimising-ubuntu-performance-on-amd64-architecture



Re: Changing default -march landscape

From
Nathan Bossart
Date:
On Thu, Jun 13, 2024 at 11:11:56AM +1200, Thomas Munro wrote:
> David R and I were discussing vectorisation and microarchitectures and
> what you can expect the target microarchitecture to be these days, and
> it seemed like some of our choices are not really very
> forward-looking.
> 
> Distros targeting x86-64 traditionally assumed the original AMD64 K8
> instruction set, so if we want to use newer instructions we use
> various configure or runtime checks to see if that's safe.
> 
> Recent GCC and Clang versions understand -march=x86-64-v{2,3,4}[1].
> RHEL9 and similar and SUSE tumbleweed now require x86-64-v2, and IIUC
> they changed the -march default to -v2 in their build of GCC, and I
> think Ubuntu has something in the works perhaps for -v3[2].
> 
> Some of our current tricks won't take proper advantage of that:
> we'll still access POPCNT through a function pointer!

This is perhaps only tangentially related, but I've found it really
difficult to avoid painting ourselves into a corner with this stuff.  Let's
use the SSE 4.2 CRC32C code as an example.  Right now, if your default
compiler flags indicate support for SSE 4.2 (which I'll note can be assumed
with x86-64-v2), we'll use it unconditionally, no function pointer
required.  If additional compiler flags happen to convince the compiler to
generate SSE 4.2 code, we'll instead build both a fallback version and the
SSE version, and then we'll use a function pointer to direct to whatever we
detect is available on the CPU when the server starts.

Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
function pointer on many more x86 machines.  While that sounds great, now
we have a different problem.  If someone wants to add, say, AVX-512 support
[0], which is a much newer instruction set, we'll need to use the function
pointer again.  And we're back where we started.  We could instead just ask
folks to compile with -march=native, but then these optimizations are only
available for a subset of users until we decide the instructions are
standard enough 20 years from now.

The idea that's been floating around recently is to build a bunch of
different versions of Postgres and to choose one on startup based on what
the CPU supports.  That seems like quite a lot of work, and it'll increase
the size of the builds quite a bit, but it at least doesn't have the
aforementioned problem.

Sorry if I just rambled on about something unrelated, but your message had
enough keywords to get me thinking about this again.

[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com

-- 
nathan



Re: Changing default -march landscape

From
Thomas Munro
Date:
On Thu, Jun 13, 2024 at 1:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
> function pointer on many more x86 machines.  While that sounds great, now
> we have a different problem.  If someone wants to add, say, AVX-512 support
> [0], which is a much newer instruction set, we'll need to use the function
> pointer again.  And we're back where we started.  We could instead just ask
> folks to compile with -march=native, but then these optimizations are only
> available for a subset of users until we decide the instructions are
> standard enough 20 years from now.

The way I think about it, it's not our place to require anything (I
mean, we can't literally put -march=XXX into our build files, or if we
do the Debian et al maintainers will have to remove them by local
patch because they are in charge of what the baseline is for their
distro), but we should do the best thing possible when people DO build
with modern -march.  I would assume for example that Amazon Linux is
set up to use a default -march that targets the actual minimum
microarch level on AWS hosts.  I guess what I'm pointing out here is
that the baseline is (finally!) moving on common distributions, and
yet we've coded things in a way that doesn't benefit...

> The idea that's been floating around recently is to build a bunch of
> different versions of Postgres and to choose one on startup based on what
> the CPU supports.  That seems like quite a lot of work, and it'll increase
> the size of the builds quite a bit, but it at least doesn't have the
> aforementioned problem.

I guess another idea would be for the PGDG packagers or someone else
interested in performance to create repos with binaries built for
these microarch levels and users can research what they need.  The new
-v2 etc levels are a lot more practical than the microarch names and
individual features...



Re: Changing default -march landscape

From
Nathan Bossart
Date:
On Thu, Jun 13, 2024 at 01:20:17PM +1200, Thomas Munro wrote:
> The way I think about it, it's not our place to require anything (I
> mean, we can't literally put -march=XXX into our build files, or if we
> do the Debian et al maintainers will have to remove them by local
> patch because they are in charge of what the baseline is for their
> distro), but we should do the best thing possible when people DO build
> with modern -march.  I would assume for example that Amazon Linux is
> set up to use a default -march that targets the actual minimum
> microarch level on AWS hosts.  I guess what I'm pointing out here is
> that the baseline is (finally!) moving on common distributions, and
> yet we've coded things in a way that doesn't benefit...

That's true, but my point is that as soon as we start avoiding function
pointers more commonly, it becomes difficult to justify adding them back in
order to support new instruction sets.  Should we just compile in the SSE
4.2 version, or should we take a chance on AVX-512 with the function
pointer?

>> The idea that's been floating around recently is to build a bunch of
>> different versions of Postgres and to choose one on startup based on what
>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>> the size of the builds quite a bit, but it at least doesn't have the
>> aforementioned problem.
> 
> I guess another idea would be for the PGDG packagers or someone else
> interested in performance to create repos with binaries built for
> these microarch levels and users can research what they need.  The new
> -v2 etc levels are a lot more practical than the microarch names and
> individual features...

Heartily agreed.

-- 
nathan



Re: Changing default -march landscape

From
Peter Eisentraut
Date:
On 13.06.24 04:00, Nathan Bossart wrote:
> That's true, but my point is that as soon as we start avoiding function
> pointers more commonly, it becomes difficult to justify adding them back in
> order to support new instruction sets.  Should we just compile in the SSE
> 4.2 version, or should we take a chance on AVX-512 with the function
> pointer?
> 
>>> The idea that's been floating around recently is to build a bunch of
>>> different versions of Postgres and to choose one on startup based on what
>>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>>> the size of the builds quite a bit, but it at least doesn't have the
>>> aforementioned problem.
>>
>> I guess another idea would be for the PGDG packagers or someone else
>> interested in performance to create repos with binaries built for
>> these microarch levels and users can research what they need.  The new
>> -v2 etc levels are a lot more practical than the microarch names and
>> individual features...
> 
> Heartily agreed.

One thing that is perhaps not clear (to me?) is how much this matters 
and how much of it matters.  Obviously, we know that it matters some, 
otherwise we'd not be working on it.  But does it, like, matter only 
with checksums, or with thousands of partitions, or with many CPUs, or 
certain types of indexes, etc.?

If we're going to, say, create some recommendations for packagers around 
this, how are they supposed to determine the tradeoffs?  It might be 
easy for a packager to set some slightly-higher -march flag that is in 
line with their distro's policies, but it would probably be a lot more 
work to create separate binaries or a separate repository for, say, 
moving from SSE-something to AVX-something.  And how are they supposed 
to decide that, and how are they supposed to communicate that to their 
users?  (And how can we get different packagers to make somewhat 
consistent decisions around this?)

We have in a somewhat similar case quite clearly documented that without 
native spinlock support everything will be terrible.  And there is 
probably some information out there that without certain CPU support 
checksum performance will be terrible.  But beyond that we probably 
don't have much.




Re: Changing default -march landscape

From
Magnus Hagander
Date:
On Thu, Jun 13, 2024 at 9:41 AM Peter Eisentraut <peter@eisentraut.org> wrote:
On 13.06.24 04:00, Nathan Bossart wrote:
> That's true, but my point is that as soon as we start avoiding function
> pointers more commonly, it becomes difficult to justify adding them back in
> order to support new instruction sets.  Should we just compile in the SSE
> 4.2 version, or should we take a chance on AVX-512 with the function
> pointer?
>
>>> The idea that's been floating around recently is to build a bunch of
>>> different versions of Postgres and to choose one on startup based on what
>>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>>> the size of the builds quite a bit, but it at least doesn't have the
>>> aforementioned problem.
>>
>> I guess another idea would be for the PGDG packagers or someone else
>> interested in performance to create repos with binaries built for
>> these microarch levels and users can research what they need.  The new
>> -v2 etc levels are a lot more practical than the microarch names and
>> individual features...
>
> Heartily agreed.

One thing that is perhaps not clear (to me?) is how much this matters
and how much of it matters.  Obviously, we know that it matters some,
otherwise we'd not be working on it.  But does it, like, matter only
with checksums, or with thousands of partitions, or with many CPUs, or
certain types of indexes, etc.?

If we're going to, say, create some recommendations for packagers around
this, how are they supposed to determine the tradeoffs?  It might be
easy for a packager to set some slightly-higher -march flag that is in
line with their distro's policies, but it would probably be a lot more
work to create separate binaries or a separate repository for, say,
moving from SSE-something to AVX-something.  And how are they supposed
to decide that, and how are they supposed to communicate that to their
users?  (And how can we get different packagers to make somewhat
consistent decisions around this?)

We have in a somewhat similar case quite clearly documented that without
native spinlock support everything will be terrible.  And there is
probably some information out there that without certain CPU support
checksum performance will be terrible.  But beyond that we probably
don't have much.

Yeah, I think it's completely unreasonable to push this on packagers and just say "this is your problem now". If we do that, we can assume the only people to get any benefit from these optimizations are those that use a fully managed cloud service like azure or RDS.

They can do it, but we need to tell them how and when. And if we intend for packagers to be part of the solution we need to explicitly bring them into the discussion of how to do it, at a fairly early stage (and no, we can't expect them to follow every thread on hackers).

--

Re: Changing default -march landscape

From
Thomas Munro
Date:
On Thu, Jun 13, 2024 at 8:15 PM Magnus Hagander <magnus@hagander.net> wrote:
> Yeah, I think it's completely unreasonable to push this on packagers and just say "this is your problem now". If we
> do that, we can assume the only people to get any benefit from these optimizations are those that use a fully managed
> cloud service like azure or RDS.

It would also benefit distros that have decided to move their baseline
micro-arch level right now, probably without any additional action
from the maintainers assuming that gcc defaults to -march=*-v2 etc.
The cloud vendors' internal distros aren't really special in that
regard are they?

Hmm, among Linux distros, maybe it's really only Debian that isn't
moving or talking about moving the baseline yet?  (Are they?)

> They can do it, but we need to tell them how and when. And if we intend for packagers to be part of the solution we
> need to explicitly bring them into the discussion of how to do it, at a fairly early stage (and no, we can't expect
> them to follow every thread on hackers).

OK let me CC Christoph and ask the question this way: hypothetically,
if RHEL users' PostgreSQL packages became automatically faster than
Debian users' packages because of default -march choice in the
toolchain, what would the Debian project think about that, and what
should we do about it?  The options discussed so far were:

1.  Consider a community apt repo that is identical to the existing one
except that it targets the higher microarch levels, which users can point
a machine at if they want to.
2.  Various ideas for shipping multiple versions of the code at a
higher granularity than we're doing today (a callback for a single
instruction!  the opposite extreme being the whole executable +
libraries), probably using some of the techniques mentioned at
https://wiki.debian.org/InstructionSelection.

I would guess that 1 is about 100x easier but I haven't looked into it.



Re: Changing default -march landscape

From
Christoph Berg
Date:
Hi,

sorry for the delayed reply, I suck at prioritizing things.

Re: Thomas Munro
> OK let me CC Christoph and ask the question this way: hypothetically,
> if RHEL users' PostgreSQL packages became automatically faster than
> Debian users' packages because of default -march choice in the
> toolchain, what would the Debian project think about that, and what
> should we do about it?  The options discussed so far were:
> 
> 1.  Consider a community apt repo that is identical to the one except
> targeting the higher microarch levels, that users can point a machine
> to if they want to.

There are sub-variations of that: Don't use -march in Debian for
strict baseline compatiblity, but use -march=something in
apt.postgresql.org; bump to -march=x86-64-v2 for the server build (but
not libpq and psql) saying that PG servers need better hardware; ...

I'd rather want to avoid adding yet another axis to the matrix we
target on apt.postgresql.org, it's already complex enough. So ideally
there should be only one server build. Or if we decide it's worth
having several, extension builds should still be compatible with either.

> 2.  Various ideas for shipping multiple versions of the code at a
> higher granularity than we're doing today (a callback for a single
> instruction!  the opposite extreme being the whole executable +
> libraries), probably using some of techniques mentioned at
> https://wiki.debian.org/InstructionSelection.

I don't have enough experience with that to say anything about the
trade-offs, or if the online instruction selectors themselves add too
much overhead.

Not to mention that testing things against all instruction variants is
probably next to impossible in practice.

> I would guess that 1 is about 100x easier but I haven't looked into it.

Are there any numbers about what kind of speedup we could expect?

If the online selection isn't feasible/worthwhile, I'd be willing to
bump the baseline for the server packages. There are already packages
doing that, and there's even infrastructure in the "isa-support"
package that lets packages declare a dependency on CPU features.

Or Debian might just bump the baseline. PostgreSQL asking for it might
just be the reason we wanted to hear to make it happen.

Christoph



Re: Changing default -march landscape

From
Christoph Berg
Date:
Re: To Thomas Munro
> Or Debian might just bump the baseline. PostgreSQL asking for it might
> just be the reason we wanted to hear to make it happen.

Which level would PostgreSQL specifically want? x86-64-v2 or even
x86-64-v3?

Christoph



Re: Changing default -march landscape

From
Andres Freund
Date:
Hi,

Reply triggered by https://postgr.es/m/ZqmRYh3iikm1Kh3D%40nathan


On 2024-06-12 20:09:45 -0500, Nathan Bossart wrote:
> This is perhaps only tangentially related, but I've found it really
> difficult to avoid painting ourselves into a corner with this stuff.  Let's
> use the SSE 4.2 CRC32C code as an example.  Right now, if your default
> compiler flags indicate support for SSE 4.2 (which I'll note can be assumed
> with x86-64-v2), we'll use it unconditionally, no function pointer
> required.  If additional compiler flags happen to convince the compiler to
> generate SSE 4.2 code, we'll instead build both a fallback version and the
> SSE version, and then we'll use a function pointer to direct to whatever we
> detect is available on the CPU when the server starts.

> Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
> function pointer on many more x86 machines.  While that sounds great, now
> we have a different problem.  If someone wants to add, say, AVX-512 support
> [0], which is a much newer instruction set, we'll need to use the function
> pointer again.  And we're back where we started.  We could instead just ask
> folks to compile with -march=native, but then these optimizations are only
> available for a subset of users until we decide the instructions are
> standard enough 20 years from now.

I don't think this is that big an issue:

So much of the avx512 stuff is only available on an almost arbitrary-seeming
set of platforms, so we'll need runtime tests approximately forever.  Which
also means we need a fairly standard pattern for deciding whether the runtime
dispatch is worth the cost.  That's pretty independent of whether we e.g. want
to implement pg_popcount64 without runtime dispatch; it'd not make sense to
use avx512 for that anyway.


> The idea that's been floating around recently is to build a bunch of
> different versions of Postgres and to choose one on startup based on what
> the CPU supports.  That seems like quite a lot of work, and it'll increase
> the size of the builds quite a bit, but it at least doesn't have the
> aforementioned problem.

I still think that's a good idea - but I don't think it's gonna deal well with
the various avx512 "features". The landscape of what's supported on what CPU
is so comically complicated that there's no way to build separate versions for
everything.



We can hide most of the dispatch cost in a static inline function that only
does the runtime test if size is large enough - the size is a compile time
constant most of the time, which optimizes away the dispatch cost most of the
time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
indirect function call.


I think this should be something *roughly* like this:


/* shared between all cpu-type dependent code */

#define PGCPUCAP_INIT        (1 << 0)
#define PGCPUCAP_POPCNT        (1 << 1)
#define PGCPUCAP_VPOPCNT    (1 << 2)
#define PGCPUCAP_CRC32C        (1 << 3)
...

extern uint32 pg_cpucap;  /* possibly a bool array would be better */
extern void pg_cpucap_initialize(void);



static inline int
pg_popcount64(uint64 word)
{
    Assert(pg_cpucap & PGCPUCAP_INIT);

#if defined(HAVE_POPCNT64_COMPILETIME) || defined(HAVE_POPCNT64_RUNTIME)

#if defined(HAVE_POPCNT64_RUNTIME)
    if (pg_cpucap & PGCPUCAP_POPCNT)
#endif
    {
        return pg_popcount64_fast(word);
    }
    /* fall through if not available */
#endif

    return pg_popcount64_slow(word);
}

/* badly guessed */
#define AVX512_THRESHOLD 64

static inline uint64
pg_popcount(const char *buf, int bytes)
{
    uint64 popcnt = 0;

    Assert(pg_cpucap & PGCPUCAP_INIT);

    /*
     * Most of the times `bytes` will be a compile time constant and the
     * branches below can be optimized out.  Even if they can't, a branch or
     * three here is a lot cheaper than an indirect function call.
     */

#if defined(HAVE_AVX512_VPOPCNT_COMPILETIME) || defined(HAVE_AVX512_VPOPCNT_RUNTIME)
    if (unlikely(bytes >= AVX512_THRESHOLD))
    {
#if defined(HAVE_AVX512_VPOPCNT_RUNTIME)
        if (pg_cpucap & PGCPUCAP_VPOPCNT)
#endif
        {
            return pg_popcount_avx512(buf, bytes);
        }
        /* if not available we fall through to the unrolled implementation */
    }
#endif /* HAVE_AVX512_VPOPCNT_* */

    /* XXX: should probably be implemented in separate function */
    if (bytes > 8)
    {
        while (bytes >= 8)
        {
            uint64 word;

            /*
             * Address could be unaligned, compiler will optimize this to a
             * plain [unaligned] load on most architectures.
             */
            memcpy(&word, buf, sizeof(uint64));

            /*
             * TODO: if compilers can't hoist the pg_cpucap check
             * automatically, we should do so "manually".
             */
            popcnt += pg_popcount64(word);

            buf += sizeof(uint64);
            bytes -= sizeof(uint64);
        }
    }

    /*
     * Handle tail, we can use the 64bit version by just loading the relevant
     * portion of the data into a wider register.
     */
    if (bytes > 0)
    {
        uint64 word = 0;

        Assert(bytes < 8);

        memcpy(&word, buf, bytes);

        popcnt += pg_popcount64(word);
    }

    return popcnt;
}


Greetings,

Andres Freund



Re: Changing default -march landscape

From
Nathan Bossart
Date:
On Tue, Jul 30, 2024 at 07:39:18PM -0700, Andres Freund wrote:
> On 2024-06-12 20:09:45 -0500, Nathan Bossart wrote:
>> The idea that's been floating around recently is to build a bunch of
>> different versions of Postgres and to choose one on startup based on what
>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>> the size of the builds quite a bit, but it at least doesn't have the
>> aforementioned problem.
> 
> I still think that's a good idea - but I don't think it's gonna deal well with
> the various avx512 "features". The landscape of what's supported on what CPU
> is so comically complicated that there's no way to build separate versions for
> everything.

Good point.

> We can hide most of the dispatch cost in a static inline function that only
> does the runtime test if size is large enough - the size is a compile time
> constant most of the time, which optimizes away the dispatch cost most of the
> time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
> indirect function call.

I ended up doing precisely this for pg_popcount()/pg_popcount_masked(),
although not quite as sophisticated as what you propose below.  I'll look
into expanding on this strategy in v18.

-- 
nathan



Re: Changing default -march landscape

From
Andres Freund
Date:
Hi,

On 2024-07-30 21:59:44 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 07:39:18PM -0700, Andres Freund wrote:
> > We can hide most of the dispatch cost in a static inline function that only
> > does the runtime test if size is large enough - the size is a compile time
> > constant most of the time, which optimizes away the dispatch cost most of the
> > time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
> > indirect function call.
> 
> I ended up doing precisely this for pg_popcount()/pg_popcount_masked(),
> although not quite as sophisticated as what you propose below.  I'll look
> into expanding on this strategy in v18.

I think you subsume that under "not quite as sophisticated", but just to make
it clear, the most important bits are to:

a) do the dispatch in a header, without an indirect function call

b) implement the intrinsic using stuff in a header if it takes a size argument
   or such, because that allows the compiler to optimize away size checks in
   the very common case of such an argument being a compile time constant.

Greetings,

Andres Freund