Thread: use ARM intrinsics in pg_lfind32() where available

use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
Hi hackers,

This is a follow-up for recent changes that optimized [sub]xip lookups in
XidInMVCCSnapshot() on Intel hardware [0] [1].  I've attached a patch that
uses ARM Advanced SIMD (Neon) intrinsic functions where available to speed
up the search.  The approach is nearly identical to the SSE2 version, and
the usual benchmark [2] shows similar improvements.

  writers  head  simd
  8        866   836
  16       849   833
  32       782   822
  64       846   833
  128      805   821
  256      722   739
  512      529   674
  768      374   608
  1024     268   522

I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
(Graviton2), and I've confirmed that the instructions aren't used on a
Linux/Intel machine.  I did add a new configure check to see if the
relevant intrinsics are available, but I didn't add a runtime check like
there is for the CRC instructions since the compilers I used support these
intrinsics by default.  (I don't think a runtime check would work very well
with the inline function, anyway.)  AFAICT these intrinsics are pretty
standard on aarch64, although IIUC the spec indicates that they are
technically optional.  I suspect that a simple check for "aarch64" would be
sufficient, but I haven't investigated the level of compiler support yet.

Thoughts?

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b6ef167
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=37a6e5d
[2] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
Andres Freund
Date:
Hi,

On 2022-08-19 13:08:29 -0700, Nathan Bossart wrote:
> I've tested the patch on a recent macOS (M1 Pro) and Amazon Linux
> (Graviton2), and I've confirmed that the instructions aren't used on a
> Linux/Intel machine.  I did add a new configure check to see if the
> relevant intrinsics are available, but I didn't add a runtime check like
> there is for the CRC instructions since the compilers I used support these
> intrinsics by default.  (I don't think a runtime check would work very well
> with the inline function, anyway.)  AFAICT these intrinsics are pretty
> standard on aarch64, although IIUC the spec indicates that they are
> technically optional.  I suspect that a simple check for "aarch64" would be
> sufficient, but I haven't investigated the level of compiler support yet.

Are you sure there's not an appropriate define for us to use here instead of a
configure test? E.g.

echo|cc -dM -P -E -|grep -iE 'arm|aarch'
...
#define __AARCH64_SIMD__ 1
...
#define __ARM_NEON 1
#define __ARM_NEON_FP 0xE
#define __ARM_NEON__ 1
..

I strikes me as non-scalable to explicitly test all the simd instructions we'd
use.


The story for the CRC checks is different because those instructions often
aren't available with the default compilation flags and aren't guaranteed to
be available at runtime.

Regards,

Andres



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:
> Are you sure there's not an appropriate define for us to use here instead of a
> configure test? E.g.
> 
> echo|cc -dM -P -E -|grep -iE 'arm|aarch'
> ...
> #define __AARCH64_SIMD__ 1
> ...
> #define __ARM_NEON 1
> #define __ARM_NEON_FP 0xE
> #define __ARM_NEON__ 1
> ..
> 
> I strikes me as non-scalable to explicitly test all the simd instructions we'd
> use.

Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
define __ARM_NEON, so here is a patch that uses that instead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Aug 19, 2022 at 02:26:02PM -0700, Andres Freund wrote:
> > Are you sure there's not an appropriate define for us to use here instead of a
> > configure test? E.g.
> >
> > echo|cc -dM -P -E -|grep -iE 'arm|aarch'
> > ...
> > #define __AARCH64_SIMD__ 1
> > ...
> > #define __ARM_NEON 1
> > #define __ARM_NEON_FP 0xE
> > #define __ARM_NEON__ 1
> > ..
> >
> > I strikes me as non-scalable to explicitly test all the simd instructions we'd
> > use.
>
> Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
> define __ARM_NEON, so here is a patch that uses that instead.

Is this also ever defined on 32-bit? If so, is it safe, meaning the
compiler will not emit these instructions without additional flags?
I'm wondering if  __aarch64__ would be clearer on that, and if we get
windows-on-arm support as has been proposed, could also add _M_ARM64.

I also see #if defined(__aarch64__) || defined(__aarch64) in our
codebase already, but I'm not sure what recognizes the latter.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:
> On Sat, Aug 20, 2022 at 5:28 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Thanks for the pointer.  GCC, Clang, and the Arm compiler all seem to
>> define __ARM_NEON, so here is a patch that uses that instead.
> 
> Is this also ever defined on 32-bit? If so, is it safe, meaning the
> compiler will not emit these instructions without additional flags?
> I'm wondering if  __aarch64__ would be clearer on that, and if we get
> windows-on-arm support as has been proposed, could also add _M_ARM64.

I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
possible, we should probably add an __aarch64__ check since functions like
vmaxvq_u32() do not appear to be available on 32-bit.  I have been able to
compile for __aarch64__ without __ARM_NEON, so it might still be a good
idea to check for __ARM_NEON.  So, to be safe, perhaps we should use
something like the following:

    #if (defined(__aarch64__) || defined(__aarch64)) && defined(__ARM_NEON)

> I also see #if defined(__aarch64__) || defined(__aarch64) in our
> codebase already, but I'm not sure what recognizes the latter.

I'm not sure what uses the latter, either.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Tue, Aug 23, 2022 at 4:15 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 11:50:35AM +0700, John Naylor wrote:

> > Is this also ever defined on 32-bit? If so, is it safe, meaning the
> > compiler will not emit these instructions without additional flags?
> > I'm wondering if  __aarch64__ would be clearer on that, and if we get
> > windows-on-arm support as has been proposed, could also add _M_ARM64.
>
> I haven't been able to enable __ARM_NEON on 32-bit, but if it is somehow
> possible, we should probably add an __aarch64__ check since functions like
> vmaxvq_u32() do not appear to be available on 32-bit.  I have been able to
> compile for __aarch64__ without __ARM_NEON, so it might still be a good
> idea to check for __ARM_NEON.

The important thing is: if we compile with __aarch64__ as a target:
- Will the compiler emit the intended instructions from the intrinsics
without extra flags?
- Can a user on ARM64 ever get a runtime fault if the machine attempts
to execute NEON instructions? "I have been able to compile for
__aarch64__ without __ARM_NEON" doesn't really answer that question --
what exactly did this entail?

> > I also see #if defined(__aarch64__) || defined(__aarch64) in our
> > codebase already, but I'm not sure what recognizes the latter.
>
> I'm not sure what uses the latter, either.

I took a quick look around at Debian code search, *BSD, Apple, and a
few other places, and I can't find it. Then, I looked at the
discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
support to s_lock.h", and the proposed patch [1] only had __aarch64__
. When it was committed, the platform was vaporware and I suppose we
included "__aarch64" as a prophylactic measure because no other reason
was given. It doesn't seem to exist anywhere, so unless someone can
demonstrate otherwise, I'm going to rip it out soon.

[1] https://www.postgresql.org/message-id/flat/1368448758.23422.12.camel%40t520.redhat.com

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> The important thing is: if we compile with __aarch64__ as a target:
> - Will the compiler emit the intended instructions from the intrinsics
> without extra flags?

My testing with GCC and Clang did not require any extra flags.  GCC appears
to enable it by default for aarch64 [0].  AFAICT this is the case for Clang
as well, but that is based on the code and my testing (I couldn't find any
documentation for this).

> - Can a user on ARM64 ever get a runtime fault if the machine attempts
> to execute NEON instructions?

IIUC yes, although I'm not sure how likely it is in practice.

> "I have been able to compile for
> __aarch64__ without __ARM_NEON" doesn't really answer that question --
> what exactly did this entail?

Compiling with something like -march=armv8-a+nosimd prevents defining
__ARM_NEON.  Interestingly, Clang still defines __ARM_NEON__ even when
+nosimd is specified.

> I took a quick look around at Debian code search, *BSD, Apple, and a
> few other places, and I can't find it. Then, I looked at the
> discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
> support to s_lock.h", and the proposed patch [1] only had __aarch64__
> . When it was committed, the platform was vaporware and I suppose we
> included "__aarch64" as a prophylactic measure because no other reason
> was given. It doesn't seem to exist anywhere, so unless someone can
> demonstrate otherwise, I'm going to rip it out soon.

This is what I found, too, so +1.  I've attached a patch for this.

[0] https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> > The important thing is: if we compile with __aarch64__ as a target:
> > - Will the compiler emit the intended instructions from the intrinsics
> > without extra flags?
>
> My testing with GCC and Clang did not require any extra flags.  GCC appears
> to enable it by default for aarch64 [0].  AFAICT this is the case for Clang
> as well, but that is based on the code and my testing (I couldn't find any
> documentation for this).

I guess you meant this part: "‘simd’ Enable Advanced SIMD
instructions. This also enables floating-point instructions. This is
on by default for all possible values for options -march and -mcpu."

> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
> > to execute NEON instructions?
>
> IIUC yes, although I'm not sure how likely it is in practice.

Given the quoted part above, it doesn't seem likely, but we should try
to find out for sure, because a runtime fault is surely not acceptable
even on a toy system.

> > "I have been able to compile for
> > __aarch64__ without __ARM_NEON" doesn't really answer that question --
> > what exactly did this entail?
>
> Compiling with something like -march=armv8-a+nosimd prevents defining
> __ARM_NEON.

Okay, that's unsurprising.

> Interestingly, Clang still defines __ARM_NEON__ even when
> +nosimd is specified.

POLA violation, but if no one has complained to them, it's a good bet
the instructions are always available.

> > I took a quick look around at Debian code search, *BSD, Apple, and a
> > few other places, and I can't find it. Then, I looked at the
> > discussions around commit 5c7603c318872a42e "Add ARM64 (aarch64)
> > support to s_lock.h", and the proposed patch [1] only had __aarch64__
> > . When it was committed, the platform was vaporware and I suppose we
> > included "__aarch64" as a prophylactic measure because no other reason
> > was given. It doesn't seem to exist anywhere, so unless someone can
> > demonstrate otherwise, I'm going to rip it out soon.
>
> This is what I found, too, so +1.  I've attached a patch for this.

Thanks, I'll push this soon. I wondered if the same reasoning applies
to __arm__ / __arm nowadays, but a quick search does indicate that
__arm exists (existed?), at least.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:
> On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
>> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
>> > to execute NEON instructions?
>>
>> IIUC yes, although I'm not sure how likely it is in practice.
> 
> Given the quoted part above, it doesn't seem likely, but we should try
> to find out for sure, because a runtime fault is surely not acceptable
> even on a toy system.

The ARM literature appears to indicate that Neon support is pretty standard
on aarch64, and AFAICT it's pretty common to just assume it's available.
As originally suspected, I believe that simply checking for __aarch64__
would be sufficient, but I don't think it would be unreasonable to also
check for __ARM_NEON to be safe.

>> Interestingly, Clang still defines __ARM_NEON__ even when
>> +nosimd is specified.
> 
> POLA violation, but if no one has complained to them, it's a good bet
> the instructions are always available.

Sorry, I should've been more specific.  In my testing, I could include or
omit __ARM_NEON using +[no]simd, but __ARM_NEON__ (with two underscores at
the end) was always there.  My brief research seems to indicate this might
be unique to Darwin, but in the end, it looks like __ARM_NEON (without the
trailing underscores) is the most widely used.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 10:38:34AM +0700, John Naylor wrote:
> > On Thu, Aug 25, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> On Wed, Aug 24, 2022 at 11:07:03AM +0700, John Naylor wrote:
> >> > - Can a user on ARM64 ever get a runtime fault if the machine attempts
> >> > to execute NEON instructions?
> >>
> >> IIUC yes, although I'm not sure how likely it is in practice.
> >
> > Given the quoted part above, it doesn't seem likely, but we should try
> > to find out for sure, because a runtime fault is surely not acceptable
> > even on a toy system.
>
> The ARM literature appears to indicate that Neon support is pretty standard
> on aarch64, and AFAICT it's pretty common to just assume it's available.

This doesn't exactly rise to the level of "find out for sure", so I
went looking myself. This is the language I found [1]:

"Both floating-point and NEON are required in all standard ARMv8
implementations. However, implementations targeting specialized
markets may support the following combinations:

No NEON or floating-point.
Full floating-point and SIMD support with exception trapping.
Full floating-point and SIMD support without exception trapping."

Since we assume floating-point, I see no reason not to assume NEON,
but a case could be made for documenting that we require NEON on
aarch64, in addition to exception trapping (for CRC runtime check) and
floating point on any Arm. Or even just say "standard". I don't
believe anyone will want to run Postgres on specialized hardware
lacking these features, so maybe it's a moot point.

[1] https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Fri, Aug 26, 2022 at 10:45:10AM +0700, John Naylor wrote:
> On Thu, Aug 25, 2022 at 11:57 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> The ARM literature appears to indicate that Neon support is pretty standard
>> on aarch64, and AFAICT it's pretty common to just assume it's available.
> 
> This doesn't exactly rise to the level of "find out for sure", so I
> went looking myself. This is the language I found [1]:
> 
> "Both floating-point and NEON are required in all standard ARMv8
> implementations. However, implementations targeting specialized
> markets may support the following combinations:
> 
> No NEON or floating-point.
> Full floating-point and SIMD support with exception trapping.
> Full floating-point and SIMD support without exception trapping."

Sorry, I should've linked to the documentation I found.  I saw similar
language in a couple of manuals, which is what led me to the conclusion
that Neon support is relatively standard.

> Since we assume floating-point, I see no reason not to assume NEON,
> but a case could be made for documenting that we require NEON on
> aarch64, in addition to exception trapping (for CRC runtime check) and
> floating point on any Arm. Or even just say "standard". I don't
> believe anyone will want to run Postgres on specialized hardware
> lacking these features, so maybe it's a moot point.

I'm okay with assuming Neon support for now.  It's probably easier to add
the __ARM_NEON check if/when someone complains than it is to justify
removing it once it's there.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Thu, Aug 25, 2022 at 11:13:47PM -0700, Nathan Bossart wrote:
> Here is a new patch set that applies on top of v9-0001 in the
> json_lex_string patch set [0] and v3 of the is_valid_ascii patch [1].

Here is a rebased patch set that applies to HEAD.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Sat, Aug 27, 2022 at 1:24 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Here is a rebased patch set that applies to HEAD.

0001:

 #define USE_NO_SIMD
 typedef uint64 Vector8;
+typedef uint64 Vector32;
 #endif

I don't forsee any use of emulating vector registers with uint64 if
they only hold two ints. I wonder if it'd be better if all vector32
functions were guarded with #ifndef NO_USE_SIMD. (I wonder if
declarations without definitions cause warnings...)

+ * NB: This function assumes that each lane in the given vector either has all
+ * bits set or all bits zeroed, as it is mainly intended for use with
+ * operations that produce such vectors (e.g., vector32_eq()).  If this
+ * assumption is not true, this function's behavior is undefined.
+ */

Hmm?

Also, is_highbit_set() already has uses same intrinsic and has the
same intended effect, since we only care about the boolean result.

0002:

-#elif defined(USE_SSE2)
+#elif defined(USE_SSE2) || defined(USE_NEON)

I think we can just say #else.

-#if defined(USE_SSE2)
- __m128i sub;
+#ifndef USE_NO_SIMD
+ Vector8 sub;

+#elif defined(USE_NEON)
+
+ /* use the same approach as the USE_SSE2 block above */
+ sub = vqsubq_u8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);

I think we should invent a helper that does saturating subtraction and
call that, inlining the sub var so we don't need to mess with it
further.

Otherwise seems fine.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Tom Lane
Date:
I spent a bit more time researching the portability implications of
this patch.  I think that we should check __ARM_NEON before #including
<arm_neon.h>; there is authoritative documentation out there telling
you to, eg [1], and I can see no upside at all to not checking.
We cannot check *only* __ARM_NEON, though.  I found it to get defined
by clang 8.0.0 in my Fedora 30 32-bit image, although that does not
provide all the instructions we want (I see "undefined function"
complaints for vmaxvq_u8 etc if I try to make it use the patch).
Looking into that installation's <arm_neon.h>, those functions are
defined conditionally if "__ARM_FP & 2", which is kind of interesting
--- per [1], that bit indicates support for 16-bit floating point,
which seems a mite unrelated.

It appears from the info at [2] that there are at least some 32-bit
ARM platforms that set that bit, implying (if the clang authors are
well informed) that they have the instructions we want.  But we
could not realistically make 32-bit builds that try to use those
instructions without a run-time test; such a build would fail for
too many people.  I doubt that a run-time test is worth the trouble,
so I concur with the idea of selecting NEON on aarch64 only and hoping
to thereby avoid a runtime test.

In short, I think the critical part of 0002 needs to look more like
this:

+#elif defined(__aarch64__) && defined(__ARM_NEON)
+/*
+ * We use the Neon instructions if the compiler provides access to them
+ * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
+ * technically optional for aarch64, it appears that all available 64-bit
+ * hardware does have it.  Neon exists in some 32-bit hardware too, but
+ * we could not realistically use it there without a run-time check,
+ * which seems not worth the trouble for now.
+ */
+#include <arm_neon.h>
+#define USE_NEON
...

Coding like this appears to work on both my Apple M1 and my Raspberry
Pi, with several different OSes checked on the latter.

            regards, tom lane

[1]
https://developer.arm.com/documentation/101754/0618/armclang-Reference/Other-Compiler-specific-Features/Predefined-macros
[2] http://micro-os-plus.github.io/develop/predefined-macros/



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
Thanks for taking a look.

On Sat, Aug 27, 2022 at 01:59:06PM +0700, John Naylor wrote:
> I don't forsee any use of emulating vector registers with uint64 if
> they only hold two ints. I wonder if it'd be better if all vector32
> functions were guarded with #ifndef NO_USE_SIMD. (I wonder if
> declarations without definitions cause warnings...)

Yeah.  I was a bit worried about the readability of this file with so many
#ifndefs, but after trying it out, I suppose it doesn't look _too_ bad.

> + * NB: This function assumes that each lane in the given vector either has all
> + * bits set or all bits zeroed, as it is mainly intended for use with
> + * operations that produce such vectors (e.g., vector32_eq()).  If this
> + * assumption is not true, this function's behavior is undefined.
> + */
> 
> Hmm?

Yup.  The problem is that AFAICT there's no equivalent to
_mm_movemask_epi8() on aarch64, so you end up with something like

    vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0

But for pg_lfind32(), we really just want to know if any lane is set, which
only requires a call to vmaxvq_u32().  I haven't had a chance to look too
closely, but my guess is that this ultimately results in an extra AND
operation in the aarch64 path, so maybe it doesn't impact performance too
much.  The other option would be to open-code the intrinsic function calls
into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
thing to do for now...  What do you think?

> -#elif defined(USE_SSE2)
> +#elif defined(USE_SSE2) || defined(USE_NEON)
> 
> I think we can just say #else.

Yes.

> -#if defined(USE_SSE2)
> - __m128i sub;
> +#ifndef USE_NO_SIMD
> + Vector8 sub;
> 
> +#elif defined(USE_NEON)
> +
> + /* use the same approach as the USE_SSE2 block above */
> + sub = vqsubq_u8(v, vector8_broadcast(c));
> + result = vector8_has_zero(sub);
> 
> I think we should invent a helper that does saturating subtraction and
> call that, inlining the sub var so we don't need to mess with it
> further.

Good idea, will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Sat, Aug 27, 2022 at 05:18:34PM -0400, Tom Lane wrote:
> In short, I think the critical part of 0002 needs to look more like
> this:
> 
> +#elif defined(__aarch64__) && defined(__ARM_NEON)
> +/*
> + * We use the Neon instructions if the compiler provides access to them
> + * (as indicated by __ARM_NEON) and we are on aarch64.  While Neon support is
> + * technically optional for aarch64, it appears that all available 64-bit
> + * hardware does have it.  Neon exists in some 32-bit hardware too, but
> + * we could not realistically use it there without a run-time check,
> + * which seems not worth the trouble for now.
> + */
> +#include <arm_neon.h>
> +#define USE_NEON
> ...

Thank you for the analysis!  I'll do it this way in the next patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Thomas Munro
Date:
On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> Yup.  The problem is that AFAICT there's no equivalent to
> _mm_movemask_epi8() on aarch64, so you end up with something like
>
>         vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0
>
> But for pg_lfind32(), we really just want to know if any lane is set, which
> only requires a call to vmaxvq_u32().  I haven't had a chance to look too
> closely, but my guess is that this ultimately results in an extra AND
> operation in the aarch64 path, so maybe it doesn't impact performance too
> much.  The other option would be to open-code the intrinsic function calls
> into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
> thing to do for now...  What do you think?

Ahh, this gives me a flashback to John's UTF-8 validation thread[1]
(the beginner NEON hackery in there was just a learning exercise,
sadly not followed up with real patches...).  He had
_mm_movemask_epi8(v) != 0 which I first translated to
to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
vmaxvq_u8(v) > 0x7F has the right effect without the and.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJjyXvS6W05kRVpH6Kng50%3DuOGxyiyjgPKm707JxQYHCg%40mail.gmail.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Sun, Aug 28, 2022 at 10:39:09AM +1200, Thomas Munro wrote:
> On Sun, Aug 28, 2022 at 10:12 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> Yup.  The problem is that AFAICT there's no equivalent to
>> _mm_movemask_epi8() on aarch64, so you end up with something like
>>
>>         vmaxvq_u8(vandq_u8(v, vector8_broadcast(0x80))) != 0
>>
>> But for pg_lfind32(), we really just want to know if any lane is set, which
>> only requires a call to vmaxvq_u32().  I haven't had a chance to look too
>> closely, but my guess is that this ultimately results in an extra AND
>> operation in the aarch64 path, so maybe it doesn't impact performance too
>> much.  The other option would be to open-code the intrinsic function calls
>> into pg_lfind.h.  I'm trying to avoid the latter, but maybe it's the right
>> thing to do for now...  What do you think?
> 
> Ahh, this gives me a flashback to John's UTF-8 validation thread[1]
> (the beginner NEON hackery in there was just a learning exercise,
> sadly not followed up with real patches...).  He had
> _mm_movemask_epi8(v) != 0 which I first translated to
> to_bool(bitwise_and(v, vmovq_n_u8(0x80))) and he pointed out that
> vmaxvq_u8(v) > 0x7F has the right effect without the and.

I knew there had to be an easier way!  I'll give this a try.  Thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
Here is a new patch set in which I've attempted to address all feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Sun, Aug 28, 2022 at 10:58 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> Here is a new patch set in which I've attempted to address all feedback.

Looks in pretty good shape. Some more comments:

+ uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ uint32 nelem_per_iteration = 4 * nelem_per_vector;

Using local #defines would be my style. I don't have a reason to
object to this way, but adding const makes these vars more clear.
Speaking of const:

- const __m128i tmp1 = _mm_or_si128(result1, result2);
- const __m128i tmp2 = _mm_or_si128(result3, result4);
- const __m128i result = _mm_or_si128(tmp1, tmp2);
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);

Any reason to throw away the const declarations?

+static inline bool
+vector32_is_highbit_set(const Vector32 v)
+{
+#ifdef USE_SSE2
+ return (_mm_movemask_epi8(v) & 0x8888) != 0;
+#endif
+}

I'm not sure why we need this function -- AFAICS it just adds more
work on x86 for zero benefit. For our present application, can we just
cast to Vector8 (for Arm's sake) and call the 8-bit version?

Aside from that, I plan on rewriting some comments for commit, some of
which pre-date this patch:

- * operations using bitwise operations on unsigned integers.
+ * operations using bitwise operations on unsigned integers.  Note that many
+ * of the functions in this file presently do not have non-SIMD
+ * implementations.

It's unclear to the reader whether this is a matter of 'round-to-it's.
I'd like to document what I asserted in this thread, that it's likely
not worthwhile to do anything with a uint64 representing two 32-bit
ints. (It *is* demonstrably worth it for handling 8 byte-values at a
time)

  * Use saturating subtraction to find bytes <= c, which will present as
- * NUL bytes in 'sub'.
+ * NUL bytes.

I'd like to to point out that the reason to do it this way is to
workaround SIMD architectures frequent lack of unsigned comparison.

+ * Return the result of subtracting the respective elements of the input
+ * vectors using saturation.

I wonder if we should explain briefly what saturating arithmetic is. I
had never encountered it outside of a SIMD programming context.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Tom Lane
Date:
John Naylor <john.naylor@enterprisedb.com> writes:
> I wonder if we should explain briefly what saturating arithmetic is. I
> had never encountered it outside of a SIMD programming context.

+1, it's at least worth a sentence to define the term.

            regards, tom lane



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Mon, Aug 29, 2022 at 11:25:50AM +0700, John Naylor wrote:
> + uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
> + uint32 nelem_per_iteration = 4 * nelem_per_vector;
> 
> Using local #defines would be my style. I don't have a reason to
> object to this way, but adding const makes these vars more clear.

I added const.

> Speaking of const:
> 
> - const __m128i tmp1 = _mm_or_si128(result1, result2);
> - const __m128i tmp2 = _mm_or_si128(result3, result4);
> - const __m128i result = _mm_or_si128(tmp1, tmp2);
> + tmp1 = vector32_or(result1, result2);
> + tmp2 = vector32_or(result3, result4);
> + result = vector32_or(tmp1, tmp2);
> 
> Any reason to throw away the const declarations?

The only reason is because I had to move the declarations to before the
vector32_load() calls.

> +static inline bool
> +vector32_is_highbit_set(const Vector32 v)
> +{
> +#ifdef USE_SSE2
> + return (_mm_movemask_epi8(v) & 0x8888) != 0;
> +#endif
> +}
> 
> I'm not sure why we need this function -- AFAICS it just adds more
> work on x86 for zero benefit. For our present application, can we just
> cast to Vector8 (for Arm's sake) and call the 8-bit version?

Good idea.

> - * operations using bitwise operations on unsigned integers.
> + * operations using bitwise operations on unsigned integers.  Note that many
> + * of the functions in this file presently do not have non-SIMD
> + * implementations.
> 
> It's unclear to the reader whether this is a matter of 'round-to-it's.
> I'd like to document what I asserted in this thread, that it's likely
> not worthwhile to do anything with a uint64 representing two 32-bit
> ints. (It *is* demonstrably worth it for handling 8 byte-values at a
> time)

Done.

>   * Use saturating subtraction to find bytes <= c, which will present as
> - * NUL bytes in 'sub'.
> + * NUL bytes.
> 
> I'd like to to point out that the reason to do it this way is to
> workaround SIMD architectures frequent lack of unsigned comparison.

Done.

> + * Return the result of subtracting the respective elements of the input
> + * vectors using saturation.
> 
> I wonder if we should explain briefly what saturating arithmetic is. I
> had never encountered it outside of a SIMD programming context.

Done.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Mon, Aug 29, 2022 at 12:44 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> [v6]

Pushed with a couple comment adjustments, let's see what the build
farm thinks...

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Mon, Aug 29, 2022 at 11:25 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
> +static inline bool
> +vector32_is_highbit_set(const Vector32 v)
> +{
> +#ifdef USE_SSE2
> + return (_mm_movemask_epi8(v) & 0x8888) != 0;
> +#endif
> +}
>
> I'm not sure why we need this function -- AFAICS it just adds more
> work on x86 for zero benefit. For our present application, can we just
> cast to Vector8 (for Arm's sake) and call the 8-bit version?

It turns out MSVC animal drongo doesn't like this cast -- on x86 they
are the same underlying type. Will look into that as more results come
in.

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Mon, Aug 29, 2022 at 3:19 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> It turns out MSVC animal drongo doesn't like this cast -- on x86 they
> are the same underlying type. Will look into that as more results come
> in.

Here's the simplest fix I can think of:

/*
 * Exactly like vector8_is_highbit_set except for the input type, so
it still looks
 * at each _byte_ separately.
 *
 * XXX x86 uses the same underlying type for vectors with 8-bit,
16-bit, and 32-bit
 * integer elements, but Arm does not, hence the need for a separate function.
 * We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e. check each
 * 32-bit element, but that would require an additional mask operation on x86.
 */
static inline bool
vector32_is_highbit_set(const Vector32 v)
{
#if defined(USE_NEON)
    return vector8_is_highbit_set((Vector8) v);
#else
    return vector8_is_highbit_set(v);
#endif
}

-- 
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Mon, Aug 29, 2022 at 4:28 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> Here's the simplest fix I can think of:
>
> /*
>  * Exactly like vector8_is_highbit_set except for the input type, so
> it still looks
>  * at each _byte_ separately.
>  *
>  * XXX x86 uses the same underlying type for vectors with 8-bit,
> 16-bit, and 32-bit
>  * integer elements, but Arm does not, hence the need for a separate function.
>  * We could instead adopt the behavior of Arm's vmaxvq_u32(), i.e. check each
>  * 32-bit element, but that would require an additional mask operation on x86.
>  */
> static inline bool
> vector32_is_highbit_set(const Vector32 v)
> {
> #if defined(USE_NEON)
>     return vector8_is_highbit_set((Vector8) v);
> #else
>     return vector8_is_highbit_set(v);
> #endif
> }

Bowerbird just reported the same error, so I went ahead and pushed a
fix with this.

--
John Naylor
EDB: http://www.enterprisedb.com



Re: use ARM intrinsics in pg_lfind32() where available

From
Nathan Bossart
Date:
On Mon, Aug 29, 2022 at 05:49:46PM +0700, John Naylor wrote:
> Bowerbird just reported the same error, so I went ahead and pushed a
> fix with this.

Thanks!  I've attached a follow-up patch with a couple of small
suggestions.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: use ARM intrinsics in pg_lfind32() where available

From
John Naylor
Date:
On Tue, Aug 30, 2022 at 12:17 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> Thanks!  I've attached a follow-up patch with a couple of small
> suggestions.

Pushed, thanks!

-- 
John Naylor
EDB: http://www.enterprisedb.com