Thread: add AVX2 support to simd.h

add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Nov 22, 2023 at 12:49:35PM -0600, Nathan Bossart wrote:
> On Wed, Nov 22, 2023 at 02:54:13PM +0200, Ants Aasma wrote:
>> For reference, executing the page checksum 10M times on an AMD 3900X CPU:
>> 
>> clang-14 -O2                 4.292s (17.8 GiB/s)
>> clang-14 -O2 -msse4.1        2.859s (26.7 GiB/s)
>> clang-14 -O2 -msse4.1 -mavx2 1.378s (55.4 GiB/s)
> 
> Nice.  I've noticed similar improvements with AVX2 intrinsics in simd.h.

I've alluded to this a few times now, so I figured I'd park the patch and
preliminary benchmarks in a new thread while we iron out how to support
newer instructions (see discussion here [0]).

Using the same benchmark as we did for the SSE2 linear searches in
XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

  writers    sse2    avx2     %
      256    1195    1188    -1
      512     928    1054   +14
     1024     633     716   +13
     2048     332     420   +27
     4096     162     203   +25
     8192     162     182   +12

It's been a while since I ran these benchmarks, but I vaguely recall also
seeing something like a 50% improvement for a dedicated pg_lfind32()
benchmark on long arrays.

As is, the patch likely won't do anything unless you add -mavx2 or
-march=native to your CFLAGS.  I don't intend for this patch to be
seriously considered until we have better support for detecting/compiling
AVX2 instructions and a buildfarm machine that uses them.

I plan to start another thread for AVX2 support for the page checksums.

[0] https://postgr.es/m/20231107024734.GB729644%40nathanxps13
[1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
[2] https://postgr.es/m/20220713170950.GA3116318%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> I don't intend for this patch to be
> seriously considered until we have better support for detecting/compiling
> AVX2 instructions and a buildfarm machine that uses them.

That's completely understandable, yet I'm confused why there is a
commitfest entry for it marked "needs review".



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Mon, Jan 01, 2024 at 07:12:26PM +0700, John Naylor wrote:
> On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I don't intend for this patch to be
>> seriously considered until we have better support for detecting/compiling
>> AVX2 instructions and a buildfarm machine that uses them.
> 
> That's completely understandable, yet I'm confused why there is a
> commitfest entry for it marked "needs review".

Perhaps I was too optimistic about adding support for newer instructions...

I'm tempted to propose that we move forward with this patch as-is after
adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.
There is likely still follow-up work to make these improvements more
accessible, but I'm not sure that is a strict prerequisite here.

(In case it isn't clear, I'm volunteering to set up such a buildfarm
machine.)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I'm tempted to propose that we move forward with this patch as-is after
> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.
> There is likely still follow-up work to make these improvements more
> accessible, but I'm not sure that is a strict prerequisite here.

The patch needs better comments (as in, more than "none whatsoever").
It doesn't need to be much though, perhaps like

+#if defined(__AVX2__)
+
+/*
+ * When compiled with -mavx2 or allied options, we prefer AVX2 instructions.
+ */
+#include <immintrin.h>
+#define USE_AVX2
+typedef __m256i Vector8;
+typedef __m256i Vector32;

Also, do you really want to structure the header so that USE_SSE2
doesn't get defined?  In that case you are committing to provide
an AVX2 replacement every single place that there's USE_SSE2, which
doesn't seem like a great thing to require.  OTOH, maybe there's
no choice given that we need a different definition for Vector8 and
Vector32?

            regards, tom lane



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Jan 02, 2024 at 12:50:04PM -0500, Tom Lane wrote:
> The patch needs better comments (as in, more than "none whatsoever").

Yes, will do.

> Also, do you really want to structure the header so that USE_SSE2
> doesn't get defined?  In that case you are committing to provide
> an AVX2 replacement every single place that there's USE_SSE2, which
> doesn't seem like a great thing to require.  OTOH, maybe there's
> no choice given that we need a different definition for Vector8 and
> Vector32?

Yeah, the precedent is to use these abstracted types elsewhere so that any
SIMD-related improvements aren't limited to one architecture.  There are a
couple of places that do explicitly check for USE_NO_SIMD, though.  Maybe
there's an eventual use-case for using SSE2 intrinsics even when you have
AVX2 support, but for now, ensuring we have an AVX2 replacement for
everything doesn't seem particularly burdensome.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Perhaps I was too optimistic about adding support for newer instructions...
>
> I'm tempted to propose that we move forward with this patch as-is after
> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.

That means that we would be on the hook to fix it if it breaks, even
though nothing uses it yet in a normal build. I have pending patches
that will break, or get broken by, this, so minus-many from me until
there is an availability story.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Jan 03, 2024 at 09:13:52PM +0700, John Naylor wrote:
> On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm tempted to propose that we move forward with this patch as-is after
>> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.
> 
> That means that we would be on the hook to fix it if it breaks, even
> though nothing uses it yet in a normal build. I have pending patches
> that will break, or get broken by, this, so minus-many from me until
> there is an availability story.

How will this break your patches?  Is it just a matter of adding more AVX2
support, or something else?

If the requirement is that normal builds use AVX2, then I fear we will be
waiting a long time.  IIUC the current proposals (building multiple
binaries or adding a configuration option that maps to compiler flags)
would still be opt-in, and I'm not sure we can mandate AVX2 support for all
x86_64 builds anytime soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Jan 02, 2024 at 10:11:23AM -0600, Nathan Bossart wrote:
> (In case it isn't clear, I'm volunteering to set up such a buildfarm
> machine.)

I set up "akepa" to run with -march=x86-64-v3.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> If the requirement is that normal builds use AVX2, then I fear we will be
> waiting a long time.  IIUC the current proposals (building multiple
> binaries or adding a configuration option that maps to compiler flags)
> would still be opt-in,

If and when we get one of those, I would consider that a "normal"
build. Since there are no concrete proposals yet, I'm still waiting
for you to justify imposing an immediate maintenance cost for zero
benefit.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Fri, Jan 05, 2024 at 09:03:39AM +0700, John Naylor wrote:
> On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> If the requirement is that normal builds use AVX2, then I fear we will be
>> waiting a long time.  IIUC the current proposals (building multiple
>> binaries or adding a configuration option that maps to compiler flags)
>> would still be opt-in,
> 
> If and when we get one of those, I would consider that a "normal"
> build. Since there are no concrete proposals yet, I'm still waiting
> for you to justify imposing an immediate maintenance cost for zero
> benefit.

I've been thinking about the configuration option approach.  ISTM that
would be the most feasible strategy, at least for v17.  A couple things
come to mind:

* This option would simply map to existing compiler flags.  We already have
  ways to provide those (-Dc_args in meson, CFLAGS in autoconf).  Perhaps
  we'd want to provide our own shorthand for certain platforms (e.g., ARM),
  but that will still just be shorthand for compiler flags.

* Such an option would itself generate some maintenance cost.  That could
  be worth it because it formalizes the Postgres support for those options,
  but it's still one more thing to track.

Another related option could be to simply document that we have support for
some newer instructions that can be enabled by setting the aforementioned
compiler flags.  That's perhaps a little less user-friendly, but it'd avoid
the duplication and possibly reduce the maintenance cost.  I also wonder if
it'd help prevent confusion when CFLAGS and this extra option conflict.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

> Using the same benchmark as we did for the SSE2 linear searches in
> XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:

I've been antagonistic towards the patch itself, but it'd be more
productive if I paid some nuanced attention to the problem it's trying
to solve. First, I'd like to understand the benchmark a bit better.

>   writers    sse2    avx2     %
>       256    1195    1188    -1
>       512     928    1054   +14
>      1024     633     716   +13
>      2048     332     420   +27
>      4096     162     203   +25
>      8192     162     182   +12

There doesn't seem to be any benefit at 256 at all. Is that expected
and/or fine?

> It's been a while since I ran these benchmarks, but I vaguely recall also
> seeing something like a 50% improvement for a dedicated pg_lfind32()
> benchmark on long arrays.

The latest I see in
https://www.postgresql.org/message-id/20220808223254.GA1393216%40nathanxps13

writers  head  patch
8        672   680
16       639   664
32       701   689
64       705   703
128      628   653
256      576   627
512      530   584
768      450   536
1024     350   494

Here, the peak throughput seems to be around 64 writers with or
without the patch from a couple years ago, but the slope is shallower
after that. It would be good to make sure that it can't regress near
the peak, even with a "long tail" case (see next paragraph). The first
benchmark above starts at 256, so we can't tell where the peak is. It
might be worth it to also have a microbenchmark because the systemic
one has enough noise to obscure what's going on unless there are a
very large number of writers. We know what a systemic benchmark can
tell us on extreme workloads past the peak, and the microbenchmark
would tell us "we need to see X improvement here in order to see Y
improvement in the system benchmark".

I suspect that there could be a regression lurking for some inputs
that the benchmark doesn't look at: pg_lfind32() currently needs to be
able to read 4 vector registers worth of elements before taking the
fast path. There is then a tail of up to 15 elements that are now
checked one-by-one, but AVX2 would increase that to 31. That's getting
big enough to be noticeable, I suspect. It would be good to understand
that case (n*32 + 31), because it may also be relevant now. It's also
easy to improve for SSE2/NEON for v17.
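
To spell out the arithmetic (a rough illustration; the 4-registers-per-iteration
structure is the existing pg_lfind32() layout, the variable names here are
made up):

    uint32 elems_per_vector = sizeof(Vector32) / sizeof(uint32); /* 4 with SSE2, 8 with AVX2 */
    uint32 elems_per_block  = 4 * elems_per_vector;              /* 16 with SSE2, 32 with AVX2 */
    uint32 max_tail         = elems_per_block - 1;               /* 15 with SSE2, 31 with AVX2 */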

Also, by reading 4 registers per loop iteration, that's 128 bytes on
AVX2. I'm not sure that matters, but we shouldn't assume it doesn't.
Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1,
2, or 4 registers to handle it, depending on architecture. Whether or
not that's worth it in this case, this patch does mean future patches
will have to wonder if they have to do anything differently depending
on vector length, whereas now they don't. That's not a deal-breaker,
but it is a trade-off to keep in mind.



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Sat, Jan 6, 2024 at 12:04 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> I've been thinking about the configuration option approach.  ISTM that
> would be the most feasible strategy, at least for v17.  A couple things
> come to mind:
>
> * This option would simply map to existing compiler flags.  We already have
>   ways to provide those (-Dc_args in meson, CFLAGS in autoconf).  Perhaps
>   we'd want to provide our own shorthand for certain platforms (e.g., ARM),
>   but that will still just be shorthand for compiler flags.
>
> * Such an option would itself generate some maintenance cost.  That could
>   be worth it because it formalizes the Postgres support for those options,
>   but it's still one more thing to track.
>
> Another related option could be to simply document that we have support for
> some newer instructions that can be enabled by setting the aforementioned
> compiler flags.  That's perhaps a little less user-friendly, but it'd avoid
> the duplication and possibly reduce the maintenance cost.  I also wonder if
> it'd help prevent confusion when CFLAGS and this extra option conflict.

The last one might offer more graceful forward compatibility if the
multiple-binaries idea gets any traction some day, because at that
point the additional config options would not be needed, I think.

Another consideration is which way would touch the fewest places to
work with Windows, which uses the spelling /arch:AVX2 etc.

One small thing I would hope for from the final version of this is
the ability to inline things where we currently indirect depending on
a run-time check. That seems like "just work" on top of everything
else, and I don't think it makes a case for either of the above.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Mon, Jan 08, 2024 at 02:01:39PM +0700, John Naylor wrote:
> On Thu, Nov 30, 2023 at 12:15 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>>   writers    sse2    avx2     %
>>       256    1195    1188    -1
>>       512     928    1054   +14
>>      1024     633     716   +13
>>      2048     332     420   +27
>>      4096     162     203   +25
>>      8192     162     182   +12
> 
> There doesn't seem to be any benefit at 256 at all. Is that expected
> and/or fine?

My unverified assumption is that the linear searches make up much less of
the benchmark at these lower client counts, so any improvements we make to
them are unlikely to show up in the results.  IIRC even the hash table
approach that we originally explored for XidInMVCCSnapshot() didn't do
much, if anything, for the benchmark at lower client counts.

> Here, the peak throughput seems to be around 64 writers with or
> without the patch from a couple years ago, but the slope is shallower
> after that. It would be good to make sure that it can't regress near
> the peak, even with a "long tail" case (see next paragraph). The first
> benchmark above starts at 256, so we can't tell where the peak is. It
> might be worth it to also have a microbenchmark because the systemic
> one has enough noise to obscure what's going on unless there are a
> very large number of writers. We know what a systemic benchmark can
> tell us on extreme workloads past the peak, and the microbenchmark
> would tell us "we need to see X improvement here in order to see Y
> improvement in the system benchmark".

Yes, will do.

> I suspect that there could be a regression lurking for some inputs
> that the benchmark doesn't look at: pg_lfind32() currently needs to be
> able to read 4 vector registers worth of elements before taking the
> fast path. There is then a tail of up to 15 elements that are now
> checked one-by-one, but AVX2 would increase that to 31. That's getting
> big enough to be noticeable, I suspect. It would be good to understand
> that case (n*32 + 31), because it may also be relevant now. It's also
> easy to improve for SSE2/NEON for v17.

Good idea.  If it is indeed noticeable, we might be able to "fix" it by
processing some of the tail with shorter vectors.  But that probably means
finding a way to support multiple vector sizes on the same build, which
would require some work.

> Also, by reading 4 registers per loop iteration, that's 128 bytes on
> AVX2. I'm not sure that matters, but we shouldn't assume it doesn't.
> Code I've seen elsewhere reads a fixed 64-byte block, and then uses 1,
> 2, or 4 registers to handle it, depending on architecture. Whether or
> not that's worth it in this case, this patch does mean future patches
> will have to wonder if they have to do anything differently depending
> on vector length, whereas now they don't. That's not a deal-breaker,
> but it is a trade-off to keep in mind.

Yeah.  Presently, this AVX2 patch just kicks the optimization down the road
a bit for the existing use-cases, so you don't start using the vector
registers until there's more data to work with, which might not even be
noticeable.  But it's conceivable that vector length could matter at some
point, even if it doesn't matter much now.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > I suspect that there could be a regression lurking for some inputs
> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> > able to read 4 vector registers worth of elements before taking the
> > fast path. There is then a tail of up to 15 elements that are now
> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> > big enough to be noticeable, I suspect. It would be good to understand
> > that case (n*32 + 31), because it may also be relevant now. It's also
> > easy to improve for SSE2/NEON for v17.
>
> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> processing some of the tail with shorter vectors.  But that probably means
> finding a way to support multiple vector sizes on the same build, which
> would require some work.

What I had in mind was an overlapping pattern I've seen in various
places: do one iteration at the beginning, then subtract the
aligned-down length from the end and do all those iterations. And
one-by-one is only used if the total length is small.



Re: add AVX2 support to simd.h

From
Peter Eisentraut
Date:
On 29.11.23 18:15, Nathan Bossart wrote:
> Using the same benchmark as we did for the SSE2 linear searches in
> XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> 
>    writers    sse2    avx2     %
>        256    1195    1188    -1
>        512     928    1054   +14
>       1024     633     716   +13
>       2048     332     420   +27
>       4096     162     203   +25
>       8192     162     182   +12

AFAICT, your patch merely provides an alternative AVX2 implementation 
for where currently SSE2 is supported, but it doesn't provide any new 
API calls or new functionality.  One might naively expect that these are 
just two different ways to call the underlying primitives in the CPU, so 
these performance improvements are surprising to me.  Or do the CPUs 
actually have completely separate machinery for SSE2 and AVX2, and just 
using the latter to do the same thing is faster?




Re: add AVX2 support to simd.h

From
Ants Aasma
Date:
On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut <peter@eisentraut.org> wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> >    writers    sse2    avx2     %
> >        256    1195    1188    -1
> >        512     928    1054   +14
> >       1024     633     716   +13
> >       2048     332     420   +27
> >       4096     162     203   +25
> >       8192     162     182   +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality.  One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me.  Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register. On most current
processors the throughput of the instructions in question is the same
on 256-bit vectors as on 128-bit vectors. Basically, the chip has an
AVX2's worth of machinery and using SSE2 leaves half of it unused.
Notable exceptions are efficiency cores on recent Intel desktop CPUs
and AMD CPUs pre-Zen 2, where AVX2 instructions are internally split up
into two 128-bit wide instructions.

For AVX-512 the picture is much more complicated. Some instructions run
at half rate, some at full rate but not on all ALU ports, and some
cause aggressive clock-rate reduction on some microarchitectures.
AVX-512 adds mask registers and masked vector instructions that enable
quite a bit simpler code in many cases. Interestingly, I have seen
Clang make quite effective use of these masked instructions even when
using AVX2 intrinsics, but targeting an AVX-512 capable platform.

The vector width independent approach used in the patch is nice for
simple cases by not needing a separate implementation for each vector
width. However for more complicated cases where "horizontal"
operations are needed it's going to be much less useful. But these
cases can easily just drop down to using intrinsics directly.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>> > I suspect that there could be a regression lurking for some inputs
>> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
>> > able to read 4 vector registers worth of elements before taking the
>> > fast path. There is then a tail of up to 15 elements that are now
>> > checked one-by-one, but AVX2 would increase that to 31. That's getting
>> > big enough to be noticeable, I suspect. It would be good to understand
>> > that case (n*32 + 31), because it may also be relevant now. It's also
>> > easy to improve for SSE2/NEON for v17.
>>
>> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
>> processing some of the tail with shorter vectors.  But that probably means
>> finding a way to support multiple vector sizes on the same build, which
>> would require some work.
> 
> What I had in mind was an overlapping pattern I've seen in various
> places: do one iteration at the beginning, then subtract the
> aligned-down length from the end and do all those iterations. And
> one-by-one is only used if the total length is small.

Sorry, I'm not sure I understood this.  Do you mean processing the first
several elements individually or with SSE2 until the number of remaining
elements can be processed with just the AVX2 instructions (a bit like how
pg_comp_crc32c_armv8() is structured for memory alignment)?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Ants Aasma
Date:
On Tue, 9 Jan 2024 at 18:20, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

For some operations (min, max, = any) processing the same elements
multiple times doesn't change the result. So the vectors for first
and/or last iterations can overlap with the main loop. In other cases
it's possible to mask out the invalid elements and replace them with
zeroes. Something along the lines of:

static inline Vector8
vector8_mask_right(int num_valid)
{
    __m256i seq = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
                                  23, 22, 21, 20, 19, 18, 17, 16,
                                  15, 14, 13, 12, 11, 10, 9, 8,
                                  7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_cmpgt_epi8(_mm256_set1_epi8(num_valid), seq);
}

/* final incomplete iteration */
Vector8 mask = vector8_mask_right(end - cur);
final_vec = vector8_and(_mm256_loadu_si256((const __m256i *) (end - sizeof(Vector8))),
                        mask);
accum = vector8_add(accum, final_vec);

It helps that on any halfway recent x86 unaligned loads only have a
minor performance penalty and only when straddling cache line
boundaries. Not sure what the state on ARM is. If we don't care about
unaligned loads then we only need to care about the load not crossing
page boundaries which could cause segfaults. Though I'm sure memory
sanitizer tools will have plenty to complain about around such hacks.



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Jan 9, 2024 at 11:20 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

If we have say 25 elements, I mean (for SSE2) check the first 16, then
the last 16. Some will be checked twice, but that's okay.
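
A minimal sketch of that, where block_check() is a hypothetical helper
that runs one 4-register SSE2 comparison over 16 elements:

    /* nelem = 25: elements 9..15 get checked twice, which is harmless here */
    if (block_check(key, &base[0]))              /* first block: elements [0, 16) */
        return true;
    return block_check(key, &base[nelem - 16]);  /* last block:  elements [9, 25) */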



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Jan 10, 2024 at 09:06:08AM +0700, John Naylor wrote:
> If we have say 25 elements, I mean (for SSE2) check the first 16, then
> the last 16. Some will be checked twice, but that's okay.

I finally got around to trying this.  0001 adds this overlapping logic.
0002 is a rebased version of the AVX2 patch (it needed some updates after
commit 9f225e9).  And 0003 is a benchmark for test_lfind32().  It runs
pg_lfind32() on an array of the given size 100M times.

I've also attached the results of running this benchmark on my machine at
HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
appears to work pretty well.  When there is a small "tail," it regresses a
small amount, but overall, it seems to improve more cases than it harms.
0002 does regress searches on smaller arrays quite a bit, since it
postpones the SIMD optimizations until the arrays are longer.  It might be
possible to mitigate by using 2 registers when the "tail" is long enough,
but I have yet to try that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote:
> I've also attached the results of running this benchmark on my machine at
> HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
> appears to work pretty well.  When there is a small "tail," it regresses a
> small amount, but overall, it seems to improve more cases than it harms.
> 0002 does regress searches on smaller arrays quite a bit, since it
> postpones the SIMD optimizations until the arrays are longer.  It might be
> possible to mitigate by using 2 registers when the "tail" is long enough,
> but I have yet to try that.

The attached 0003 is a sketch of what such mitigation might look like.  It
appears to help with the regressions nicely.  I omitted the benchmarking
patch in v3 to appease cfbot.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Sat, Mar 16, 2024 at 2:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote:
> > I've also attached the results of running this benchmark on my machine at
> > HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
> > appears to work pretty well.  When there is a small "tail," it regresses a
> > small amount, but overall, it seems to improve more cases than it harms.
> > 0002 does regress searches on smaller arrays quite a bit, since it
> > postpones the SIMD optimizations until the arrays are longer.  It might be
> > possible to mitigate by using 2 registers when the "tail" is long enough,
> > but I have yet to try that.
>
> The attached 0003 is a sketch of what such mitigation might look like.  It
> appears to help with the regressions nicely.  I omitted the benchmarking
> patch in v3 to appease cfbot.

I haven't looked at the patches, but the graphs look good.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> I haven't looked at the patches, but the graphs look good.

I spent some more time on these patches.  Specifically, I reordered them to
demonstrate the effects on systems without AVX2 support.  I've also added a
shortcut to jump to the one-by-one approach when there aren't many
elements, as the overhead becomes quite noticeable otherwise.  Finally, I
ran the same benchmarks again on x86 and Arm out to 128 elements.

Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
if it's possible to improve the style a bit.  0003 at least needs a big
comment in simd.h, and it might need a note in the documentation, too.  If
the approach in this patch set seems reasonable, I'll spend some time on
that.

BTW I did try to add some other optimizations, such as processing remaining
elements with only one vector and trying to use the overlapping strategy
with more registers if we know there are relatively many remaining
elements.  These other approaches all added a lot of complexity and began
hurting performance, and I've probably already spent way too much time
optimizing a linear search, so this is where I've decided to stop.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Mar 19, 2024 at 9:03 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> > I haven't looked at the patches, but the graphs look good.
>
> I spent some more time on these patches.  Specifically, I reordered them to
> demonstrate the effects on systems without AVX2 support.  I've also added a
> shortcut to jump to the one-by-one approach when there aren't many
> elements, as the overhead becomes quite noticeable otherwise.  Finally, I
> ran the same benchmarks again on x86 and Arm out to 128 elements.
>
> Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
> if it's possible to improve the style a bit.

I took a brief look, and 0001 isn't quite what I had in mind. I can't
quite tell what it's doing with the additional branches and "goto
retry", but I meant something pretty simple:

- if short, do one element at a time and return
- if long, do one block unconditionally, then round the start pointer
up so that "end - start" is an exact multiple of blocks, and loop over
them



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote:
> I took a brief look, and 0001 isn't quite what I had in mind. I can't
> quite tell what it's doing with the additional branches and "goto
> retry", but I meant something pretty simple:

Do you mean 0002?  0001 just adds a 2-register loop for remaining elements
once we've exhausted what can be processed with the 4-register loop.

> - if short, do one element at a time and return

0002 does this.

> - if long, do one block unconditionally, then round the start pointer
> up so that "end - start" is an exact multiple of blocks, and loop over
> them

0002 does the opposite of this.  That is, after we've completed as many
blocks as possible, we move the iterator variable back to "end -
block_size" and do one final iteration to cover all the remaining elements.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote:
> > I took a brief look, and 0001 isn't quite what I had in mind. I can't
> > quite tell what it's doing with the additional branches and "goto
> > retry", but I meant something pretty simple:
>
> Do you mean 0002?  0001 just adds a 2-register loop for remaining elements
> once we've exhausted what can be processed with the 4-register loop.

Sorry, I was looking at v2 at the time.

> > - if short, do one element at a time and return
>
> 0002 does this.

That part looks fine.

> > - if long, do one block unconditionally, then round the start pointer
> > up so that "end - start" is an exact multiple of blocks, and loop over
> > them
>
> 0002 does the opposite of this.  That is, after we've completed as many
> blocks as possible, we move the iterator variable back to "end -
> block_size" and do one final iteration to cover all the remaining elements.

Sounds similar in principle, but it looks really complicated. I don't
think the additional loops and branches are a good way to go, either
for readability or for branch prediction. My sketch has one branch for
which loop to do, and then performs only one loop. Let's do the
simplest thing that could work. (I think we might need a helper
function to do the block, but the rest should be easy)



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Mar 19, 2024 at 04:53:04PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> 0002 does the opposite of this.  That is, after we've completed as many
>> blocks as possible, we move the iterator variable back to "end -
>> block_size" and do one final iteration to cover all the remaining elements.
> 
> Sounds similar in principle, but it looks really complicated. I don't
> think the additional loops and branches are a good way to go, either
> for readability or for branch prediction. My sketch has one branch for
> which loop to do, and then performs only one loop. Let's do the
> simplest thing that could work. (I think we might need a helper
> function to do the block, but the rest should be easy)

I tried to trim some of the branches, and came up with the attached patch.
I don't think this is exactly what you were suggesting, but I think it's
relatively close.  My testing showed decent benefits from using 2 vectors
when there aren't enough elements for 4, so I've tried to keep that part
intact.  This changes pg_lfind32() to something like:

    if not many elements
        process one by one

    while enough elements for 4 registers remain
        process with 4 registers

    if no elements remain
        return false

    if more than 2-registers-worth of elements remain
        do one iteration with 2 registers

    do another iteration on last 2-registers-worth of elements
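
Rendered as rough C for clarity -- check_block4() and check_block2() are
hypothetical stand-ins for the 4- and 2-register comparison blocks, and
the function name and thresholds are only illustrative, not the patch
itself:

    static inline bool
    lfind32_sketch(uint32 key, const uint32 *base, uint32 nelem)
    {
        const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
        const uint32 nelem_per_iteration = 4 * nelem_per_vector;
        uint32      i = 0;

        /* not many elements: process one by one */
        if (nelem < nelem_per_iteration)
        {
            for (; i < nelem; i++)
                if (base[i] == key)
                    return true;
            return false;
        }

        /* while enough elements for 4 registers remain */
        for (; i + nelem_per_iteration <= nelem; i += nelem_per_iteration)
            if (check_block4(key, &base[i]))
                return true;

        /* no elements remain */
        if (i == nelem)
            return false;

        /* more than 2-registers-worth remain: one extra 2-register pass */
        if (nelem - i > 2 * nelem_per_vector)
            if (check_block2(key, &base[i]))
                return true;

        /* final pass over the last 2-registers-worth (may overlap) */
        return check_block2(key, &base[nelem - 2 * nelem_per_vector]);
    }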

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> > Sounds similar in principle, but it looks really complicated. I don't
> > think the additional loops and branches are a good way to go, either
> > for readability or for branch prediction. My sketch has one branch for
> > which loop to do, and then performs only one loop. Let's do the
> > simplest thing that could work. (I think we might need a helper
> > function to do the block, but the rest should be easy)
>
> I tried to trim some of the branches, and came up with the attached patch.
> I don't think this is exactly what you were suggesting, but I think it's
> relatively close.  My testing showed decent benefits from using 2 vectors
> when there aren't enough elements for 4, so I've tried to keep that part
> intact.

I would caution against that if the benchmark is repeatedly running
against a static number of elements, because the branch predictor will
be right all the time (except maybe when it exits a loop, not sure).
We probably don't need to go to the trouble to construct a benchmark
with some added randomness, but we have to be careful not to overfit what
the test is actually measuring.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I tried to trim some of the branches, and came up with the attached patch.
>> I don't think this is exactly what you were suggesting, but I think it's
>> relatively close.  My testing showed decent benefits from using 2 vectors
>> when there aren't enough elements for 4, so I've tried to keep that part
>> intact.
> 
> I would caution against that if the benchmark is repeatedly running
> against a static number of elements, because the branch predictor will
> be right all the time (except maybe when it exits a loop, not sure).
> We probably don't need to go to the trouble to construct a benchmark
> with some added randomness, but we have to be careful not to overfit what
> the test is actually measuring.

I don't mind removing the 2-register stuff if that's what you think we
should do.  I'm cautiously optimistic that it'd help more than the extra
branch prediction might hurt, and it'd at least help avoid regressing the
lower end for the larger AVX2 registers, but I probably won't be able to
prove that without constructing another benchmark.  And TBH I'm not sure
it'll significantly impact any real-world workload, anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:
> On Wed, Mar 20, 2024 at 01:57:54PM +0700, John Naylor wrote:
>> On Tue, Mar 19, 2024 at 11:30 PM Nathan Bossart
>> <nathandbossart@gmail.com> wrote:
>>> I tried to trim some of the branches, and came up with the attached patch.
>>> I don't think this is exactly what you were suggesting, but I think it's
>>> relatively close.  My testing showed decent benefits from using 2 vectors
>>> when there aren't enough elements for 4, so I've tried to keep that part
>>> intact.
>> 
>> I would caution against that if the benchmark is repeatedly running
>> against a static number of elements, because the branch predictor will
>> be right all the time (except maybe when it exits a loop, not sure).
>> We probably don't need to go to the trouble to construct a benchmark
>> with some added randomness, but we have to be careful not to overfit what
>> the test is actually measuring.
> 
> I don't mind removing the 2-register stuff if that's what you think we
> should do.  I'm cautiously optimistic that it'd help more than the extra
> branch prediction might hurt, and it'd at least help avoid regressing the
> lower end for the larger AVX2 registers, but I probably won't be able to
> prove that without constructing another benchmark.  And TBH I'm not sure
> it'll significantly impact any real-world workload, anyway.

Here's a new version of the patch set with the 2-register stuff removed,
plus a fresh run of the benchmark.  The weird spike for AVX2 is what led me
down the 2-register path earlier.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Thu, Mar 21, 2024 at 2:55 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 09:31:16AM -0500, Nathan Bossart wrote:

> > I don't mind removing the 2-register stuff if that's what you think we
> > should do.  I'm cautiously optimistic that it'd help more than the extra
> > branch prediction might hurt, and it'd at least help avoid regressing the
> > lower end for the larger AVX2 registers, but I probably won't be able to
> > prove that without constructing another benchmark.  And TBH I'm not sure
> > it'll significantly impact any real-world workload, anyway.
>
> Here's a new version of the patch set with the 2-register stuff removed,

I'm much happier about v5-0001. With a small tweak it would match what
I had in mind:

+ if (nelem < nelem_per_iteration)
+ goto one_by_one;

If this were "<=" then for long arrays we could assume there is
always more than one block, and wouldn't need to check if any elements
remain -- first block, then a single loop and it's done.

The loop could also then be a "do while" since it doesn't have to
check the exit condition up front.

> plus a fresh run of the benchmark.  The weird spike for AVX2 is what led me
> down the 2-register path earlier.

Yes, that spike is weird, because it seems super-linear. However, the
more interesting question for me is: AVX2 isn't really buying much for
the numbers covered in this test. Between 32 and 48 elements, and
between 64 and 80, it's indistinguishable from SSE2. The jumps to the
next shelf are postponed, but the jumps are just as high. From earlier
system benchmarks, I recall it eventually wins out with hundreds of
elements, right? Is that still true?

Further, now that the algorithm is more SIMD-appropriate, I wonder
what doing 4 registers at a time is actually buying us for either SSE2
or AVX2. It might just be a matter of scale, but that would be good to
understand.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
> I'm much happier about v5-0001. With a small tweak it would match what
> I had in mind:
> 
> + if (nelem < nelem_per_iteration)
> + goto one_by_one;
> 
> If this were "<=" then for long arrays we could assume there is
> always more than one block, and wouldn't need to check if any elements
> remain -- first block, then a single loop and it's done.
> 
> The loop could also then be a "do while" since it doesn't have to
> check the exit condition up front.

Good idea.  That causes us to re-check all of the tail elements when the
number of elements is evenly divisible by nelem_per_iteration, but that
might be worth the trade-off.

> Yes, that spike is weird, because it seems super-linear. However, the
> more interesting question for me is: AVX2 isn't really buying much for
> the numbers covered in this test. Between 32 and 48 elements, and
> between 64 and 80, it's indistinguishable from SSE2. The jumps to the
> next shelf are postponed, but the jumps are just as high. From earlier
> system benchmarks, I recall it eventually wins out with hundreds of
> elements, right? Is that still true?

It does still eventually win, although not nearly to the same extent as
before.  I extended the benchmark a bit to show this.  I wouldn't be
devastated if we only got 0001 committed for v17, given these results.

> Further, now that the algorithm is more SIMD-appropriate, I wonder
> what doing 4 registers at a time is actually buying us for either SSE2
> or AVX2. It might just be a matter of scale, but that would be good to
> understand.

I'll follow up with these numbers shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Thu, Mar 21, 2024 at 12:09:44PM -0500, Nathan Bossart wrote:
> It does still eventually win, although not nearly to the same extent as
> before.  I extended the benchmark a bit to show this.  I wouldn't be
> devastated if we only got 0001 committed for v17, given these results.

(In case it isn't clear from the graph, after 128 elements, I only tested
at 200, 300, 400, etc. elements.)

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Thu, Mar 21, 2024 at 12:09:44PM -0500, Nathan Bossart wrote:
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:
>> Further, now that the algorithm is more SIMD-appropriate, I wonder
>> what doing 4 registers at a time is actually buying us for either SSE2
>> or AVX2. It might just be a matter of scale, but that would be good to
>> understand.
> 
> I'll follow up with these numbers shortly.

It looks like the 4-register code still outperforms the 2-register code,
except for a handful of cases where there aren't many elements.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
Here's a new version of 0001 with some added #ifdefs that cfbot revealed
were missing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Sun, Mar 24, 2024 at 03:53:17PM -0500, Nathan Bossart wrote:
> Here's a new version of 0001 with some added #ifdefs that cfbot revealed
> were missing.

Sorry for the noise.  cfbot revealed another silly mistake (forgetting to
reset the "i" variable in the assertion path).  That should be fixed in v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
John Naylor
Date:
On Fri, Mar 22, 2024 at 12:09 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:30:30AM +0700, John Naylor wrote:

> > If this were "<=" then for long arrays we could assume there is
> > always more than one block, and wouldn't need to check if any elements
> > remain -- first block, then a single loop and it's done.
> >
> > The loop could also then be a "do while" since it doesn't have to
> > check the exit condition up front.
>
> Good idea.  That causes us to re-check all of the tail elements when the
> number of elements is evenly divisible by nelem_per_iteration, but that
> might be worth the trade-off.

Yeah, if there's no easy way to avoid that it's probably fine. I
wonder if we can subtract one first to force even multiples to round
down, although I admit I haven't thought through the consequences of
that.

> [v8]

Seems pretty good. It'd be good to see the results of 2- vs.
4-register before committing, because that might lead to some
restructuring, but maybe it won't, and v8 is already an improvement
over HEAD.

/* Process the remaining elements one at a time. */

This now does all of them if that path is taken, so "remaining" can be removed.



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Mon, Mar 25, 2024 at 10:03:27AM +0700, John Naylor wrote:
> Seems pretty good. It'd be good to see the results of 2- vs.
> 4-register before committing, because that might lead to some
> restructuring, but maybe it won't, and v8 is already an improvement
> over HEAD.

I tested this the other day [0] (only for x86).  The results seemed to
indicate that the 4-register approach was still quite a bit better.

> /* Process the remaining elements one at a time. */
> 
> This now does all of them if that path is taken, so "remaining" can be removed.

Right, will do.

[0] https://postgr.es/m/20240321183823.GA1800896%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
Here is what I have staged for commit.  One notable difference in this
version of the patch is that I've changed

    +    if (nelem <= nelem_per_iteration)
    +        goto one_by_one;

to

    +    if (nelem < nelem_per_iteration)
    +        goto one_by_one;

I realized that there's no reason to jump to the one-by-one linear search
code when nelem == nelem_per_iteration, as the worst thing that will happen
is that we'll process all the elements twice if the value isn't present in
the array.  My benchmark that I've been using also shows a significant
speedup for this case with this change (on the order of 75%), which I
imagine might be due to a combination of branch prediction, caching, fewer
instructions, etc.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
I've committed v9, and I've marked the commitfest entry as "Committed,"
although we may want to revisit AVX2, etc. in the future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I've committed v9, and I've marked the commitfest entry as "Committed,"
> although we may want to revisit AVX2, etc. in the future.

A significant fraction of the buildfarm is issuing warnings about
this.

 adder         | 2024-03-26 21:04:33 | ../pgsql/src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 buri          | 2024-03-26 21:16:09 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cavefish      | 2024-03-26 22:53:23 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 cisticola     | 2024-03-26 22:20:07 | ../../../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 lancehead     | 2024-03-26 21:48:17 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 nicator       | 2024-03-26 21:08:14 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 nuthatch      | 2024-03-26 22:00:04 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]
 rinkhals      | 2024-03-26 19:51:32 | ../../src/include/port/pg_lfind.h:199:1: warning: unused label 'one_by_one' [-Wunused-label]
 siskin        | 2024-03-26 19:59:29 | ../../src/include/port/pg_lfind.h:199:1: warning: label 'one_by_one' defined but not used [-Wunused-label]

            regards, tom lane



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I've committed v9, and I've marked the commitfest entry as "Committed,"
>> although we may want to revisit AVX2, etc. in the future.
> 
> A significant fraction of the buildfarm is issuing warnings about
> this.

Thanks for the heads-up.  Will fix.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote:
> On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
>> A significant fraction of the buildfarm is issuing warnings about
>> this.
> 
> Thanks for the heads-up.  Will fix.

Done.  I'll keep an eye on the farm.

I just did the minimal fix for now, i.e., I moved the new label into the
SIMD section of the function.  I think it would be better stylistically to
move the one-by-one logic to an inline helper function, but I didn't do
that just in case it might negatively impact performance.  I'll look into
this and will follow up with another patch if it looks good.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: add AVX2 support to simd.h

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Mar 26, 2024 at 06:55:54PM -0500, Nathan Bossart wrote:
>> On Tue, Mar 26, 2024 at 07:28:24PM -0400, Tom Lane wrote:
>>> A significant fraction of the buildfarm is issuing warnings about
>>> this.

> Done.  I'll keep an eye on the farm.

Thanks.

> I just did the minimal fix for now, i.e., I moved the new label into the
> SIMD section of the function.  I think it would be better stylistically to
> move the one-by-one logic to an inline helper function, but I didn't do
> that just in case it might negatively impact performance.  I'll look into
> this and will follow up with another patch if it looks good.

Sounds like a plan.

            regards, tom lane



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Tue, Mar 26, 2024 at 09:48:57PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I just did the minimal fix for now, i.e., I moved the new label into the
>> SIMD section of the function.  I think it would be better stylistically to
>> move the one-by-one logic to an inline helper function, but I didn't do
>> that just in case it might negatively impact performance.  I'll look into
>> this and will follow up with another patch if it looks good.
> 
> Sounds like a plan.

Here's what I had in mind.  My usual benchmark seems to indicate that this
shouldn't impact performance.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> Here's what I had in mind.  My usual benchmark seems to indicate that this
> shouldn't impact performance.

Shouldn't "i" be declared uint32, since nelem is?

BTW, I wonder why these functions don't declare their array
arguments like "const uint32 *base".

LGTM otherwise, and I like the fact that the #if structure
gets a lot less messy.

            regards, tom lane



Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote:
> Shouldn't "i" be declared uint32, since nelem is?

Yes, that's a mistake.

> BTW, I wonder why these functions don't declare their array
> arguments like "const uint32 *base".

They probably should.  I don't see any reason not to, and my compiler
doesn't complain, either.
 
> LGTM otherwise, and I like the fact that the #if structure
> gets a lot less messy.

Thanks for reviewing.  I've attached a v2 that I intend to commit when I
get a chance.
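
For reference, a rough sketch of the shape of that helper with those
tweaks applied (the actual committed code may differ in naming and
details):

    /* scalar fallback, shared by the non-SIMD build and the short-array path */
    static inline bool
    pg_lfind32_one_by_one(const uint32 key, const uint32 *base, uint32 nelem)
    {
        for (uint32 i = 0; i < nelem; i++)
        {
            if (key == base[i])
                return true;
        }
        return false;
    }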

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: add AVX2 support to simd.h

From
Nathan Bossart
Date:
On Wed, Mar 27, 2024 at 04:37:35PM -0500, Nathan Bossart wrote:
> On Wed, Mar 27, 2024 at 05:10:13PM -0400, Tom Lane wrote:
>> LGTM otherwise, and I like the fact that the #if structure
>> gets a lot less messy.
> 
> Thanks for reviewing.  I've attached a v2 that I intend to commit when I
> get a chance.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com