Thread: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
There is this for loop in mul_var():
/*
 * Add the appropriate multiple of var2 into the accumulator.
 *
 * As above, digits of var2 can be ignored if they don't contribute,
 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
 */
for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
     i2 >= 0; i2--)
    dig[i--] += var1digit * var2digits[i2];

With gcc -O3, the above for loop, if simplified, gets auto-vectorized
[1], and this results in speedups for multiplication of PostgreSQL
numeric types having large precisions. The speedups become noticeable
from around precision 50 onwards: at precision 50 the improvement I
saw was 5%, at 60 it was 11%, at 120 50%, at 240 2.2x, and so on. On
my arm64 machine, a similar benefit starts showing up from precision
20 onwards. I used this query from regress/sql/numeric_big.sql:
SELECT t1.val * t2.val  FROM num_data t1, num_data t2
If I use the schema created by numeric_big.sql, the speedup was 2.5x
to 2.7x across three machines.

Also, the regress/sql/numeric_big test itself speeds up by 80%

For the for loop to be auto-vectorized, I had to simplify it to
something like this:
i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
digptr = &dig[i1 + 2];
for (i = 0; i <= i2; i++)
   digptr[i] += var1digit * var2digits[i];

gcc can also vectorize a backward loop such as this one:
for (i = n-1; i >= 0; i--)
   a += b[i];
gcc -fopt-info-all gives this info:
numeric.c:7217:3: optimized: loop vectorized using 16 byte vectors

But if the assignment is not as simple as the one above, gcc does not
vectorize the backward traversal:
i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
digptr = &dig[i1 + i2 + 2];
for (; i2 >= 0; i2--)
   digptr[i2] += var1digit * var2digits[i2];
numeric.c:7380:3: missed: couldn't vectorize loop
numeric.c:7381:15: missed: not vectorized: relevant stmt not
supported: _39 = *_38;

Even with forward traversal, the loop below can't be vectorized,
seemingly because it involves two index variables:
count = Min(var2ndigits - 1, res_ndigits - i1 - 3) + 1;
i = i1 + i2 - count + 3;
for (i2 = 0; i2 < count; i++, i2++)
   dig[i] += var1digit * var2digits[i2];
numeric.c:7394:3: missed: couldn't vectorize loop
numeric.c:7395:11: missed: not vectorized: not suitable for gather
load _37 = *_36;

So it's better to keep the loop simple:
i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
digptr = &dig[i1 + 2];
for (i = 0; i <= i2; i++)
   digptr[i] += var1digit * var2digits[i];
numeric.c:7387:3: optimized: loop vectorized using 16 byte vectors

Attached is the patch that uses the above loop.

With the patch, in the mul_var() assembly code I could see
multiply-accumulate instructions that operate on SIMD vectors (these
are arm64 instructions):
    smlal   v1.4s, v2.4h, v3.4h
    smlal2  v0.4s, v2.8h, v3.8h


I extracted the "SELECT t1.val * t2.val  FROM num_data t1, num_data
t2" query from regress/sql/numeric_big.sql and ran it on the data that
the test creates (it inserts values with precisions ranging from 500
to 700). Attached is create_schema.sql, which creates the regression
test schema.
With this query, below are the mul_var() profile figures with and
without the patch (all figures are from an -O3 build):

HEAD :

+   64.06%  postgres  postgres            [.] mul_var
+   13.00%  postgres  postgres            [.] get_str_from_var
+    6.32%  postgres  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+    1.65%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+    1.10%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock
+    0.96%  postgres  [kernel.kallsyms]   [k] get_page_from_freelist
+    0.73%  postgres  [kernel.kallsyms]   [k] page_counter_try_charge
+    0.64%  postgres  postgres            [.] AllocSetAlloc

Patched :

+   35.91%  postgres  postgres            [.] mul_var
+   20.43%  postgres  postgres            [.] get_str_from_var
+   13.01%  postgres  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
+    2.31%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
+    1.48%  postgres  [kernel.kallsyms]   [k] _raw_spin_lock
+    1.15%  postgres  [kernel.kallsyms]   [k] get_page_from_freelist
+    0.99%  postgres  postgres            [.] AllocSetAlloc
+    0.58%  postgres  postgres            [.] base_yyparse

Times in milliseconds for  SELECT t1.val * t2.val  FROM num_data t1,
num_data t2 :
Machine 1 (amd64)
Head    : .668 .723 .658 .660
Patched : .288 .280 .282 .282
Machine 2 (arm64)
Head    : .897 .879 .888 .897
Patched : .329 .324 .321 .320

Average times in milliseconds for numeric_big regression test :
Machine 1 (amd64)
Head    : 801
Patched : 445
Machine 2 (arm64)
Head    : 1105
Patched : 550


gcc -O3 option:

I understand we have kept the default gcc CFLAGS at -O2 because, I
believe, -O3 might expose bugs caused by some assumptions in the code.
But with this patch, builds done with the -O3 flag get this benefit.

The actual gcc option that enables auto-vectorization is
-ftree-loop-vectorize, which is implied by -O3. What we can do in the
future is keep such optimized code, proven to work with these
optimization flags, in a separate file, and enable the required
compiler flags only for that file when the build is done with -O2.

[1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using


-- 
Thanks,
-Amit Khandekar
Huawei Technologies


Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Peter Eisentraut
Date:
On 2020-06-09 13:50, Amit Khandekar wrote:
> Also, the regress/sql/numeric_big test itself speeds up by 80%

That's nice.  I can confirm the speedup:

-O3 without the patch:

      numeric                      ... ok          737 ms
test numeric_big                  ... ok         1014 ms

-O3 with the patch:

      numeric                      ... ok          680 ms
test numeric_big                  ... ok          580 ms

Also:

-O2 without the patch:

      numeric                      ... ok          693 ms
test numeric_big                  ... ok         1160 ms

-O2 with the patch:

      numeric                      ... ok          677 ms
test numeric_big                  ... ok          917 ms

So the patch helps either way.  But it also seems that without the 
patch, -O3 might be a bit slower in some cases.  This might need more 
testing.

> For the for loop to be auto-vectorized, I had to simplify it to
> something like this :

Well, how do we make sure we keep it that way?  How do we prevent some 
random rearranging of the code or some random compiler change to break 
this again?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-06-09 13:50, Amit Khandekar wrote:
> > Also, the regress/sql/numeric_big test itself speeds up by 80%
>
> That's nice.  I can confirm the speedup:
>
> -O3 without the patch:
>
>       numeric                      ... ok          737 ms
> test numeric_big                  ... ok         1014 ms
>
> -O3 with the patch:
>
>       numeric                      ... ok          680 ms
> test numeric_big                  ... ok          580 ms
>
> Also:
>
> -O2 without the patch:
>
>       numeric                      ... ok          693 ms
> test numeric_big                  ... ok         1160 ms
>
> -O2 with the patch:
>
>       numeric                      ... ok          677 ms
> test numeric_big                  ... ok          917 ms
>
> So the patch helps either way.

Oh, I hadn't observed that the patch speeds up numeric_big.sql to some
extent even with -O2. For me, it takes 805 ms on head and 732 ms with
the patch. One possible reason I can think of: because of the forward
loop traversal, prefetching might be helping. But this is just a wild
guess.

-O3 : HEAD
test numeric                      ... ok          102 ms
test numeric_big                  ... ok          803 ms

-O3 : patched :
test numeric                      ... ok          100 ms
test numeric_big                  ... ok          450 ms


-O2 : HEAD
test numeric                      ... ok          100 ms
test numeric_big                  ... ok          805 ms

-O2 patched :
test numeric                      ... ok          103 ms
test numeric_big                  ... ok          732 ms

> But it also seems that without the patch, -O3 might
> be a bit slower in some cases. This might need more testing.

For me, there is no observable change in the times with -O2 versus -O3
on head. Are you getting consistently slower numeric*.sql tests with
-O3 on the current code? I am not sure what the reason might be, but
that is not related to the patch. Or is it in the context of the patch
that you are suggesting more testing is needed? There might be
existing cases that run a bit slower with -O3, but that would be
strange; possibly the optimizations in those cases don't work out the
way the compiler expects, and in fact work negatively. That is
probably one of the reasons -O2 is the default choice.


>
> > For the for loop to be auto-vectorized, I had to simplify it to
> > something like this :
>
> Well, how do we make sure we keep it that way?  How do we prevent some
> random rearranging of the code or some random compiler change to break
> this again?

I believe the compiler rearranges code segments relative to one
another only when they are independent of each other, and I guess the
compiler is able to identify that. In our case, it's the for loop:
there are variables used inside it, and they would prevent the
compiler from moving it. Even if the compiler finds it safe to move
the loop relative to the surrounding code, it would not split the loop
body itself, so the for loop will remain intact, and so will the
vectorization (although the compiler seems to also keep an unrolled
version of the loop, in case it is called with smaller iteration
counts). But yes, if someone changes the for loop in the future, that
could break the auto-vectorization, so we should have appropriate
comments (the patch has those). Let me know if you can think of any
reason the compiler might stop vectorizing in the future even when the
for loop is unchanged.

It might be safer to move the for loop into an inline function, just
to play it safe?



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
FYI: this patch is registered in the July commitfest.



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Peter Eisentraut
Date:
On 2020-06-10 14:15, Amit Khandekar wrote:
>> Well, how do we make sure we keep it that way?  How do we prevent some
>> random rearranging of the code or some random compiler change to break
>> this again?
> I believe the compiler rearranges the code segments w.r.t. one another
> when those are independent of each other. I guess the compiler is able
> to identify that. With our case, it's the for loop. There are some
> variables used inside it, and that would prevent it from moving the
> for loop. Even if the compiler finds it safe to move relative to
> surrounding code, it would not spilt the for loop contents themselves,
> so the for loop will remain intact, and so would the vectorization,
> although it seems to keep an unrolled version of the loop in case it
> is called with smaller iteration counts. But yes, if someone in the
> future tries to change the for loop, it would possibly break the
> auto-vectorization. So, we should have appropriate comments (patch has
> those). Let me know if you find any possible reasons due to which the
> compiler would in the future stop the vectorization even when there is
> no change in the for loop.

We normally don't compile with -O3, so very few users would get the 
benefit of this.  We have CFLAGS_VECTOR for the checksum code.  I 
suppose if we are making the numeric code vectorizable as well, we 
should apply this there also.

I think we need a bit of a policy decision from the group here.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> We normally don't compile with -O3, so very few users would get the 
> benefit of this.

Yeah.  I don't think changing that baseline globally would be a wise move.

> We have CFLAGS_VECTOR for the checksum code.  I 
> suppose if we are making the numeric code vectorizable as well, we 
> should apply this there also.

> I think we need a bit of a policy decision from the group here.

I'd vote in favor of applying CFLAGS_VECTOR to specific source files
that can benefit.  We already have experience with that and we've not
detected any destabilization potential.

(I've not looked at this patch, so don't take this as a +1 for this
specific patch.)

            regards, tom lane



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
On Fri, 10 Jul 2020 at 19:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > We normally don't compile with -O3, so very few users would get the
> > benefit of this.
>
> Yeah.  I don't think changing that baseline globally would be a wise move.
>
> > We have CFLAGS_VECTOR for the checksum code.  I
> > suppose if we are making the numeric code vectorizable as well, we
> > should apply this there also.
>
> > I think we need a bit of a policy decision from the group here.
>
> I'd vote in favor of applying CFLAGS_VECTOR to specific source files
> that can benefit.  We already have experience with that and we've not
> detected any destabilization potential.

I tried this in utils/adt/Makefile:
+
+numeric.o: CFLAGS += ${CFLAGS_VECTOR}
+
and it works.

CFLAGS_VECTOR also includes the -funroll-loops option, which, I
believe, showed improvements in the checksum.c runs [1]. This option
makes the object file a bit bigger: numeric.o's size increased by
about 15K, from 116672 to 131360 bytes. I ran the multiplication test
and didn't see any additional speedup with this option; also, it does
not seem to be related to vectorization. So I was thinking of
splitting CFLAGS_VECTOR into CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS:
checksum.c can use both flags, and numeric.c only CFLAGS_VECTOR.
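
A sketch of how that split might look in Makefile terms (the variable
names follow the proposal; the configure wiring that sets them is
omitted, and the flag values assume gcc):

```make
# Hypothetical sketch: the two concerns separated.
CFLAGS_VECTOR       = -ftree-vectorize
CFLAGS_UNROLL_LOOPS = -funroll-loops

# checksum code would keep both options; numeric.c gets only vectorization
numeric.o: CFLAGS += $(CFLAGS_VECTOR)
```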

I was also wondering whether it's worth extracting only the code we
think can be optimized into a separate file (say numeric_vectorize.c
or adt_vectorize.c, with mul_var() to start with), using that file as
a collection of all such code in the adt module; we could then add
such files to other modules as and when needed. On the other hand,
numeric.c may already have scope for auto-vectorization in other
similar regions, in which case we don't need a separate
numeric_vectorize.c and can just pass the CFLAGS_VECTOR flag for this
file itself.


[1]
https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e

-- 
Thanks,
-Amit Khandekar
Huawei Technologies



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which I
> believe, had showed improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, it's size
> increased by 15K; from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

As with the existing CFLAGS_VECTOR, an environment variable of the
same name can also be set when running configure; I did the same for
CFLAGS_UNROLL_LOOPS.

Developers who already set the CFLAGS_VECTOR environment variable when
configuring might be doing so because their compilers don't have these
options, so they must be using some equivalent compiler options.
numeric.c will now be compiled with CFLAGS_VECTOR, so for them it will
be compiled with their equivalent of the vectorize and unroll-loops
options, which is OK, I think; the only effect is that the numeric.o
size will increase.

>
> [1]
https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e




-- 
Thanks,
-Amit Khandekar
Huawei Technologies


Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Tom Lane
Date:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> I did as above. Attached is the v2 patch.

I made some cosmetic changes to this and committed it.  AFAICT,
there's no measurable performance change to the "numeric" regression
test, but I got a solid 45% speedup on "numeric_big", so it's
clearly a win for wider arithmetic cases.

It seemed to me to be useful to actually rename CFLAGS_VECTOR
if we're changing its meaning, so I made it CFLAGS_VECTORIZE.

            regards, tom lane



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Tom Lane
Date:
I wrote:
> I made some cosmetic changes to this and committed it.

BTW, poking at this further, it seems that the patch only really
works for gcc.  clang accepts the -ftree-vectorize switch, but
looking at the generated asm shows that it does nothing useful.
Which is odd, because clang does do loop vectorization.

I tried adding -Rpass-analysis=loop-vectorize and got

numeric.c:8341:3: remark: loop not vectorized: could not determine number of loop iterations
[-Rpass-analysis=loop-vectorize]
                for (i2 = 0; i2 <= i; i2++)
                ^

which is interesting but I don't know how to proceed further.

            regards, tom lane



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
On Mon, 7 Sep 2020 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I made some cosmetic changes to this and committed it.

Thanks!

>
> BTW, poking at this further, it seems that the patch only really
> works for gcc.  clang accepts the -ftree-vectorize switch, but
> looking at the generated asm shows that it does nothing useful.
> Which is odd, because clang does do loop vectorization.
>
> I tried adding -Rpass-analysis=loop-vectorize and got
>
> numeric.c:8341:3: remark: loop not vectorized: could not determine number of loop iterations
[-Rpass-analysis=loop-vectorize]
>                 for (i2 = 0; i2 <= i; i2++)

Hmm, yeah, that's unfortunate. My guess is that the compiler would do
the vectorization only if 'i' were a constant, which is not true in
our case.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Tom Lane
Date:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> On Mon, 7 Sep 2020 at 11:23, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, poking at this further, it seems that the patch only really
>> works for gcc.  clang accepts the -ftree-vectorize switch, but
>> looking at the generated asm shows that it does nothing useful.
>> Which is odd, because clang does do loop vectorization.

> Hmm, yeah that's unfortunate. My guess is that the compiler would do
> vectorization only if 'i' is a constant, which is not true for our
> case.

No, they claim to handle variable trip counts, per

https://llvm.org/docs/Vectorizers.html#loops-with-unknown-trip-count

I experimented with a few different ideas such as adding restrict
decoration to the pointers, and eventually found that what works
is to write the loop termination condition as "i2 < limit"
rather than "i2 <= limit".  It took me a long time to think of
trying that, because it seemed ridiculously stupid.  But it works.

            regards, tom lane



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Tom Lane
Date:
I wrote:
> I experimented with a few different ideas such as adding restrict
> decoration to the pointers, and eventually found that what works
> is to write the loop termination condition as "i2 < limit"
> rather than "i2 <= limit".  It took me a long time to think of
> trying that, because it seemed ridiculously stupid.  But it works.

I've done more testing and confirmed that both gcc and clang can
vectorize the improved loop on aarch64 as well as x86_64.  (clang's
results can be confusing because -ftree-vectorize doesn't seem to
have any effect: its vectorizer is on by default.  But if you use
-fno-vectorize it'll go back to the old, slower code.)

The only buildfarm effect I've noticed is that locust and
prairiedog, which are using nearly the same ancient gcc version,
complain

c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is used.

which is expected (they say the same for checksum.c), but then
there are a bunch of

warning: dereferencing type-punned pointer will break strict-aliasing rules

which seems worrisome.  (This sort of thing is the reason I'm
hesitant to apply higher optimization levels across the board.)
Both animals pass the regression tests anyway, but if any other
compilers treat -ftree-vectorize as an excuse to apply stricter
optimization assumptions, we could be in for trouble.

I looked closer and saw that all of those warnings are about
init_var(), and this change makes them go away:

-#define init_var(v)        MemSetAligned(v, 0, sizeof(NumericVar))
+#define init_var(v)        memset(v, 0, sizeof(NumericVar))

I'm a little inclined to commit that as future-proofing.  It's
essentially reversing out a micro-optimization I made in d72f6c750.
I doubt I had hard evidence that it made any noticeable difference;
and even if it did back then, modern compilers probably prefer the
memset approach.

            regards, tom lane



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Amit Khandekar
Date:
On Tue, 8 Sep 2020 at 02:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > I experimented with a few different ideas such as adding restrict
> > decoration to the pointers, and eventually found that what works
> > is to write the loop termination condition as "i2 < limit"
> > rather than "i2 <= limit".  It took me a long time to think of
> > trying that, because it seemed ridiculously stupid.  But it works.

Ah ok.

I checked the "Auto-Vectorization in LLVM" link that you shared. All
the examples use "< n" or "> n". None of them use "<= n". Looks like a
hidden restriction.

>
> I've done more testing and confirmed that both gcc and clang can
> vectorize the improved loop on aarch64 as well as x86_64.  (clang's
> results can be confusing because -ftree-vectorize doesn't seem to
> have any effect: its vectorizer is on by default.  But if you use
> -fno-vectorize it'll go back to the old, slower code.)
>
> The only buildfarm effect I've noticed is that locust and
> prairiedog, which are using nearly the same ancient gcc version,
> complain
>
> c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is ignored when Auto Vectorization is
used.
>
> which is expected (they say the same for checksum.c), but then
> there are a bunch of
>
> warning: dereferencing type-punned pointer will break strict-aliasing rules
>
> which seems worrisome.  (This sort of thing is the reason I'm
> hesitant to apply higher optimization levels across the board.)
> Both animals pass the regression tests anyway, but if any other
> compilers treat -ftree-vectorize as an excuse to apply stricter
> optimization assumptions, we could be in for trouble.
>
> I looked closer and saw that all of those warnings are about
> init_var(), and this change makes them go away:
>
> -#define init_var(v)        MemSetAligned(v, 0, sizeof(NumericVar))
> +#define init_var(v)        memset(v, 0, sizeof(NumericVar))
>
> I'm a little inclined to commit that as future-proofing.  It's
> essentially reversing out a micro-optimization I made in d72f6c750.
> I doubt I had hard evidence that it made any noticeable difference;
> and even if it did back then, modern compilers probably prefer the
> memset approach.

Thanks. I must admit it did not occur to me that I could very well
have installed clang on my Linux machine and tried compiling this
file, or tested with some older gcc versions. I think I was using gcc
8. Do you know which gcc version gave these warnings?

-- 
Thanks,
-Amit Khandekar
Huawei Technologies



Re: Auto-vectorization speeds up multiplication of large-precision numerics

From
Tom Lane
Date:
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> Thanks. I must admit it did not occur to me that I could have very
> well installed clang on my linux machine and tried compiling this
> file, or tested with some older gcc versions. I think I was using gcc
> 8. Do you know what was the gcc compiler version that gave these
> warnings ?

Per the buildfarm's configure logs, prairiedog is using

configure: using compiler=powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5341)

IIRC, locust has a newer build number but it's the same underlying gcc
version.

            regards, tom lane