Thread: LLVM 16 (opaque pointers)

LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
Hi,

Here is a draft version of the long awaited patch to support LLVM 16.
It's mostly mechanical donkeywork, but it took some time: this donkey
found it quite hard to understand the mighty getelementptr
instruction[1] and the code generation well enough to figure out all
the right types, and small mistakes took some debugging effort*.  I
now finally have a patch that passes all tests.

Though it's not quite ready yet, I thought I should give this status
update to report that the main task is more or less complete, since
we're starting to get quite a few emails about it (mostly from Fedora
users) and there is an entry for it on the Open Items for 16 wiki
page.  Comments/review/testing welcome.

Here are some things I think I need to do next (probably after PGCon):

1. If you use non-matching clang and LLVM versions I think we might
use "clang -no-opaque-pointers" at the wrong times (I've not looked
into that interaction yet).
2. The treatment of function types is a bit inconsistent/messy and
could be tidied up.
3. There are quite a lot of extra function calls that could perhaps be
elided (ie type variables instead of LLVMTypeInt8(), and calls to
LLVMStructGetTypeAtIndex() that are not used in LLVM < 16).
4. Could use some comments.
5. I need to test with very old versions of LLVM and Clang that we
claim to support (I have several years' worth of releases around but
nothing older than 9).
6. I need to go through the types again with a fine tooth comb, and
check the test coverage to look out for eg GEP array arithmetic with
the wrong type/size that isn't being exercised.

*For anyone working with this type of IR generation code and
questioning their sanity, I can pass on some excellent advice I got
from Andres: build LLVM yourself with assertions enabled, as they
catch some classes of silly mistake that otherwise just segfault
inscrutably on execution.

[1] https://llvm.org/docs/GetElementPtr.html

Attachment

Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
Oh, one important thing I forgot to mention: that patch is for LLVM 16
only, and I developed it with a local build of their "release/16.x"
branch on a FreeBSD box, and also tested with a released package for
16 on a Debian box.  Further changes are already needed for their
"main" branch (LLVM 17-to-be), so this won't quite be enough to shut
seawasp up.  At a glance, we will need to change from the "old pass
manager" API that has recently been vaporised[1]
(llvm-c/Transforms/PassManagerBuilder.h) to the new one[2][3]
(llvm-c/Transforms/PassBuilder.h), which I suspect/hope will be as
simple as changing llvmjit.c to call LLVMRunPasses() with a string
describing the passes we want in "opt -passes" format, instead of our
code that calls LLVMAddFunctionInlingPass() etc.  But that'll be a
topic for another day, and another thread.

[1] https://github.com/llvm/llvm-project/commit/0aac9a2875bad4f065367e4a6553fad78605f895
[2] https://llvm.org/docs/NewPassManager.html
[3] https://reviews.llvm.org/D102136



Re: LLVM 16 (opaque pointers)

From
Dmitry Dolgov
Date:
> On Mon, May 22, 2023 at 03:38:44PM +1200, Thomas Munro wrote:
> Further changes are already needed for their "main" branch (LLVM
> 17-to-be), so this won't quite be enough to shut seawasp up.  At a
> glance, we will need to change from the "old pass manager" API that
> has recently been vaporised[1]
> (llvm-c/Transforms/PassManagerBuilder.h) to the new one[2][3]
> (llvm-c/Transforms/PassBuilder.h), which I suspect/hope will be as
> simple as changing llvmjit.c to call LLVMRunPasses() with a string
> describing the passes we want in "opt -passes" format, instead of our
> code that calls LLVMAddFunctionInlingPass() etc.  But that'll be a
> topic for another day, and another thread.
>
> [1] https://github.com/llvm/llvm-project/commit/0aac9a2875bad4f065367e4a6553fad78605f895
> [2] https://llvm.org/docs/NewPassManager.html
> [3] https://reviews.llvm.org/D102136

Thanks for tackling the topic! I've tested it with a couple of versions,
LLVM 12 that comes with my Gentoo box, LLVM 15 build from sources and
the modified version of patch adopted for LLVM 17 (build form sources as
well). In all three cases everything seems to be working fine.

Simple benchmarking with a query stolen from some other jit thread
(pgbench running single client with multiple unions of selects a-la
SELECT a, count(*), sum(b) FROM test WHERE c = 2 GROUP BY a) show some
slight performance differences, but nothing dramatic so far. LLVM 17
version produces the lowest latency, with faster generation, inlining
and optimization, but slower emission time. LLVM 12 version produces the
largest latencies with everything except emission timings being slower.
LLVM 15 is somewhere in between.

I'll continue reviewing and, for the records, attach adjustments I was
using for LLVM 17 (purely for testing, not taking into account other
versions), in case if I've missed something.

Attachment

Re: LLVM 16 (opaque pointers)

From
Ronan Dunklau
Date:
Le dimanche 21 mai 2023, 05:01:41 CEST Thomas Munro a écrit :
> Hi,
>
> Here is a draft version of the long awaited patch to support LLVM 16.
> It's mostly mechanical donkeywork, but it took some time: this donkey
> found it quite hard to understand the mighty getelementptr
> instruction[1] and the code generation well enough to figure out all
> the right types, and small mistakes took some debugging effort*.  I
> now finally have a patch that passes all tests.
>
> Though it's not quite ready yet, I thought I should give this status
> update to report that the main task is more or less complete, since
> we're starting to get quite a few emails about it (mostly from Fedora
> users) and there is an entry for it on the Open Items for 16 wiki
> page.  Comments/review/testing welcome.

Hello Thomas,

Thank you for this effort !

I've tested it against llvm 15 and 16, and found no problem with it.

> 6. I need to go through the types again with a fine tooth comb, and
> check the test coverage to look out for eg GEP array arithmetic with
> the wrong type/size that isn't being exercised.

I haven't gone through the test coverage myself, but I exercised the following
things:

 - running make installcheck with jit_above_cost = 0
 - letting sqlsmith hammer random queries at it for a few hours.

This didn't show obvious issues.

> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> > inscrutably on execution.

I tried my hand at backporting it to previous versions, and not knowing
anything about it made me indeed question my sanity.  It's quite easy for PG
15, 14, 13. PG 12 is nothing insurmontable either, but PG 11 is a bit hairier
most notably due to to the change in fcinfo args representation. But I guess
that's also a topic for another day :-)

Best regards,

--
Ronan Dunklau







Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-08-10 16:56:54 +0200, Ronan Dunklau wrote:
> I tried my hand at backporting it to previous versions, and not knowing
> anything about it made me indeed question my sanity.  It's quite easy for PG
> 15, 14, 13. PG 12 is nothing insurmontable either, but PG 11 is a bit hairier
> most notably due to to the change in fcinfo args representation. But I guess
> that's also a topic for another day :-)

Given that 11 is about to be EOL, I don't think it's worth spending the time
to support a new LLVM version for it.

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-05-21 15:01:41 +1200, Thomas Munro wrote:
> *For anyone working with this type of IR generation code and
> questioning their sanity, I can pass on some excellent advice I got
> from Andres: build LLVM yourself with assertions enabled, as they
> catch some classes of silly mistake that otherwise just segfault
> inscrutably on execution.

Hm. I think we need a buildfarm animal with an assertion enabled llvm 16 once
we merge this. I think after an upgrade my buildfarm machine has the necessary
resources.


> @@ -150,7 +150,7 @@ llvm_compile_expr(ExprState *state)
>  
>      /* create function */
>      eval_fn = LLVMAddFunction(mod, funcname,
> -                              llvm_pg_var_func_type("TypeExprStateEvalFunc"));
> +                              llvm_pg_var_func_type("ExecInterpExprStillValid"));

Hm, that's a bit ugly. But ...

> @@ -77,9 +80,44 @@ extern Datum AttributeTemplate(PG_FUNCTION_ARGS);
>  Datum
>  AttributeTemplate(PG_FUNCTION_ARGS)
>  {
> +    PGFunction    fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +    fp = &AttributeTemplate;
>      PG_RETURN_NULL();
>  }

Other parts of the file do this by putting the functions into
referenced_functions[], i'd copy that here and below.

> +void
> +ExecEvalSubroutineTemplate(ExprState *state,
> +                           struct ExprEvalStep *op,
> +                           ExprContext *econtext)
> +{
> +    ExecEvalSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +    fp = &ExecEvalSubroutineTemplate;
> +}
> +
> +extern bool ExecEvalBoolSubroutineTemplate(ExprState *state,
> +                                           struct ExprEvalStep *op,
> +                                           ExprContext *econtext);
> +bool
> +ExecEvalBoolSubroutineTemplate(ExprState *state,
> +                               struct ExprEvalStep *op,
> +                               ExprContext *econtext)
> +{
> +    ExecEvalBoolSubroutine fp PG_USED_FOR_ASSERTS_ONLY;
> +
> +    fp = &ExecEvalBoolSubroutineTemplate;
> +    return false;
> +}
> +


Thanks for working on this!

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Fabien COELHO
Date:
> [...] Further changes are already needed for their "main" branch (LLVM 
> 17-to-be), so this won't quite be enough to shut seawasp up.

For information, the physical server which was hosting my 2 bf animals 
(seawasp and moonjelly) has given up rebooting after a power cut a few 
weeks/months ago, and I have not setup a replacement (yet).

-- 
Fabien.



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
Belated thanks Dmitry, Ronan, Andres for your feedback.  Here's a new
version, also including Dmitry's patch for 17 which it is now also
time to push.  It required a bit more trivial #if magic to be
conditional, as Dmitry already mentioned.  I just noticed that Dmitry
had the LLVMPassBuilderOptionsSetInlinerThreshold() function added to
LLVM 17's C API for this patch.  Thanks!  (Better than putting stuff
in llvmjit_wrap.c, if you can get it upstreamed in time.)

I thought I needed to block users from building with too-old clang and
too-new LLVM, but I haven't managed to find a combination that
actually breaks anything.  I wouldn't recommend it, but for example
clang 10 bitcode seems to be inlinable without problems by LLVM 16 on
my system (I didn't use an assert build of LLVM though).  I think that
could be a separate adjustment if we learn that we need to enforce or
document a constraint there.

So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
branch) against PostgreSQL versions 14, 15, 16.  I've attached the
versions that apply to master and 16, and pushed back-patches to 14
and 15 to public branches if anyone's interested[1].  Back-patching
further seems a bit harder.  I'm quite willing to do it, but ... do we
actually need to, ie does anyone really *require* old PostgreSQL
release branches to work with new LLVM?

(I'll start a separate thread about the related question of when we
get to drop support for old LLVMs.)

One point from an earlier email:

On Sat, Aug 12, 2023 at 6:09 AM Andres Freund <andres@anarazel.de> wrote:
> >  AttributeTemplate(PG_FUNCTION_ARGS)
> >  {
> > +     PGFunction      fp PG_USED_FOR_ASSERTS_ONLY;
> > +
> > +     fp = &AttributeTemplate;

> Other parts of the file do this by putting the functions into
> referenced_functions[], i'd copy that here and below.

Actually here I just wanted to assert that the 3 template functions
match certain function pointer types.  To restate what these functions
are about:  in the JIT code I need the function type, but we have only
the function pointer type, and it is now impossible to go from a
function pointer type to a function type, so I needed to define some
example functions with the right prototype (well, one of them existed
already but I needed more), and then I wanted to assert that they are
assignable to the appropriate function pointer types.  Does that make
sense?

In this version I changed it to what I hope is a more obvious/explicit
expression of that goal:

+       AssertVariableIsOfType(&ExecEvalSubroutineTemplate,
+                              ExecEvalSubroutine);

[1] https://github.com/macdice/postgres/tree/llvm16-14 and -15

Attachment

Re: LLVM 16 (opaque pointers)

From
Devrim Gündüz
Date:
Hi Thomas,

On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote:
> So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
> branch) against PostgreSQL versions 14, 15, 16.  I've attached the
> versions that apply to master and 16, and pushed back-patches to 14
> and 15 to public branches if anyone's interested[1].  Back-patching
> further seems a bit harder.  I'm quite willing to do it, but ... do we
> actually need to, ie does anyone really *require* old PostgreSQL
> release branches to work with new LLVM?

RHEL releases new LLVM version along with their new minor releases every
6 month, and we have to build older versions with new LLVM each time.
From RHEL point of view, it would be great if we can back-patch back to
v12 :(

Regards,
--
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Thu, Sep 21, 2023 at 12:24 PM Devrim Gündüz <devrim@gunduz.org> wrote:
> On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote:
> > So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main
> > branch) against PostgreSQL versions 14, 15, 16.  I've attached the
> > versions that apply to master and 16, and pushed back-patches to 14
> > and 15 to public branches if anyone's interested[1].  Back-patching
> > further seems a bit harder.  I'm quite willing to do it, but ... do we
> > actually need to, ie does anyone really *require* old PostgreSQL
> > release branches to work with new LLVM?
>
> RHEL releases new LLVM version along with their new minor releases every
> 6 month, and we have to build older versions with new LLVM each time.
> From RHEL point of view, it would be great if we can back-patch back to
> v12 :(

Got it.  OK, I'll work on 12 and 13 now.



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Thu, Sep 21, 2023 at 12:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 21, 2023 at 12:24 PM Devrim Gündüz <devrim@gunduz.org> wrote:
> > RHEL releases new LLVM version along with their new minor releases every
> > 6 month, and we have to build older versions with new LLVM each time.
> > From RHEL point of view, it would be great if we can back-patch back to
> > v12 :(
>
> Got it.  OK, I'll work on 12 and 13 now.

The back-patch to 12 was a little trickier than anticipated, but after
taking a break and trying again I now have PG 12...17 patches that
I've tested against LLVM 10...18 (that's 54 combinations), in every
case only with the clang corresponding to LLVM.

For 12, I decided to back-patch the llvm_types_module variable that
was introduced in 13, to keep the code more similar.

For master, I had to rebase over Daniel's recent commits, which
required re-adding unused variables removed by 2dad308e, and
then changing a bunch of LLVM type constructors like LLVMInt8Type() to
the LLVMInt8TypeInContext(lc, ...) variants following the example of
9dce2203.  Without that, type assertions in my LLVM 18 debug build
would fail (and maybe there could be a leak problem, though I'm not
sure that really applied to integer (non-struct) types).

I've attached only the patches for master, but the 12-16 versions are
available at https://github.com/macdice/postgres/tree/llvm16-$N in
case anyone has comments on those.

Attachment

Re: LLVM 16 (opaque pointers)

From
Ronan Dunklau
Date:
Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit :
> The back-patch to 12 was a little trickier than anticipated, but after
> taking a break and trying again I now have PG 12...17 patches that
> I've tested against LLVM 10...18 (that's 54 combinations), in every
> case only with the clang corresponding to LLVM.

Thank you Thomas for those patches, and the extensive testing, I will run my
own and let you know.

> I've attached only the patches for master, but the 12-16 versions are
> available at https://github.com/macdice/postgres/tree/llvm16-$N in
> case anyone has comments on those.

For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not
used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not
mistaken.

Best regards,

--
Ronan Dunklau






Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-11 21:59:50 +1300, Thomas Munro wrote:
> +#else
> +    LLVMPassBuilderOptionsRef options;
> +    LLVMErrorRef err;
> +    int            compile_optlevel;
> +    char       *passes;
> +
> +    if (context->base.flags & PGJIT_OPT3)
> +        compile_optlevel = 3;
> +    else
> +        compile_optlevel = 0;
> +
> +    passes = psprintf("default<O%d>,mem2reg,function(no-op-function),no-op-module",
> +                      compile_optlevel);

I don't think the "function(no-op-function),no-op-module" bit does something
particularly useful?

I also don't think we should add the mem2reg pass outside of -O0 - running it
after a real optimization pipeline doesn't seem useful and might even make the
code worse? mem2reg is included in default<O1> (and obviously also in O3).


Thanks for working on this stuff!


I'm working on setting up buildfarm animals for 16, 17, each once with
a normal and an assertion enabled LLVM build.

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Dmitry Dolgov
Date:
> On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote:
> Hi,
>
> On 2023-10-11 21:59:50 +1300, Thomas Munro wrote:
> > +#else
> > +    LLVMPassBuilderOptionsRef options;
> > +    LLVMErrorRef err;
> > +    int            compile_optlevel;
> > +    char       *passes;
> > +
> > +    if (context->base.flags & PGJIT_OPT3)
> > +        compile_optlevel = 3;
> > +    else
> > +        compile_optlevel = 0;
> > +
> > +    passes = psprintf("default<O%d>,mem2reg,function(no-op-function),no-op-module",
> > +                      compile_optlevel);
>
> I don't think the "function(no-op-function),no-op-module" bit does something
> particularly useful?

Right, looks like leftovers after verifying which passes were actually
applied. My bad, could be removed.

> I also don't think we should add the mem2reg pass outside of -O0 - running it
> after a real optimization pipeline doesn't seem useful and might even make the
> code worse? mem2reg is included in default<O1> (and obviously also in O3).

My understanding was that while mem2reg is included everywhere above
-O0, this set of passes won't hurt. But yeah, if you say it could
degrade the final result, it's better to not do this. I'll update this
part.



Re: LLVM 16 (opaque pointers)

From
Dmitry Dolgov
Date:
> On Fri, Oct 13, 2023 at 11:06:21AM +0200, Dmitry Dolgov wrote:
> > On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote:
> > I don't think the "function(no-op-function),no-op-module" bit does something
> > particularly useful?
>
> Right, looks like leftovers after verifying which passes were actually
> applied. My bad, could be removed.
>
> > I also don't think we should add the mem2reg pass outside of -O0 - running it
> > after a real optimization pipeline doesn't seem useful and might even make the
> > code worse? mem2reg is included in default<O1> (and obviously also in O3).
>
> My understanding was that while mem2reg is included everywhere above
> -O0, this set of passes won't hurt. But yeah, if you say it could
> degrade the final result, it's better to not do this. I'll update this
> part.

Here is what I had in mind (only this part in the second patch was changed).

Attachment

Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-13 11:06:21 +0200, Dmitry Dolgov wrote:
> > On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote:
> > I also don't think we should add the mem2reg pass outside of -O0 - running it
> > after a real optimization pipeline doesn't seem useful and might even make the
> > code worse? mem2reg is included in default<O1> (and obviously also in O3).
> 
> My understanding was that while mem2reg is included everywhere above
> -O0, this set of passes won't hurt. But yeah, if you say it could
> degrade the final result, it's better to not do this. I'll update this
> part.

It's indeed included anywhere above that, but adding it explicitly to the
schedule means it's excuted twice:

echo 'int foo(int a) { return a / 343; }' | clang-16 -emit-llvm -x c -c -o - -S -|sed -e 's/optnone//'|opt-17
-debug-pass-manager-passes='default<O1>,mem2reg' -o /dev/null 2>&1|grep Promote
 
Running pass: PromotePass on foo (2 instructions)
Running pass: PromotePass on foo (2 instructions)

The second one is in a point in the pipeline where it doesn't help. It also
requires another analysis pass to be executed unnecessarily.

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
On 2023-10-13 16:44:13 +0200, Dmitry Dolgov wrote:
> Here is what I had in mind (only this part in the second patch was changed).

Makes sense to me. I think we'll likely eventually want to use a custom
pipeline anyway, and I think we should consider using an optimization level
inbetween "not at all" "as hard as possible"...

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote:
> Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit :
> > The back-patch to 12 was a little trickier than anticipated, but after
> > taking a break and trying again I now have PG 12...17 patches that
> > I've tested against LLVM 10...18 (that's 54 combinations), in every
> > case only with the clang corresponding to LLVM.
>
> Thank you Thomas for those patches, and the extensive testing, I will run my
> own and let you know.

Thanks!  No news is good news, I hope?  I'm hoping to commit this today.

> > I've attached only the patches for master, but the 12-16 versions are
> > available at https://github.com/macdice/postgres/tree/llvm16-$N in
> > case anyone has comments on those.
>
> For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not
> used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not
> mistaken.

Right, looks like I can remove that in those branches.



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sat, Oct 14, 2023 at 3:56 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-10-13 16:44:13 +0200, Dmitry Dolgov wrote:
> > Here is what I had in mind (only this part in the second patch was changed).
>
> Makes sense to me. I think we'll likely eventually want to use a custom
> pipeline anyway, and I think we should consider using an optimization level
> inbetween "not at all" "as hard as possible"...

Thanks Dmitry and Andres.  I'm planning to commit these today if there
are no further comments.



Re: LLVM 16 (opaque pointers)

From
Ronan Dunklau
Date:
Le vendredi 13 octobre 2023, 22:32:13 CEST Thomas Munro a écrit :
> On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau <ronan.dunklau@aiven.io>
wrote:
> > Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit :
> > > The back-patch to 12 was a little trickier than anticipated, but after
> > > taking a break and trying again I now have PG 12...17 patches that
> > > I've tested against LLVM 10...18 (that's 54 combinations), in every
> > > case only with the clang corresponding to LLVM.
> >
> > Thank you Thomas for those patches, and the extensive testing, I will run
> > my own and let you know.
>
> Thanks!  No news is good news, I hope?  I'm hoping to commit this today.
>
> > > I've attached only the patches for master, but the 12-16 versions are
> > > available at https://github.com/macdice/postgres/tree/llvm16-$N in
> > > case anyone has comments on those.
> >
> > For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not
> > used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not
> > mistaken.
>
> Right, looks like I can remove that in those branches.

Oh sorry I thought I followed up. I ran the same stress testing involving
several hours of sqlsmith with all jit costs set to zero and didn't notice
anything with LLVM16.

Thank you !

--
Ronan Dunklau






Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
I pushed the first patch, for LLVM 16, and the build farm told me that
some old LLVM versions don't like it.  The problem seems to be the
function LLVMGlobalGetValueType().  I can see that that was only added
to the C API in 2018, so it looks like I may need to back-port that
(trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8.



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> I pushed the first patch, for LLVM 16, and the build farm told me that
> some old LLVM versions don't like it.  The problem seems to be the
> function LLVMGlobalGetValueType().  I can see that that was only added
> to the C API in 2018, so it looks like I may need to back-port that
> (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8.

Concretely something like the attached should probably fix it, but
it'll take me a little while to confirm that...

Attachment

Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Thu, Oct 19, 2023 at 1:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I pushed the first patch, for LLVM 16, and the build farm told me that
> > some old LLVM versions don't like it.  The problem seems to be the
> > function LLVMGlobalGetValueType().  I can see that that was only added
> > to the C API in 2018, so it looks like I may need to back-port that
> > (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8.
>
> Concretely something like the attached should probably fix it, but
> it'll take me a little while to confirm that...

Pushed, after digging up some old LLVM skeletons to test, and those
"old LLVM" animals are turning green now.  I went ahead and pushed the
much smaller and simpler patch for LLVM 17.

Interestingly, a new problem just showed up on the the RHEL9 s390x
machine "lora", where a previously reported problem [1] apparently
re-appeared.  It complains about incompatible layout, previously
blamed on mismatch between clang and LLVM versions.  I can see that
its clang is v15 from clues in the conflig log, but I don't know which
version of LLVM is being used.  However, I see now that --with-llvm
was literally just turned on, so there is no reason to think that this
would have worked before or this work is relevant.  Strange though --
we must be able to JIT further than that on s390x because we have
crash reports in other threads (ie we made it past this and into other
more advanced brokenness).

[1]
https://www.postgresql.org/message-id/flat/20210319190047.7o4bwhbp5dzkqif3%40alap3.anarazel.de#ec51b488ca8eac8c603d91c0439d38b2



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-19 06:20:26 +1300, Thomas Munro wrote:
> On Thu, Oct 19, 2023 at 1:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > I pushed the first patch, for LLVM 16, and the build farm told me that
> > > some old LLVM versions don't like it.  The problem seems to be the
> > > function LLVMGlobalGetValueType().  I can see that that was only added
> > > to the C API in 2018, so it looks like I may need to back-port that
> > > (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8.
> >
> > Concretely something like the attached should probably fix it, but
> > it'll take me a little while to confirm that...
> 
> Pushed, after digging up some old LLVM skeletons to test, and those
> "old LLVM" animals are turning green now.  I went ahead and pushed the
> much smaller and simpler patch for LLVM 17.

I enabled a new set of buildfarm animals to test LLVM 16 and 17. I initially
forgot to disable them for 11, which means we'll have those failed build on 11
until they age out :/.

Particularly for the LLVM debug builds it'll take a fair bit to run on all
branches. Each branch takes about 3h.

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Devrim Gündüz
Date:
Hi,

On Thu, 2023-10-19 at 06:20 +1300, Thomas Munro wrote:
> Pushed, after digging up some old LLVM skeletons to test, and those
> "old LLVM" animals are turning green now.  I went ahead and pushed the
> much smaller and simpler patch for LLVM 17.

Thank you! I can confirm that RPMs built fine on Fedora 39 with those
patches, which ships LLVM 17.0.2 as of today.

I can also confirm that builds are not broken on RHEL 9 and 8 and Fedora
37 which ship LLVM 15, and Fedora 38 (LLVM 16).

Thanks again!

Regards,
--
Devrim Gündüz
Open Source Solution Architect, PostgreSQL Major Contributor
Twitter: @DevrimGunduz , @DevrimGunduzTR



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Interestingly, a new problem just showed up on the the RHEL9 s390x
> machine "lora", where a previously reported problem [1] apparently
> re-appeared.  It complains about incompatible layout, previously
> blamed on mismatch between clang and LLVM versions.  I can see that
> its clang is v15 from clues in the conflig log, but I don't know which
> version of LLVM is being used.  However, I see now that --with-llvm
> was literally just turned on, so there is no reason to think that this
> would have worked before or this work is relevant.  Strange though --
> we must be able to JIT further than that on s390x because we have
> crash reports in other threads (ie we made it past this and into other
> more advanced brokenness).

I see that Mark has also just enabled --with-llvm on some POWER Linux
animals, and they have failed in various ways.  The failures are
strangely lacking in detail.  It seems we didn't have coverage before,
and I recall that there were definitely versions of LLVM that *didn't*
work for our usage in the past, which I'll need to dredge out of the
archives.  I will try to get onto a cfarm POWER machine and see if I
can reproduce that, before and after these commits, and whose bug is
it etc.

I doubt I can get anywhere near an s390x though, and we definitely had
pre-existing problems on that arch.



Re: LLVM 16 (opaque pointers)

From
Mark Wong
Date:
On Sat, Oct 21, 2023 at 10:48:47AM +1300, Thomas Munro wrote:
> On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Interestingly, a new problem just showed up on the the RHEL9 s390x
> > machine "lora", where a previously reported problem [1] apparently
> > re-appeared.  It complains about incompatible layout, previously
> > blamed on mismatch between clang and LLVM versions.  I can see that
> > its clang is v15 from clues in the conflig log, but I don't know which
> > version of LLVM is being used.  However, I see now that --with-llvm
> > was literally just turned on, so there is no reason to think that this
> > would have worked before or this work is relevant.  Strange though --
> > we must be able to JIT further than that on s390x because we have
> > crash reports in other threads (ie we made it past this and into other
> > more advanced brokenness).
> 
> I see that Mark has also just enabled --with-llvm on some POWER Linux
> animals, and they have failed in various ways.  The failures are
> strangely lacking in detail.  It seems we didn't have coverage before,
> and I recall that there were definitely versions of LLVM that *didn't*
> work for our usage in the past, which I'll need to dredge out of the
> archives.  I will try to get onto a cfarm POWER machine and see if I
> can reproduce that, before and after these commits, and whose bug is
> it etc.

Yeah, I'm slowing enabling --with-llvm on POWER, s390x, and aarch64 (but
none here yet as I write this)...

> I doubt I can get anywhere near an s390x though, and we definitely had
> pre-existing problems on that arch.

If you want to send me your ssh key, I have access to these systems
through OSUOSL and LinuxFoundation programs.

Regards,
Mark

--
Mark Wong
EDB https://enterprisedb.com



Re: LLVM 16 (opaque pointers)

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I doubt I can get anywhere near an s390x though, and we definitely had
> pre-existing problems on that arch.

Yeah.  Too bad there's no s390x in the gcc compile farm.
(I'm wondering how straight a line to draw between that fact
and llvm's evident shortcomings on s390x.)

I'm missing my old access to Red Hat's dev machines.  But in
the meantime, Mark's clearly got beaucoup access to s390
machines, so I wonder if he can let you into any of them?

            regards, tom lane



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-21 10:48:47 +1300, Thomas Munro wrote:
> On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Interestingly, a new problem just showed up on the the RHEL9 s390x
> > machine "lora", where a previously reported problem [1] apparently
> > re-appeared.  It complains about incompatible layout, previously
> > blamed on mismatch between clang and LLVM versions.  I can see that
> > its clang is v15 from clues in the conflig log, but I don't know which
> > version of LLVM is being used.  However, I see now that --with-llvm
> > was literally just turned on, so there is no reason to think that this
> > would have worked before or this work is relevant.  Strange though --
> > we must be able to JIT further than that on s390x because we have
> > crash reports in other threads (ie we made it past this and into other
> > more advanced brokenness).
> 
> I see that Mark has also just enabled --with-llvm on some POWER Linux
> animals, and they have failed in various ways.  The failures are
> strangely lacking in detail.  It seems we didn't have coverage before,
> and I recall that there were definitely versions of LLVM that *didn't*
> work for our usage in the past, which I'll need to dredge out of the
> archives.  I will try to get onto a cfarm POWER machine and see if I
> can reproduce that, before and after these commits, and whose bug is
> it etc.

I'm quite sure that jiting did pass on ppc64 at some point.


> I doubt I can get anywhere near an s390x though, and we definitely had
> pre-existing problems on that arch.

IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of
perspective.
https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553

I had made another bug report about this issue at some point, but I can't
refind it right now.

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-10-21 10:48:47 +1300, Thomas Munro wrote:
> > On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I see that Mark has also just enabled --with-llvm on some POWER Linux
> > animals, and they have failed in various ways.  The failures are
> > strangely lacking in detail.  It seems we didn't have coverage before,
> > and I recall that there were definitely versions of LLVM that *didn't*
> > work for our usage in the past, which I'll need to dredge out of the
> > archives.  I will try to get onto a cfarm POWER machine and see if I
> > can reproduce that, before and after these commits, and whose bug is
> > it etc.
>
> I'm quite sure that jiting did pass on ppc64 at some point.

Yeah, I remember debugging it on EDB's POWER machine.  First off, we
know that LLVM < 7 doesn't work for us on POWER, because:

https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com

That was fixed:

https://github.com/llvm/llvm-project/commit/a95b0df5eddbe7fa1e9f8fe0b1ff62427e1c0318

So I think that means that we'd first have to go through those animals
and figure out which ones have older LLVM, and ignore those results --
they just can't use --with-llvm.  Unfortunately there doesn't seem to
be any clue on the version from the paths used by OpenSUSE.  Mark, do
you know?

> > I doubt I can get anywhere near an s390x though, and we definitely had
> > pre-existing problems on that arch.
>
> IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of
> perspective.
> https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553

Ah, good to know about that.  But there are also reports of crashes in
released versions that manage to get passed that ABI-wobble business:

https://www.postgresql.org/message-id/flat/CAF1DzPXjpPxnsgySz2Zjm8d2dx7%3DJ070C%2BMQBFh%2B9NBNcBKCAg%40mail.gmail.com



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-21 12:02:51 +1300, Thomas Munro wrote:
> On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote:
> > > I doubt I can get anywhere near an s390x though, and we definitely had
> > > pre-existing problems on that arch.
> >
> > IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of
> > perspective.
> > https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553
> 
> Ah, good to know about that.  But there are also reports of crashes in
> released versions that manage to get passed that ABI-wobble business:
> 
>
https://www.postgresql.org/message-id/flat/CAF1DzPXjpPxnsgySz2Zjm8d2dx7%3DJ070C%2BMQBFh%2B9NBNcBKCAg%40mail.gmail.com

Trying to debug that now, using access to an s390x box provided by Mark...

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 12:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote:
> > I'm quite sure that jiting did pass on ppc64 at some point.
>
> Yeah, I remember debugging it on EDB's POWER machine.  First off, we
> know that LLVM < 7 doesn't work for us on POWER, because:
>
> https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com
>
> That was fixed:
>
> https://github.com/llvm/llvm-project/commit/a95b0df5eddbe7fa1e9f8fe0b1ff62427e1c0318
>
> So I think that means that we'd first have to go through those animals
> and figure out which ones have older LLVM, and ignore those results --
> they just can't use --with-llvm.  Unfortunately there doesn't seem to
> be any clue on the version from the paths used by OpenSUSE.  Mark, do
> you know?

Adding Mark to this subthread.  Concretely, could you please disable
--with-llvm on any of those machines running LLVM < 7?  And report
what version any remaining animals are running?  (It'd be nice if the
build farm logged "$LLVM_CONFIG --version" somewhere.)  One of them
seems to have clang 5 which is a clue -- if the LLVM is also 5 it's
just not going to work, as LLVM is one of those forwards-only projects
that doesn't back-patch fixes like that.



Re: LLVM 16 (opaque pointers)

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> (It'd be nice if the
> build farm logged "$LLVM_CONFIG --version" somewhere.)

It's not really the buildfarm script's responsibility to do that,
but feel free to make configure do so.

            regards, tom lane



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-19 06:20:26 +1300, Thomas Munro wrote:
> Interestingly, a new problem just showed up on the the RHEL9 s390x
> machine "lora", where a previously reported problem [1] apparently
> re-appeared.  It complains about incompatible layout, previously
> blamed on mismatch between clang and LLVM versions.

I've attached a patch revision that I spent the last couple hours working
on. It's very very roughly based on a patch Tom Stellard had written (which I
think a few rpm packages use). But instead of encoding details about specific
layout details, I made the code check if the data layout works and fall back
to the cpu / features used for llvmjit_types.bc.  This way it's not s390x
specific, future odd architecture behaviour would "automatically" be handled
the same.

With that at least the main regression tests pass on s390x, even with
jit_above_cost=0.


> I can see that its clang is v15 from clues in the conflig log, but I don't
> know which version of LLVM is being used.  However, I see now that
> --with-llvm was literally just turned on, so there is no reason to think
> that this would have worked before or this work is relevant.  Strange though
> -- we must be able to JIT further than that on s390x because we have crash
> reports in other threads (ie we made it past this and into other more
> advanced brokenness).

You can avoid the borkedness by a) running on an older cpu b) adding
compilation flags to change the code generation target
(e.g. -march=native). And some RPM packages have applied the patch by Tom
Stellard.

> [1]
https://www.postgresql.org/message-id/flat/20210319190047.7o4bwhbp5dzkqif3%40alap3.anarazel.de#ec51b488ca8eac8c603d91c0439d38b2

Greetings,

Andres Freund

Attachment

Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 7:08 PM Andres Freund <andres@anarazel.de> wrote:
> I've attached a patch revision that I spent the last couple hours working
> on. It's very very roughly based on a patch Tom Stellard had written (which I
> think a few rpm packages use). But instead of encoding details about specific
> layout details, I made the code check if the data layout works and fall back
> to the cpu / features used for llvmjit_types.bc.  This way it's not s390x
> specific, future odd architecture behaviour would "automatically" be handled
> the same

The explanation makes sense and this seems like a solid plan to deal
with it.  I didn't try on a s390x, but I tested locally on our master
branch with LLVM 7, 10, 17, 18, and then I hacked your patch to take
the fallback path as if a layout mismatch had been detected, and it
worked fine:

2023-10-22 11:49:55.663 NZDT [12000] DEBUG:  detected CPU "skylake",
with features "...", resulting in layout
"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
2023-10-22 11:49:55.664 NZDT [12000] DEBUG:  detected CPU / features
yield incompatible data layout, using values from module instead
2023-10-22 11:49:55.664 NZDT [12000] DETAIL:  module CPU "x86-64",
features "...", resulting in layout
"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

+    To deal with that, check if data layouts match during JIT
initialization. If
+    the runtime detected cpu / features result in a different layout,
try if the
+    cpu/features recorded in in llvmjit_types.bc work.

s|try |check |
s| in in | in |

+                                       errmsg_internal("could not
determine working CPU / feature comination for JIT compilation"),

s|comination|combination|
s| / |/|g



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > (It'd be nice if the
> > build farm logged "$LLVM_CONFIG --version" somewhere.)
>
> It's not really the buildfarm script's responsibility to do that,
> but feel free to make configure do so.

Done, copying the example of how we do it for perl and various other things.



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Sun, Oct 22, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > (It'd be nice if the
> > > build farm logged "$LLVM_CONFIG --version" somewhere.)
> >
> > It's not really the buildfarm script's responsibility to do that,
> > but feel free to make configure do so.
>
> Done, copying the example of how we do it for perl and various other things.

Build farm measles update:

With that we can see that nicator (LLVM 15 on POWER) is green.  We can
see that cavefish (LLVM 6 on POWER) is red as expected.  We can also
see that bonito (LLVM 7 on POWER) is red, so my earlier theory that
this might be due to the known bug we got fixed in LLVM 7 is not
enough.  Maybe there are other things fixed on POWER somewhere between
those LLVM versions?  I suspect it'll be hard to figure out without
debug builds and backtraces.

One thing is definitely our fault, though.  xenodermus shows failures
on REL_12_STABLE and REL_13_STABLE, using debug LLVM 6 on x86.  I
couldn't reproduce this locally on (newer) debug LLVM, so I bugged
Andres for access to the host/libraries and chased it down.  There is
some type punning for a function parameter REL_13_STABLE and earlier,
removed by Andres in REL_14_STABLE, and when I back-patched my
refactoring I effectively back-patched a few changes from his commit
df99ddc70b97 that removed the type punning, but I should have brought
one more line from that commit to remove another trace of it.  See
attached.

Attachment

Re: LLVM 16 (opaque pointers)

From
Mark Wong
Date:
On Mon, Oct 23, 2023 at 01:15:04PM +1300, Thomas Munro wrote:
> On Sun, Oct 22, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Thomas Munro <thomas.munro@gmail.com> writes:
> > > > (It'd be nice if the
> > > > build farm logged "$LLVM_CONFIG --version" somewhere.)
> > >
> > > It's not really the buildfarm script's responsibility to do that,
> > > but feel free to make configure do so.
> >
> > Done, copying the example of how we do it for perl and various other things.
> 
> Build farm measles update:
> 
> With that we can see that nicator (LLVM 15 on POWER) is green.  We can
> see that cavefish (LLVM 6 on POWER) is red as expected.  We can also
> see that bonito (LLVM 7 on POWER) is red, so my earlier theory that
> this might be due to the known bug we got fixed in LLVM 7 is not
> enough.  Maybe there are other things fixed on POWER somewhere between
> those LLVM versions?  I suspect it'll be hard to figure out without
> debug builds and backtraces.

I haven't gotten around to disabling llvm on any of my animals with llvm
< 7 yet.  Do you still want to hold on that?

> One thing is definitely our fault, though.  xenodermus shows failures
> on REL_12_STABLE and REL_13_STABLE, using debug LLVM 6 on x86.  I
> couldn't reproduce this locally on (newer) debug LLVM, so I bugged
> Andres for access to the host/libraries and chased it down.  There is
> some type punning for a function parameter REL_13_STABLE and earlier,
> removed by Andres in REL_14_STABLE, and when I back-patched my
> refactoring I effectively back-patched a few changes from his commit
> df99ddc70b97 that removed the type punning, but I should have brought
> one more line from that commit to remove another trace of it.  See
> attached.

Here are my list of llvm-config versions and distros for s390x and POWER
(didn't see any issues on aarch64 but I grabbed all the info at the same
time.)

s390x:

branta: 10.0.0 Ubuntu 20.04.4 LTS
cotinga: 6.0.0 Ubuntu 18.04.6 LTS
perch: 6.0.0 Ubuntu 18.04.6 LTS
sarus: 14.0.0 Ubuntu 22.04.1 LTS
aracari: 15.0.7 Red Hat Enterprise Linux 8.6
pipit: 15.0.7 Red Hat Enterprise Linux 8.6
lora: 15.0.7 Red Hat Enterprise Linux 9.2
mamushi: 15.0.7 Red Hat Enterprise Linux 9.2
pike: 11.0.1 Debian GNU/Linux 11
rinkhals: 11.0.1 Debian GNU/Linux 11


POWER:

bonito: 7.0.1 Fedora 29
cavefish: 6.0.0 Ubuntu 18.04.6 LTS
demoiselle: 5.0.1 openSUSE Leap 15.0
elasmobranch: 5.0.1 openSUSE Leap 15.0
babbler: 15.0.7 AlmaLinux 8.8
pytilia: 15.0.7 AlmaLinux 8.8
nicator: 15.0.7 AlmaLinux 9.2
twinspot: 15.0.7 AlmaLinux 9.2
cascabel: 11.0.1 Debian GNU/Linux 11
habu: 16.0.6 Fedora Linux 38
kingsnake: 16.0.6 Fedora Linux 38
krait: CentOS 16.0.6 Stream 8
lancehead: CentOS 16.0.6 Stream 8


aarch64:

boiga: 14.0.5 Fedora Linux 36
corzo: 14.0.5 Fedora Linux 36
desman: 16.0.6 Fedora Linux 38
motmot: 16.0.6 Fedora Linux 38
whinchat: 11.0.1 Debian GNU/Linux 11
jackdaw: 11.0.1 Debian GNU/Linux 11
blackneck: 7.0.1 Debian GNU/Linux 10
alimoche: 7.0.1 Debian GNU/Linux 10
bulbul: 15.0.7 AlmaLinux 8.8
broadbill: 15.0.7 AlmaLinux 8.8
oystercatcher: 15.0.7 AlmaLinux 9.2
potoo: 15.0.7 AlmaLinux 9.2
whiting: 6.0.0 Ubuntu 18.04.5 LTS
vimba: 6.0.0 Ubuntu 18.04.5 LTS
splitfin: 10.0.0 Ubuntu 20.04.6 LTS
rudd: 10.0.0 Ubuntu 20.04.6 LTS
turbot: 14.0.0 Ubuntu 22.04.3 LTS
shiner: 14.0.0 Ubuntu 22.04.3 LTS
ziege: 16.0.6 CentOS Stream 8
chevrotain: 11.0.1 Debian GNU/Linux 11

Regards,
Mark



Re: LLVM 16 (opaque pointers)

From
Thomas Munro
Date:
On Tue, Oct 24, 2023 at 4:27 AM Mark Wong <markwkm@gmail.com> wrote:
> I haven't gotten around to disabling llvm on any of my animals with llvm
> < 7 yet.  Do you still want to hold on that?

Yes, please disable --with-llvm on s390x and POWER animals with LLVM <
7 (see below).  Also, you have a bunch of machines with LLVM 16 that
are failing to compile on REL_11_STABLE.  That is expected, because we
agreed not to back-patch the LLVM 16 API changes into REL_11_STABLE:

> kingsnake: 16.0.6 Fedora Linux 38
> krait: CentOS 16.0.6 Stream 8
> lancehead: CentOS 16.0.6 Stream 8

These POWER machines fail as expected, and it's unfixable:

> elasmobranch: 5.0.1 openSUSE Leap 15.0
> demoiselle: 5.0.1 openSUSE Leap 15.0
> cavefish: 6.0.0 Ubuntu 18.04.6 LTS

(Well, we could in theory supply our own fixed
llvm::orc::createLocalIndirectStubsManagerBuilder() function to hide
the broken one in LLVM <= 6, but that way lies madness IMHO.  An LTS
distro that cares could look into back-patching LLVM's fixes, but for
us, let us focus on current software.)

This POWER animal fails, unexpectedly to me:

> bonito: 7.0.1 Fedora 29

We could try to chase that down, or we could rejoice that at least it
works on current release.  It must begin working somewhere between 7
and 11, but when I checked which LLVM releases I could easily install
on eg cascabel (if I could get access) using the repo at apt.llvm.org,
I saw that they don't even have anything older than 11.  So someone
with access who wants to figure this out might have many days or weeks
of compiling ahead of them.

These POWER animals are passing, as expected:

> cascabel: 11.0.1 Debian GNU/Linux 11
> babbler: 15.0.7 AlmaLinux 8.8
> pytilia: 15.0.7 AlmaLinux 8.8
> nicator: 15.0.7 AlmaLinux 9.2
> twinspot: 15.0.7 AlmaLinux 9.2
> habu: 16.0.6 Fedora Linux 38
> kingsnake: 16.0.6 Fedora Linux 38
> krait: CentOS 16.0.6 Stream 8
> lancehead: CentOS 16.0.6 Stream 8

These s390x animals are passing:

> branta: 10.0.0 Ubuntu 20.04.4 LTS
> pike: 11.0.1 Debian GNU/Linux 11
> rinkhals: 11.0.1 Debian GNU/Linux 11
> sarus: 14.0.0 Ubuntu 22.04.1 LTS

These s390x animals are failing, but don't show the layout complaint.
I can see that LLVM 6 also lacked a case for s390x in
llvm::orc::createLocalIndirectStubsManagerBuilder(), the thing that
was fixed in 7 with the addition of a default case.  Therefore these
presumably fail just like old LLVM on POWER, and it's unfixable.  So I
suggest turning off --with-llvm on these two:

> cotinga: 6.0.0 Ubuntu 18.04.6 LTS
> perch: 6.0.0 Ubuntu 18.04.6 LTS

These s390x animals are failing with the mismatched layout problem,
which should be fixed by Andres's patch to tolerate the changing
z12/z13 ABI thing by falling back to whatever clang picked (at a cost
of not using all the features of your newer CPU, unless you explicitly
tell clang to target it):

> aracari: 15.0.7 Red Hat Enterprise Linux 8.6
> pipit: 15.0.7 Red Hat Enterprise Linux 8.6
> lora: 15.0.7 Red Hat Enterprise Linux 9.2

This s390x animal doesn't actually have --with-llvm enabled so it
passes, but surely it'd be just like lora:

> mamushi: 15.0.7 Red Hat Enterprise Linux 9.2



Re: LLVM 16 (opaque pointers)

From
Mark Wong
Date:
On Tue, Oct 24, 2023 at 10:17:22AM +1300, Thomas Munro wrote:
> On Tue, Oct 24, 2023 at 4:27 AM Mark Wong <markwkm@gmail.com> wrote:
> > I haven't gotten around to disabling llvm on any of my animals with llvm
> > < 7 yet.  Do you still want to hold on that?
> 
> Yes, please disable --with-llvm on s390x and POWER animals with LLVM <
> 7 (see below).  Also, you have a bunch of machines with LLVM 16 that
> are failing to compile on REL_11_STABLE.  That is expected, because we
> agreed not to back-patch the LLVM 16 API changes into REL_11_STABLE:
> 
> > kingsnake: 16.0.6 Fedora Linux 38
> > krait: CentOS 16.0.6 Stream 8
> > lancehead: CentOS 16.0.6 Stream 8

I should have updated these to not use --with-llvm for REL_11_STABLE.

> These POWER machines fail as expected, and it's unfixable:
> 
> > elasmobranch: 5.0.1 openSUSE Leap 15.0
> > demoiselle: 5.0.1 openSUSE Leap 15.0
> > cavefish: 6.0.0 Ubuntu 18.04.6 LTS

These should now be updated to not use --with-llvm at all.

> These s390x animals are failing, but don't show the layout complaint.
> I can see that LLVM 6 also lacked a case for s390x in
> llvm::orc::createLocalIndirectStubsManagerBuilder(), the thing that
> was fixed in 7 with the addition of a default case.  Therefore these
> presumably fail just like old LLVM on POWER, and it's unfixable.  So I
> suggest turning off --with-llvm on these two:
> 
> > cotinga: 6.0.0 Ubuntu 18.04.6 LTS
> > perch: 6.0.0 Ubuntu 18.04.6 LTS

Ok, I should have removed --with-llvm here too.

> This s390x animal doesn't actually have --with-llvm enabled so it
> passes, but surely it'd be just like lora:
> 
> > mamushi: 15.0.7 Red Hat Enterprise Linux 9.2

Oops, I think I added it now.


I think I made all the recommended changes, and trimmed out the lines
where I didn't need to do anything. :)

Andres pointed out to me that my animals aren't set up to collect core
file so I'm also trying to update that too...

Regards,
Mark



Re: LLVM 16 (opaque pointers)

From
Andres Freund
Date:
Hi,

On 2023-10-24 10:17:22 +1300, Thomas Munro wrote:
> This POWER animal fails, unexpectedly to me:
> 
> > bonito: 7.0.1 Fedora 29
> 
> We could try to chase that down, or we could rejoice that at least it
> works on current release.  It must begin working somewhere between 7
> and 11, but when I checked which LLVM releases I could easily install
> on eg cascabel (if I could get access) using the repo at apt.llvm.org,
> I saw that they don't even have anything older than 11.  So someone
> with access who wants to figure this out might have many days or weeks
> of compiling ahead of them.

I could reproduce the failure on bonito. The stack trace is:
#0  0x00007fffb83541e8 in raise () from /lib64/libc.so.6
#1  0x00007fffb833448c in abort () from /lib64/libc.so.6
#2  0x00007fff9c68dd78 in std::__replacement_assert (_file=<optimized out>, _line=<optimized out>, _function=<optimized
out>,_condition=<optimized out>)
 
    at /usr/include/c++/8/ppc64le-redhat-linux/bits/c++config.h:447
#3  0x00007fff9df90838 in std::unique_ptr<llvm::orc::JITCompileCallbackManager,
std::default_delete<llvm::orc::JITCompileCallbackManager>>::operator* (
 
    this=0x1b946cb8) at ../include/llvm/Support/MemAlloc.h:29
#4  llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&,
std::function<std::unique_ptr<llvm::orc::IndirectStubsManager,std::default_delete<llvm::orc::IndirectStubsManager> >
()>)(this=0x1b946be0, TM=..., IndirectStubsMgrBuilder=...) at ../lib/ExecutionEngine/Orc/OrcCBindingsStack.h:242
 
#5  0x00007fff9df90940 in LLVMOrcCreateInstance (TM=0x1b933ae0) at /usr/include/c++/8/bits/move.h:182
#6  0x00007fffa0618f8c in llvm_session_initialize () at /home/andres/src/postgres/src/backend/jit/llvm/llvmjit.c:981
#7  0x00007fffa06179a8 in llvm_create_context (jitFlags=25) at
/home/andres/src/postgres/src/backend/jit/llvm/llvmjit.c:219
#8  0x00007fffa0626cbc in llvm_compile_expr (state=0x1b8ef390) at
/home/andres/src/postgres/src/backend/jit/llvm/llvmjit_expr.c:142
#9  0x0000000010a76fc8 in jit_compile_expr (state=0x1b8ef390) at /home/andres/src/postgres/src/backend/jit/jit.c:177
#10 0x0000000010404550 in ExecReadyExpr (state=0x1b8ef390) at
/home/andres/src/postgres/src/backend/executor/execExpr.c:875

with this assertion message printed:
/usr/include/c++/8/bits/unique_ptr.h:328: typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp,
Dp>::operator*()const [with Tp = llvm::orc::JITCompileCallbackManager; _Dp =
std::default_delete<llvm::orc::JITCompileCallbackManager>;typename std::add_lvalue_reference<_Tp>::type =
llvm::orc::JITCompileCallbackManager&]:Assertion 'get() != pointer()' failed.
 


I wanted to use a debug build to investigate in more detail, but bonito is a
small VM. Thus I built llvm 7 on a more powerful gcc cfarm machine, running on
AlmaLinux 9.2.  The problem doesn't reproduce there.

Given the crash in some c++ standard library code, that the fc29 patches to
llvm look harmless, that building/using llvm 7 on a newer distro does not show
issues on PPC, it seems likely that this is a compiler / standard library
issue.

FC 29 is well out of support, so I don't think it makes sense to invest any
further time in this. Personally, I don't think it's useful to have years old
fedora in the buildfarm...

Greetings,

Andres Freund



Re: LLVM 16 (opaque pointers)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> FC 29 is well out of support, so I don't think it makes sense to invest any
> further time in this. Personally, I don't think it's useful to have years old
> fedora in the buildfarm...

+1.  It's good to test old LTS distros, but Fedora releases have a
short shelf life by design.

            regards, tom lane



Re: LLVM 16 (opaque pointers)

From
Mark Wong
Date:
On Mon, Oct 23, 2023 at 10:47:24PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > FC 29 is well out of support, so I don't think it makes sense to invest any
> > further time in this. Personally, I don't think it's useful to have years old
> > fedora in the buildfarm...
> 
> +1.  It's good to test old LTS distros, but Fedora releases have a
> short shelf life by design.

I'll start retiring those old Fedora ones I have. :)

Regards,
Mark