Thread: LLVM 16 (opaque pointers)
Hi, Here is a draft version of the long awaited patch to support LLVM 16. It's mostly mechanical donkeywork, but it took some time: this donkey found it quite hard to understand the mighty getelementptr instruction[1] and the code generation well enough to figure out all the right types, and small mistakes took some debugging effort*. I now finally have a patch that passes all tests. Though it's not quite ready yet, I thought I should give this status update to report that the main task is more or less complete, since we're starting to get quite a few emails about it (mostly from Fedora users) and there is an entry for it on the Open Items for 16 wiki page. Comments/review/testing welcome. Here are some things I think I need to do next (probably after PGCon): 1. If you use non-matching clang and LLVM versions I think we might use "clang -no-opaque-pointers" at the wrong times (I've not looked into that interaction yet). 2. The treatment of function types is a bit inconsistent/messy and could be tidied up. 3. There are quite a lot of extra function calls that could perhaps be elided (ie type variables instead of LLVMTypeInt8(), and calls to LLVMStructGetTypeAtIndex() that are not used in LLVM < 16). 4. Could use some comments. 5. I need to test with very old versions of LLVM and Clang that we claim to support (I have several years' worth of releases around but nothing older than 9). 6. I need to go through the types again with a fine tooth comb, and check the test coverage to look out for eg GEP array arithmetic with the wrong type/size that isn't being exercised. *For anyone working with this type of IR generation code and questioning their sanity, I can pass on some excellent advice I got from Andres: build LLVM yourself with assertions enabled, as they catch some classes of silly mistake that otherwise just segfault inscrutably on execution. [1] https://llvm.org/docs/GetElementPtr.html
Attachment
Oh, one important thing I forgot to mention: that patch is for LLVM 16 only, and I developed it with a local build of their "release/16.x" branch on a FreeBSD box, and also tested with a released package for 16 on a Debian box. Further changes are already needed for their "main" branch (LLVM 17-to-be), so this won't quite be enough to shut seawasp up. At a glance, we will need to change from the "old pass manager" API that has recently been vaporised[1] (llvm-c/Transforms/PassManagerBuilder.h) to the new one[2][3] (llvm-c/Transforms/PassBuilder.h), which I suspect/hope will be as simple as changing llvmjit.c to call LLVMRunPasses() with a string describing the passes we want in "opt -passes" format, instead of our code that calls LLVMAddFunctionInlingPass() etc. But that'll be a topic for another day, and another thread. [1] https://github.com/llvm/llvm-project/commit/0aac9a2875bad4f065367e4a6553fad78605f895 [2] https://llvm.org/docs/NewPassManager.html [3] https://reviews.llvm.org/D102136
> On Mon, May 22, 2023 at 03:38:44PM +1200, Thomas Munro wrote: > Further changes are already needed for their "main" branch (LLVM > 17-to-be), so this won't quite be enough to shut seawasp up. At a > glance, we will need to change from the "old pass manager" API that > has recently been vaporised[1] > (llvm-c/Transforms/PassManagerBuilder.h) to the new one[2][3] > (llvm-c/Transforms/PassBuilder.h), which I suspect/hope will be as > simple as changing llvmjit.c to call LLVMRunPasses() with a string > describing the passes we want in "opt -passes" format, instead of our > code that calls LLVMAddFunctionInlingPass() etc. But that'll be a > topic for another day, and another thread. > > [1] https://github.com/llvm/llvm-project/commit/0aac9a2875bad4f065367e4a6553fad78605f895 > [2] https://llvm.org/docs/NewPassManager.html > [3] https://reviews.llvm.org/D102136 Thanks for tackling the topic! I've tested it with a couple of versions, LLVM 12 that comes with my Gentoo box, LLVM 15 build from sources and the modified version of patch adopted for LLVM 17 (build form sources as well). In all three cases everything seems to be working fine. Simple benchmarking with a query stolen from some other jit thread (pgbench running single client with multiple unions of selects a-la SELECT a, count(*), sum(b) FROM test WHERE c = 2 GROUP BY a) show some slight performance differences, but nothing dramatic so far. LLVM 17 version produces the lowest latency, with faster generation, inlining and optimization, but slower emission time. LLVM 12 version produces the largest latencies with everything except emission timings being slower. LLVM 15 is somewhere in between. I'll continue reviewing and, for the records, attach adjustments I was using for LLVM 17 (purely for testing, not taking into account other versions), in case if I've missed something.
Attachment
Le dimanche 21 mai 2023, 05:01:41 CEST Thomas Munro a écrit : > Hi, > > Here is a draft version of the long awaited patch to support LLVM 16. > It's mostly mechanical donkeywork, but it took some time: this donkey > found it quite hard to understand the mighty getelementptr > instruction[1] and the code generation well enough to figure out all > the right types, and small mistakes took some debugging effort*. I > now finally have a patch that passes all tests. > > Though it's not quite ready yet, I thought I should give this status > update to report that the main task is more or less complete, since > we're starting to get quite a few emails about it (mostly from Fedora > users) and there is an entry for it on the Open Items for 16 wiki > page. Comments/review/testing welcome. Hello Thomas, Thank you for this effort ! I've tested it against llvm 15 and 16, and found no problem with it. > 6. I need to go through the types again with a fine tooth comb, and > check the test coverage to look out for eg GEP array arithmetic with > the wrong type/size that isn't being exercised. I haven't gone through the test coverage myself, but I exercised the following things: - running make installcheck with jit_above_cost = 0 - letting sqlsmith hammer random queries at it for a few hours. This didn't show obvious issues. > *For anyone working with this type of IR generation code and > questioning their sanity, I can pass on some excellent advice I got > from Andres: build LLVM yourself with assertions enabled, as they > catch some classes of silly mistake that otherwise just segfault > > inscrutably on execution. I tried my hand at backporting it to previous versions, and not knowing anything about it made me indeed question my sanity. It's quite easy for PG 15, 14, 13. PG 12 is nothing insurmontable either, but PG 11 is a bit hairier most notably due to to the change in fcinfo args representation. But I guess that's also a topic for another day :-) Best regards, -- Ronan Dunklau
Hi, On 2023-08-10 16:56:54 +0200, Ronan Dunklau wrote: > I tried my hand at backporting it to previous versions, and not knowing > anything about it made me indeed question my sanity. It's quite easy for PG > 15, 14, 13. PG 12 is nothing insurmontable either, but PG 11 is a bit hairier > most notably due to to the change in fcinfo args representation. But I guess > that's also a topic for another day :-) Given that 11 is about to be EOL, I don't think it's worth spending the time to support a new LLVM version for it. Greetings, Andres Freund
Hi, On 2023-05-21 15:01:41 +1200, Thomas Munro wrote: > *For anyone working with this type of IR generation code and > questioning their sanity, I can pass on some excellent advice I got > from Andres: build LLVM yourself with assertions enabled, as they > catch some classes of silly mistake that otherwise just segfault > inscrutably on execution. Hm. I think we need a buildfarm animal with an assertion enabled llvm 16 once we merge this. I think after an upgrade my buildfarm machine has the necessary resources. > @@ -150,7 +150,7 @@ llvm_compile_expr(ExprState *state) > > /* create function */ > eval_fn = LLVMAddFunction(mod, funcname, > - llvm_pg_var_func_type("TypeExprStateEvalFunc")); > + llvm_pg_var_func_type("ExecInterpExprStillValid")); Hm, that's a bit ugly. But ... > @@ -77,9 +80,44 @@ extern Datum AttributeTemplate(PG_FUNCTION_ARGS); > Datum > AttributeTemplate(PG_FUNCTION_ARGS) > { > + PGFunction fp PG_USED_FOR_ASSERTS_ONLY; > + > + fp = &AttributeTemplate; > PG_RETURN_NULL(); > } Other parts of the file do this by putting the functions into referenced_functions[], i'd copy that here and below. > +void > +ExecEvalSubroutineTemplate(ExprState *state, > + struct ExprEvalStep *op, > + ExprContext *econtext) > +{ > + ExecEvalSubroutine fp PG_USED_FOR_ASSERTS_ONLY; > + > + fp = &ExecEvalSubroutineTemplate; > +} > + > +extern bool ExecEvalBoolSubroutineTemplate(ExprState *state, > + struct ExprEvalStep *op, > + ExprContext *econtext); > +bool > +ExecEvalBoolSubroutineTemplate(ExprState *state, > + struct ExprEvalStep *op, > + ExprContext *econtext) > +{ > + ExecEvalBoolSubroutine fp PG_USED_FOR_ASSERTS_ONLY; > + > + fp = &ExecEvalBoolSubroutineTemplate; > + return false; > +} > + Thanks for working on this! Greetings, Andres Freund
> [...] Further changes are already needed for their "main" branch (LLVM > 17-to-be), so this won't quite be enough to shut seawasp up. For information, the physical server which was hosting my 2 bf animals (seawasp and moonjelly) has given up rebooting after a power cut a few weeks/months ago, and I have not setup a replacement (yet). -- Fabien.
Belated thanks Dmitry, Ronan, Andres for your feedback. Here's a new version, also including Dmitry's patch for 17 which it is now also time to push. It required a bit more trivial #if magic to be conditional, as Dmitry already mentioned. I just noticed that Dmitry had the LLVMPassBuilderOptionsSetInlinerThreshold() function added to LLVM 17's C API for this patch. Thanks! (Better than putting stuff in llvmjit_wrap.c, if you can get it upstreamed in time.) I thought I needed to block users from building with too-old clang and too-new LLVM, but I haven't managed to find a combination that actually breaks anything. I wouldn't recommend it, but for example clang 10 bitcode seems to be inlinable without problems by LLVM 16 on my system (I didn't use an assert build of LLVM though). I think that could be a separate adjustment if we learn that we need to enforce or document a constraint there. So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main branch) against PostgreSQL versions 14, 15, 16. I've attached the versions that apply to master and 16, and pushed back-patches to 14 and 15 to public branches if anyone's interested[1]. Back-patching further seems a bit harder. I'm quite willing to do it, but ... do we actually need to, ie does anyone really *require* old PostgreSQL release branches to work with new LLVM? (I'll start a separate thread about the related question of when we get to drop support for old LLVMs.) One point from an earlier email: On Sat, Aug 12, 2023 at 6:09 AM Andres Freund <andres@anarazel.de> wrote: > > AttributeTemplate(PG_FUNCTION_ARGS) > > { > > + PGFunction fp PG_USED_FOR_ASSERTS_ONLY; > > + > > + fp = &AttributeTemplate; > Other parts of the file do this by putting the functions into > referenced_functions[], i'd copy that here and below. Actually here I just wanted to assert that the 3 template functions match certain function pointer types. To restate what these functions are about: in the JIT code I need the function type, but we have only the function pointer type, and it is now impossible to go from a function pointer type to a function type, so I needed to define some example functions with the right prototype (well, one of them existed already but I needed more), and then I wanted to assert that they are assignable to the appropriate function pointer types. Does that make sense? In this version I changed it to what I hope is a more obvious/explicit expression of that goal: + AssertVariableIsOfType(&ExecEvalSubroutineTemplate, + ExecEvalSubroutine); [1] https://github.com/macdice/postgres/tree/llvm16-14 and -15
Attachment
Hi Thomas, On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote: > So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main > branch) against PostgreSQL versions 14, 15, 16. I've attached the > versions that apply to master and 16, and pushed back-patches to 14 > and 15 to public branches if anyone's interested[1]. Back-patching > further seems a bit harder. I'm quite willing to do it, but ... do we > actually need to, ie does anyone really *require* old PostgreSQL > release branches to work with new LLVM? RHEL releases new LLVM version along with their new minor releases every 6 month, and we have to build older versions with new LLVM each time. From RHEL point of view, it would be great if we can back-patch back to v12 :( Regards, -- Devrim Gündüz Open Source Solution Architect, PostgreSQL Major Contributor Twitter: @DevrimGunduz , @DevrimGunduzTR
On Thu, Sep 21, 2023 at 12:24 PM Devrim Gündüz <devrim@gunduz.org> wrote: > On Thu, 2023-09-21 at 08:22 +1200, Thomas Munro wrote: > > So far I've tested LLVM versions 10, 15, 16, 17, 18 (= their main > > branch) against PostgreSQL versions 14, 15, 16. I've attached the > > versions that apply to master and 16, and pushed back-patches to 14 > > and 15 to public branches if anyone's interested[1]. Back-patching > > further seems a bit harder. I'm quite willing to do it, but ... do we > > actually need to, ie does anyone really *require* old PostgreSQL > > release branches to work with new LLVM? > > RHEL releases new LLVM version along with their new minor releases every > 6 month, and we have to build older versions with new LLVM each time. > From RHEL point of view, it would be great if we can back-patch back to > v12 :( Got it. OK, I'll work on 12 and 13 now.
On Thu, Sep 21, 2023 at 12:47 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Sep 21, 2023 at 12:24 PM Devrim Gündüz <devrim@gunduz.org> wrote: > > RHEL releases new LLVM version along with their new minor releases every > > 6 month, and we have to build older versions with new LLVM each time. > > From RHEL point of view, it would be great if we can back-patch back to > > v12 :( > > Got it. OK, I'll work on 12 and 13 now. The back-patch to 12 was a little trickier than anticipated, but after taking a break and trying again I now have PG 12...17 patches that I've tested against LLVM 10...18 (that's 54 combinations), in every case only with the clang corresponding to LLVM. For 12, I decided to back-patch the llvm_types_module variable that was introduced in 13, to keep the code more similar. For master, I had to rebase over Daniel's recent commits, which required re-adding unused variables removed by 2dad308e, and then changing a bunch of LLVM type constructors like LLVMInt8Type() to the LLVMInt8TypeInContext(lc, ...) variants following the example of 9dce2203. Without that, type assertions in my LLVM 18 debug build would fail (and maybe there could be a leak problem, though I'm not sure that really applied to integer (non-struct) types). I've attached only the patches for master, but the 12-16 versions are available at https://github.com/macdice/postgres/tree/llvm16-$N in case anyone has comments on those.
Attachment
Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit : > The back-patch to 12 was a little trickier than anticipated, but after > taking a break and trying again I now have PG 12...17 patches that > I've tested against LLVM 10...18 (that's 54 combinations), in every > case only with the clang corresponding to LLVM. Thank you Thomas for those patches, and the extensive testing, I will run my own and let you know. > I've attached only the patches for master, but the 12-16 versions are > available at https://github.com/macdice/postgres/tree/llvm16-$N in > case anyone has comments on those. For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not mistaken. Best regards, -- Ronan Dunklau
Hi, On 2023-10-11 21:59:50 +1300, Thomas Munro wrote: > +#else > + LLVMPassBuilderOptionsRef options; > + LLVMErrorRef err; > + int compile_optlevel; > + char *passes; > + > + if (context->base.flags & PGJIT_OPT3) > + compile_optlevel = 3; > + else > + compile_optlevel = 0; > + > + passes = psprintf("default<O%d>,mem2reg,function(no-op-function),no-op-module", > + compile_optlevel); I don't think the "function(no-op-function),no-op-module" bit does something particularly useful? I also don't think we should add the mem2reg pass outside of -O0 - running it after a real optimization pipeline doesn't seem useful and might even make the code worse? mem2reg is included in default<O1> (and obviously also in O3). Thanks for working on this stuff! I'm working on setting up buildfarm animals for 16, 17, each once with a normal and an assertion enabled LLVM build. Greetings, Andres Freund
> On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote: > Hi, > > On 2023-10-11 21:59:50 +1300, Thomas Munro wrote: > > +#else > > + LLVMPassBuilderOptionsRef options; > > + LLVMErrorRef err; > > + int compile_optlevel; > > + char *passes; > > + > > + if (context->base.flags & PGJIT_OPT3) > > + compile_optlevel = 3; > > + else > > + compile_optlevel = 0; > > + > > + passes = psprintf("default<O%d>,mem2reg,function(no-op-function),no-op-module", > > + compile_optlevel); > > I don't think the "function(no-op-function),no-op-module" bit does something > particularly useful? Right, looks like leftovers after verifying which passes were actually applied. My bad, could be removed. > I also don't think we should add the mem2reg pass outside of -O0 - running it > after a real optimization pipeline doesn't seem useful and might even make the > code worse? mem2reg is included in default<O1> (and obviously also in O3). My understanding was that while mem2reg is included everywhere above -O0, this set of passes won't hurt. But yeah, if you say it could degrade the final result, it's better to not do this. I'll update this part.
> On Fri, Oct 13, 2023 at 11:06:21AM +0200, Dmitry Dolgov wrote: > > On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote: > > I don't think the "function(no-op-function),no-op-module" bit does something > > particularly useful? > > Right, looks like leftovers after verifying which passes were actually > applied. My bad, could be removed. > > > I also don't think we should add the mem2reg pass outside of -O0 - running it > > after a real optimization pipeline doesn't seem useful and might even make the > > code worse? mem2reg is included in default<O1> (and obviously also in O3). > > My understanding was that while mem2reg is included everywhere above > -O0, this set of passes won't hurt. But yeah, if you say it could > degrade the final result, it's better to not do this. I'll update this > part. Here is what I had in mind (only this part in the second patch was changed).
Attachment
Hi, On 2023-10-13 11:06:21 +0200, Dmitry Dolgov wrote: > > On Thu, Oct 12, 2023 at 04:31:20PM -0700, Andres Freund wrote: > > I also don't think we should add the mem2reg pass outside of -O0 - running it > > after a real optimization pipeline doesn't seem useful and might even make the > > code worse? mem2reg is included in default<O1> (and obviously also in O3). > > My understanding was that while mem2reg is included everywhere above > -O0, this set of passes won't hurt. But yeah, if you say it could > degrade the final result, it's better to not do this. I'll update this > part. It's indeed included anywhere above that, but adding it explicitly to the schedule means it's excuted twice: echo 'int foo(int a) { return a / 343; }' | clang-16 -emit-llvm -x c -c -o - -S -|sed -e 's/optnone//'|opt-17 -debug-pass-manager-passes='default<O1>,mem2reg' -o /dev/null 2>&1|grep Promote Running pass: PromotePass on foo (2 instructions) Running pass: PromotePass on foo (2 instructions) The second one is in a point in the pipeline where it doesn't help. It also requires another analysis pass to be executed unnecessarily. Greetings, Andres Freund
On 2023-10-13 16:44:13 +0200, Dmitry Dolgov wrote: > Here is what I had in mind (only this part in the second patch was changed). Makes sense to me. I think we'll likely eventually want to use a custom pipeline anyway, and I think we should consider using an optimization level inbetween "not at all" "as hard as possible"... Greetings, Andres Freund
On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote: > Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit : > > The back-patch to 12 was a little trickier than anticipated, but after > > taking a break and trying again I now have PG 12...17 patches that > > I've tested against LLVM 10...18 (that's 54 combinations), in every > > case only with the clang corresponding to LLVM. > > Thank you Thomas for those patches, and the extensive testing, I will run my > own and let you know. Thanks! No news is good news, I hope? I'm hoping to commit this today. > > I've attached only the patches for master, but the 12-16 versions are > > available at https://github.com/macdice/postgres/tree/llvm16-$N in > > case anyone has comments on those. > > For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not > used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not > mistaken. Right, looks like I can remove that in those branches.
On Sat, Oct 14, 2023 at 3:56 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-10-13 16:44:13 +0200, Dmitry Dolgov wrote: > > Here is what I had in mind (only this part in the second patch was changed). > > Makes sense to me. I think we'll likely eventually want to use a custom > pipeline anyway, and I think we should consider using an optimization level > inbetween "not at all" "as hard as possible"... Thanks Dmitry and Andres. I'm planning to commit these today if there are no further comments.
Le vendredi 13 octobre 2023, 22:32:13 CEST Thomas Munro a écrit : > On Wed, Oct 11, 2023 at 10:31 PM Ronan Dunklau <ronan.dunklau@aiven.io> wrote: > > Le mercredi 11 octobre 2023, 10:59:50 CEST Thomas Munro a écrit : > > > The back-patch to 12 was a little trickier than anticipated, but after > > > taking a break and trying again I now have PG 12...17 patches that > > > I've tested against LLVM 10...18 (that's 54 combinations), in every > > > case only with the clang corresponding to LLVM. > > > > Thank you Thomas for those patches, and the extensive testing, I will run > > my own and let you know. > > Thanks! No news is good news, I hope? I'm hoping to commit this today. > > > > I've attached only the patches for master, but the 12-16 versions are > > > available at https://github.com/macdice/postgres/tree/llvm16-$N in > > > case anyone has comments on those. > > > > For PG13 and PG12, it looks like the ExecEvalBoolSubroutineTemplate is not > > used anywhere, as ExecEvalBoolSubroutine was introduced in PG14 if I'm not > > mistaken. > > Right, looks like I can remove that in those branches. Oh sorry I thought I followed up. I ran the same stress testing involving several hours of sqlsmith with all jit costs set to zero and didn't notice anything with LLVM16. Thank you ! -- Ronan Dunklau
I pushed the first patch, for LLVM 16, and the build farm told me that some old LLVM versions don't like it. The problem seems to be the function LLVMGlobalGetValueType(). I can see that that was only added to the C API in 2018, so it looks like I may need to back-port that (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8.
On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I pushed the first patch, for LLVM 16, and the build farm told me that > some old LLVM versions don't like it. The problem seems to be the > function LLVMGlobalGetValueType(). I can see that that was only added > to the C API in 2018, so it looks like I may need to back-port that > (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8. Concretely something like the attached should probably fix it, but it'll take me a little while to confirm that...
Attachment
On Thu, Oct 19, 2023 at 1:06 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I pushed the first patch, for LLVM 16, and the build farm told me that > > some old LLVM versions don't like it. The problem seems to be the > > function LLVMGlobalGetValueType(). I can see that that was only added > > to the C API in 2018, so it looks like I may need to back-port that > > (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8. > > Concretely something like the attached should probably fix it, but > it'll take me a little while to confirm that... Pushed, after digging up some old LLVM skeletons to test, and those "old LLVM" animals are turning green now. I went ahead and pushed the much smaller and simpler patch for LLVM 17. Interestingly, a new problem just showed up on the the RHEL9 s390x machine "lora", where a previously reported problem [1] apparently re-appeared. It complains about incompatible layout, previously blamed on mismatch between clang and LLVM versions. I can see that its clang is v15 from clues in the conflig log, but I don't know which version of LLVM is being used. However, I see now that --with-llvm was literally just turned on, so there is no reason to think that this would have worked before or this work is relevant. Strange though -- we must be able to JIT further than that on s390x because we have crash reports in other threads (ie we made it past this and into other more advanced brokenness). [1] https://www.postgresql.org/message-id/flat/20210319190047.7o4bwhbp5dzkqif3%40alap3.anarazel.de#ec51b488ca8eac8c603d91c0439d38b2
Hi, On 2023-10-19 06:20:26 +1300, Thomas Munro wrote: > On Thu, Oct 19, 2023 at 1:06 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Thu, Oct 19, 2023 at 12:02 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > > I pushed the first patch, for LLVM 16, and the build farm told me that > > > some old LLVM versions don't like it. The problem seems to be the > > > function LLVMGlobalGetValueType(). I can see that that was only added > > > to the C API in 2018, so it looks like I may need to back-port that > > > (trivial) wrapper into our own llvmjit_wrap.cpp for LLVM < 8. > > > > Concretely something like the attached should probably fix it, but > > it'll take me a little while to confirm that... > > Pushed, after digging up some old LLVM skeletons to test, and those > "old LLVM" animals are turning green now. I went ahead and pushed the > much smaller and simpler patch for LLVM 17. I enabled a new set of buildfarm animals to test LLVM 16 and 17. I initially forgot to disable them for 11, which means we'll have those failed build on 11 until they age out :/. Particularly for the LLVM debug builds it'll take a fair bit to run on all branches. Each branch takes about 3h. Greetings, Andres Freund
Hi, On Thu, 2023-10-19 at 06:20 +1300, Thomas Munro wrote: > Pushed, after digging up some old LLVM skeletons to test, and those > "old LLVM" animals are turning green now. I went ahead and pushed the > much smaller and simpler patch for LLVM 17. Thank you! I can confirm that RPMs built fine on Fedora 39 with those patches, which ships LLVM 17.0.2 as of today. I can also confirm that builds are not broken on RHEL 9 and 8 and Fedora 37 which ship LLVM 15, and Fedora 38 (LLVM 16). Thanks again! Regards, -- Devrim Gündüz Open Source Solution Architect, PostgreSQL Major Contributor Twitter: @DevrimGunduz , @DevrimGunduzTR
On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Interestingly, a new problem just showed up on the the RHEL9 s390x > machine "lora", where a previously reported problem [1] apparently > re-appeared. It complains about incompatible layout, previously > blamed on mismatch between clang and LLVM versions. I can see that > its clang is v15 from clues in the conflig log, but I don't know which > version of LLVM is being used. However, I see now that --with-llvm > was literally just turned on, so there is no reason to think that this > would have worked before or this work is relevant. Strange though -- > we must be able to JIT further than that on s390x because we have > crash reports in other threads (ie we made it past this and into other > more advanced brokenness). I see that Mark has also just enabled --with-llvm on some POWER Linux animals, and they have failed in various ways. The failures are strangely lacking in detail. It seems we didn't have coverage before, and I recall that there were definitely versions of LLVM that *didn't* work for our usage in the past, which I'll need to dredge out of the archives. I will try to get onto a cfarm POWER machine and see if I can reproduce that, before and after these commits, and whose bug is it etc. I doubt I can get anywhere near an s390x though, and we definitely had pre-existing problems on that arch.
On Sat, Oct 21, 2023 at 10:48:47AM +1300, Thomas Munro wrote: > On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Interestingly, a new problem just showed up on the the RHEL9 s390x > > machine "lora", where a previously reported problem [1] apparently > > re-appeared. It complains about incompatible layout, previously > > blamed on mismatch between clang and LLVM versions. I can see that > > its clang is v15 from clues in the conflig log, but I don't know which > > version of LLVM is being used. However, I see now that --with-llvm > > was literally just turned on, so there is no reason to think that this > > would have worked before or this work is relevant. Strange though -- > > we must be able to JIT further than that on s390x because we have > > crash reports in other threads (ie we made it past this and into other > > more advanced brokenness). > > I see that Mark has also just enabled --with-llvm on some POWER Linux > animals, and they have failed in various ways. The failures are > strangely lacking in detail. It seems we didn't have coverage before, > and I recall that there were definitely versions of LLVM that *didn't* > work for our usage in the past, which I'll need to dredge out of the > archives. I will try to get onto a cfarm POWER machine and see if I > can reproduce that, before and after these commits, and whose bug is > it etc. Yeah, I'm slowing enabling --with-llvm on POWER, s390x, and aarch64 (but none here yet as I write this)... > I doubt I can get anywhere near an s390x though, and we definitely had > pre-existing problems on that arch. If you want to send me your ssh key, I have access to these systems through OSUOSL and LinuxFoundation programs. Regards, Mark -- Mark Wong EDB https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > I doubt I can get anywhere near an s390x though, and we definitely had > pre-existing problems on that arch. Yeah. Too bad there's no s390x in the gcc compile farm. (I'm wondering how straight a line to draw between that fact and llvm's evident shortcomings on s390x.) I'm missing my old access to Red Hat's dev machines. But in the meantime, Mark's clearly got beaucoup access to s390 machines, so I wonder if he can let you into any of them? regards, tom lane
Hi, On 2023-10-21 10:48:47 +1300, Thomas Munro wrote: > On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > Interestingly, a new problem just showed up on the the RHEL9 s390x > > machine "lora", where a previously reported problem [1] apparently > > re-appeared. It complains about incompatible layout, previously > > blamed on mismatch between clang and LLVM versions. I can see that > > its clang is v15 from clues in the conflig log, but I don't know which > > version of LLVM is being used. However, I see now that --with-llvm > > was literally just turned on, so there is no reason to think that this > > would have worked before or this work is relevant. Strange though -- > > we must be able to JIT further than that on s390x because we have > > crash reports in other threads (ie we made it past this and into other > > more advanced brokenness). > > I see that Mark has also just enabled --with-llvm on some POWER Linux > animals, and they have failed in various ways. The failures are > strangely lacking in detail. It seems we didn't have coverage before, > and I recall that there were definitely versions of LLVM that *didn't* > work for our usage in the past, which I'll need to dredge out of the > archives. I will try to get onto a cfarm POWER machine and see if I > can reproduce that, before and after these commits, and whose bug is > it etc. I'm quite sure that jiting did pass on ppc64 at some point. > I doubt I can get anywhere near an s390x though, and we definitely had > pre-existing problems on that arch. IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of perspective. https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553 I had made another bug report about this issue at some point, but I can't refind it right now. Greetings, Andres Freund
On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote: > On 2023-10-21 10:48:47 +1300, Thomas Munro wrote: > > On Thu, Oct 19, 2023 at 6:20 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I see that Mark has also just enabled --with-llvm on some POWER Linux > > animals, and they have failed in various ways. The failures are > > strangely lacking in detail. It seems we didn't have coverage before, > > and I recall that there were definitely versions of LLVM that *didn't* > > work for our usage in the past, which I'll need to dredge out of the > > archives. I will try to get onto a cfarm POWER machine and see if I > > can reproduce that, before and after these commits, and whose bug is > > it etc. > > I'm quite sure that jiting did pass on ppc64 at some point. Yeah, I remember debugging it on EDB's POWER machine. First off, we know that LLVM < 7 doesn't work for us on POWER, because: https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com That was fixed: https://github.com/llvm/llvm-project/commit/a95b0df5eddbe7fa1e9f8fe0b1ff62427e1c0318 So I think that means that we'd first have to go through those animals and figure out which ones have older LLVM, and ignore those results -- they just can't use --with-llvm. Unfortunately there doesn't seem to be any clue on the version from the paths used by OpenSUSE. Mark, do you know? > > I doubt I can get anywhere near an s390x though, and we definitely had > > pre-existing problems on that arch. > > IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of > perspective. > https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553 Ah, good to know about that. But there are also reports of crashes in released versions that manage to get passed that ABI-wobble business: https://www.postgresql.org/message-id/flat/CAF1DzPXjpPxnsgySz2Zjm8d2dx7%3DJ070C%2BMQBFh%2B9NBNcBKCAg%40mail.gmail.com
Hi, On 2023-10-21 12:02:51 +1300, Thomas Munro wrote: > On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote: > > > I doubt I can get anywhere near an s390x though, and we definitely had > > > pre-existing problems on that arch. > > > > IMO an LLVM bug, rather than a postgres bug, but I guess it's all a matter of > > perspective. > > https://github.com/llvm/llvm-project/issues/53009#issuecomment-1042748553 > > Ah, good to know about that. But there are also reports of crashes in > released versions that manage to get passed that ABI-wobble business: > > https://www.postgresql.org/message-id/flat/CAF1DzPXjpPxnsgySz2Zjm8d2dx7%3DJ070C%2BMQBFh%2B9NBNcBKCAg%40mail.gmail.com Trying to debug that now, using access to an s390x box provided by Mark... Greetings, Andres Freund
On Sat, Oct 21, 2023 at 12:02 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Oct 21, 2023 at 11:12 AM Andres Freund <andres@anarazel.de> wrote: > > I'm quite sure that jiting did pass on ppc64 at some point. > > Yeah, I remember debugging it on EDB's POWER machine. First off, we > know that LLVM < 7 doesn't work for us on POWER, because: > > https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com > > That was fixed: > > https://github.com/llvm/llvm-project/commit/a95b0df5eddbe7fa1e9f8fe0b1ff62427e1c0318 > > So I think that means that we'd first have to go through those animals > and figure out which ones have older LLVM, and ignore those results -- > they just can't use --with-llvm. Unfortunately there doesn't seem to > be any clue on the version from the paths used by OpenSUSE. Mark, do > you know? Adding Mark to this subthread. Concretely, could you please disable --with-llvm on any of those machines running LLVM < 7? And report what version any remaining animals are running? (It'd be nice if the build farm logged "$LLVM_CONFIG --version" somewhere.) One of them seems to have clang 5 which is a clue -- if the LLVM is also 5 it's just not going to work, as LLVM is one of those forwards-only projects that doesn't back-patch fixes like that.
Thomas Munro <thomas.munro@gmail.com> writes: > (It'd be nice if the > build farm logged "$LLVM_CONFIG --version" somewhere.) It's not really the buildfarm script's responsibility to do that, but feel free to make configure do so. regards, tom lane
Hi, On 2023-10-19 06:20:26 +1300, Thomas Munro wrote: > Interestingly, a new problem just showed up on the the RHEL9 s390x > machine "lora", where a previously reported problem [1] apparently > re-appeared. It complains about incompatible layout, previously > blamed on mismatch between clang and LLVM versions. I've attached a patch revision that I spent the last couple hours working on. It's very very roughly based on a patch Tom Stellard had written (which I think a few rpm packages use). But instead of encoding details about specific layout details, I made the code check if the data layout works and fall back to the cpu / features used for llvmjit_types.bc. This way it's not s390x specific, future odd architecture behaviour would "automatically" be handled the same. With that at least the main regression tests pass on s390x, even with jit_above_cost=0. > I can see that its clang is v15 from clues in the conflig log, but I don't > know which version of LLVM is being used. However, I see now that > --with-llvm was literally just turned on, so there is no reason to think > that this would have worked before or this work is relevant. Strange though > -- we must be able to JIT further than that on s390x because we have crash > reports in other threads (ie we made it past this and into other more > advanced brokenness). You can avoid the borkedness by a) running on an older cpu b) adding compilation flags to change the code generation target (e.g. -march=native). And some RPM packages have applied the patch by Tom Stellard. > [1] https://www.postgresql.org/message-id/flat/20210319190047.7o4bwhbp5dzkqif3%40alap3.anarazel.de#ec51b488ca8eac8c603d91c0439d38b2 Greetings, Andres Freund
Attachment
On Sat, Oct 21, 2023 at 7:08 PM Andres Freund <andres@anarazel.de> wrote: > I've attached a patch revision that I spent the last couple hours working > on. It's very very roughly based on a patch Tom Stellard had written (which I > think a few rpm packages use). But instead of encoding details about specific > layout details, I made the code check if the data layout works and fall back > to the cpu / features used for llvmjit_types.bc. This way it's not s390x > specific, future odd architecture behaviour would "automatically" be handled > the same The explanation makes sense and this seems like a solid plan to deal with it. I didn't try on a s390x, but I tested locally on our master branch with LLVM 7, 10, 17, 18, and then I hacked your patch to take the fallback path as if a layout mismatch had been detected, and it worked fine: 2023-10-22 11:49:55.663 NZDT [12000] DEBUG: detected CPU "skylake", with features "...", resulting in layout "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" 2023-10-22 11:49:55.664 NZDT [12000] DEBUG: detected CPU / features yield incompatible data layout, using values from module instead 2023-10-22 11:49:55.664 NZDT [12000] DETAIL: module CPU "x86-64", features "...", resulting in layout "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" + To deal with that, check if data layouts match during JIT initialization. If + the runtime detected cpu / features result in a different layout, try if the + cpu/features recorded in in llvmjit_types.bc work. s|try |check | s| in in | in | + errmsg_internal("could not determine working CPU / feature comination for JIT compilation"), s|comination|combination| s| / |/|g
On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > (It'd be nice if the > > build farm logged "$LLVM_CONFIG --version" somewhere.) > > It's not really the buildfarm script's responsibility to do that, > but feel free to make configure do so. Done, copying the example of how we do it for perl and various other things.
On Sun, Oct 22, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > > (It'd be nice if the > > > build farm logged "$LLVM_CONFIG --version" somewhere.) > > > > It's not really the buildfarm script's responsibility to do that, > > but feel free to make configure do so. > > Done, copying the example of how we do it for perl and various other things. Build farm measles update: With that we can see that nicator (LLVM 15 on POWER) is green. We can see that cavefish (LLVM 6 on POWER) is red as expected. We can also see that bonito (LLVM 7 on POWER) is red, so my earlier theory that this might be due to the known bug we got fixed in LLVM 7 is not enough. Maybe there are other things fixed on POWER somewhere between those LLVM versions? I suspect it'll be hard to figure out without debug builds and backtraces. One thing is definitely our fault, though. xenodermus shows failures on REL_12_STABLE and REL_13_STABLE, using debug LLVM 6 on x86. I couldn't reproduce this locally on (newer) debug LLVM, so I bugged Andres for access to the host/libraries and chased it down. There is some type punning for a function parameter REL_13_STABLE and earlier, removed by Andres in REL_14_STABLE, and when I back-patched my refactoring I effectively back-patched a few changes from his commit df99ddc70b97 that removed the type punning, but I should have brought one more line from that commit to remove another trace of it. See attached.
Attachment
On Mon, Oct 23, 2023 at 01:15:04PM +1300, Thomas Munro wrote: > On Sun, Oct 22, 2023 at 2:44 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Sat, Oct 21, 2023 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Thomas Munro <thomas.munro@gmail.com> writes: > > > > (It'd be nice if the > > > > build farm logged "$LLVM_CONFIG --version" somewhere.) > > > > > > It's not really the buildfarm script's responsibility to do that, > > > but feel free to make configure do so. > > > > Done, copying the example of how we do it for perl and various other things. > > Build farm measles update: > > With that we can see that nicator (LLVM 15 on POWER) is green. We can > see that cavefish (LLVM 6 on POWER) is red as expected. We can also > see that bonito (LLVM 7 on POWER) is red, so my earlier theory that > this might be due to the known bug we got fixed in LLVM 7 is not > enough. Maybe there are other things fixed on POWER somewhere between > those LLVM versions? I suspect it'll be hard to figure out without > debug builds and backtraces. I haven't gotten around to disabling llvm on any of my animals with llvm < 7 yet. Do you still want to hold on that? > One thing is definitely our fault, though. xenodermus shows failures > on REL_12_STABLE and REL_13_STABLE, using debug LLVM 6 on x86. I > couldn't reproduce this locally on (newer) debug LLVM, so I bugged > Andres for access to the host/libraries and chased it down. There is > some type punning for a function parameter REL_13_STABLE and earlier, > removed by Andres in REL_14_STABLE, and when I back-patched my > refactoring I effectively back-patched a few changes from his commit > df99ddc70b97 that removed the type punning, but I should have brought > one more line from that commit to remove another trace of it. See > attached. Here are my list of llvm-config versions and distros for s390x and POWER (didn't see any issues on aarch64 but I grabbed all the info at the same time.) s390x: branta: 10.0.0 Ubuntu 20.04.4 LTS cotinga: 6.0.0 Ubuntu 18.04.6 LTS perch: 6.0.0 Ubuntu 18.04.6 LTS sarus: 14.0.0 Ubuntu 22.04.1 LTS aracari: 15.0.7 Red Hat Enterprise Linux 8.6 pipit: 15.0.7 Red Hat Enterprise Linux 8.6 lora: 15.0.7 Red Hat Enterprise Linux 9.2 mamushi: 15.0.7 Red Hat Enterprise Linux 9.2 pike: 11.0.1 Debian GNU/Linux 11 rinkhals: 11.0.1 Debian GNU/Linux 11 POWER: bonito: 7.0.1 Fedora 29 cavefish: 6.0.0 Ubuntu 18.04.6 LTS demoiselle: 5.0.1 openSUSE Leap 15.0 elasmobranch: 5.0.1 openSUSE Leap 15.0 babbler: 15.0.7 AlmaLinux 8.8 pytilia: 15.0.7 AlmaLinux 8.8 nicator: 15.0.7 AlmaLinux 9.2 twinspot: 15.0.7 AlmaLinux 9.2 cascabel: 11.0.1 Debian GNU/Linux 11 habu: 16.0.6 Fedora Linux 38 kingsnake: 16.0.6 Fedora Linux 38 krait: CentOS 16.0.6 Stream 8 lancehead: CentOS 16.0.6 Stream 8 aarch64: boiga: 14.0.5 Fedora Linux 36 corzo: 14.0.5 Fedora Linux 36 desman: 16.0.6 Fedora Linux 38 motmot: 16.0.6 Fedora Linux 38 whinchat: 11.0.1 Debian GNU/Linux 11 jackdaw: 11.0.1 Debian GNU/Linux 11 blackneck: 7.0.1 Debian GNU/Linux 10 alimoche: 7.0.1 Debian GNU/Linux 10 bulbul: 15.0.7 AlmaLinux 8.8 broadbill: 15.0.7 AlmaLinux 8.8 oystercatcher: 15.0.7 AlmaLinux 9.2 potoo: 15.0.7 AlmaLinux 9.2 whiting: 6.0.0 Ubuntu 18.04.5 LTS vimba: 6.0.0 Ubuntu 18.04.5 LTS splitfin: 10.0.0 Ubuntu 20.04.6 LTS rudd: 10.0.0 Ubuntu 20.04.6 LTS turbot: 14.0.0 Ubuntu 22.04.3 LTS shiner: 14.0.0 Ubuntu 22.04.3 LTS ziege: 16.0.6 CentOS Stream 8 chevrotain: 11.0.1 Debian GNU/Linux 11 Regards, Mark
On Tue, Oct 24, 2023 at 4:27 AM Mark Wong <markwkm@gmail.com> wrote: > I haven't gotten around to disabling llvm on any of my animals with llvm > < 7 yet. Do you still want to hold on that? Yes, please disable --with-llvm on s390x and POWER animals with LLVM < 7 (see below). Also, you have a bunch of machines with LLVM 16 that are failing to compile on REL_11_STABLE. That is expected, because we agreed not to back-patch the LLVM 16 API changes into REL_11_STABLE: > kingsnake: 16.0.6 Fedora Linux 38 > krait: CentOS 16.0.6 Stream 8 > lancehead: CentOS 16.0.6 Stream 8 These POWER machines fail as expected, and it's unfixable: > elasmobranch: 5.0.1 openSUSE Leap 15.0 > demoiselle: 5.0.1 openSUSE Leap 15.0 > cavefish: 6.0.0 Ubuntu 18.04.6 LTS (Well, we could in theory supply our own fixed llvm::orc::createLocalIndirectStubsManagerBuilder() function to hide the broken one in LLVM <= 6, but that way lies madness IMHO. An LTS distro that cares could look into back-patching LLVM's fixes, but for us, let us focus on current software.) This POWER animal fails, unexpectedly to me: > bonito: 7.0.1 Fedora 29 We could try to chase that down, or we could rejoice that at least it works on current release. It must begin working somewhere between 7 and 11, but when I checked which LLVM releases I could easily install on eg cascabel (if I could get access) using the repo at apt.llvm.org, I saw that they don't even have anything older than 11. So someone with access who wants to figure this out might have many days or weeks of compiling ahead of them. These POWER animals are passing, as expected: > cascabel: 11.0.1 Debian GNU/Linux 11 > babbler: 15.0.7 AlmaLinux 8.8 > pytilia: 15.0.7 AlmaLinux 8.8 > nicator: 15.0.7 AlmaLinux 9.2 > twinspot: 15.0.7 AlmaLinux 9.2 > habu: 16.0.6 Fedora Linux 38 > kingsnake: 16.0.6 Fedora Linux 38 > krait: CentOS 16.0.6 Stream 8 > lancehead: CentOS 16.0.6 Stream 8 These s390x animals are passing: > branta: 10.0.0 Ubuntu 20.04.4 LTS > pike: 11.0.1 Debian GNU/Linux 11 > rinkhals: 11.0.1 Debian GNU/Linux 11 > sarus: 14.0.0 Ubuntu 22.04.1 LTS These s390x animals are failing, but don't show the layout complaint. I can see that LLVM 6 also lacked a case for s390x in llvm::orc::createLocalIndirectStubsManagerBuilder(), the thing that was fixed in 7 with the addition of a default case. Therefore these presumably fail just like old LLVM on POWER, and it's unfixable. So I suggest turning off --with-llvm on these two: > cotinga: 6.0.0 Ubuntu 18.04.6 LTS > perch: 6.0.0 Ubuntu 18.04.6 LTS These s390x animals are failing with the mismatched layout problem, which should be fixed by Andres's patch to tolerate the changing z12/z13 ABI thing by falling back to whatever clang picked (at a cost of not using all the features of your newer CPU, unless you explicitly tell clang to target it): > aracari: 15.0.7 Red Hat Enterprise Linux 8.6 > pipit: 15.0.7 Red Hat Enterprise Linux 8.6 > lora: 15.0.7 Red Hat Enterprise Linux 9.2 This s390x animal doesn't actually have --with-llvm enabled so it passes, but surely it'd be just like lora: > mamushi: 15.0.7 Red Hat Enterprise Linux 9.2
On Tue, Oct 24, 2023 at 10:17:22AM +1300, Thomas Munro wrote: > On Tue, Oct 24, 2023 at 4:27 AM Mark Wong <markwkm@gmail.com> wrote: > > I haven't gotten around to disabling llvm on any of my animals with llvm > > < 7 yet. Do you still want to hold on that? > > Yes, please disable --with-llvm on s390x and POWER animals with LLVM < > 7 (see below). Also, you have a bunch of machines with LLVM 16 that > are failing to compile on REL_11_STABLE. That is expected, because we > agreed not to back-patch the LLVM 16 API changes into REL_11_STABLE: > > > kingsnake: 16.0.6 Fedora Linux 38 > > krait: CentOS 16.0.6 Stream 8 > > lancehead: CentOS 16.0.6 Stream 8 I should have updated these to not use --with-llvm for REL_11_STABLE. > These POWER machines fail as expected, and it's unfixable: > > > elasmobranch: 5.0.1 openSUSE Leap 15.0 > > demoiselle: 5.0.1 openSUSE Leap 15.0 > > cavefish: 6.0.0 Ubuntu 18.04.6 LTS These should now be updated to not use --with-llvm at all. > These s390x animals are failing, but don't show the layout complaint. > I can see that LLVM 6 also lacked a case for s390x in > llvm::orc::createLocalIndirectStubsManagerBuilder(), the thing that > was fixed in 7 with the addition of a default case. Therefore these > presumably fail just like old LLVM on POWER, and it's unfixable. So I > suggest turning off --with-llvm on these two: > > > cotinga: 6.0.0 Ubuntu 18.04.6 LTS > > perch: 6.0.0 Ubuntu 18.04.6 LTS Ok, I should have removed --with-llvm here too. > This s390x animal doesn't actually have --with-llvm enabled so it > passes, but surely it'd be just like lora: > > > mamushi: 15.0.7 Red Hat Enterprise Linux 9.2 Oops, I think I added it now. I think I made all the recommended changes, and trimmed out the lines where I didn't need to do anything. :) Andres pointed out to me that my animals aren't set up to collect core file so I'm also trying to update that too... Regards, Mark
Hi, On 2023-10-24 10:17:22 +1300, Thomas Munro wrote: > This POWER animal fails, unexpectedly to me: > > > bonito: 7.0.1 Fedora 29 > > We could try to chase that down, or we could rejoice that at least it > works on current release. It must begin working somewhere between 7 > and 11, but when I checked which LLVM releases I could easily install > on eg cascabel (if I could get access) using the repo at apt.llvm.org, > I saw that they don't even have anything older than 11. So someone > with access who wants to figure this out might have many days or weeks > of compiling ahead of them. I could reproduce the failure on bonito. The stack trace is: #0 0x00007fffb83541e8 in raise () from /lib64/libc.so.6 #1 0x00007fffb833448c in abort () from /lib64/libc.so.6 #2 0x00007fff9c68dd78 in std::__replacement_assert (_file=<optimized out>, _line=<optimized out>, _function=<optimized out>,_condition=<optimized out>) at /usr/include/c++/8/ppc64le-redhat-linux/bits/c++config.h:447 #3 0x00007fff9df90838 in std::unique_ptr<llvm::orc::JITCompileCallbackManager, std::default_delete<llvm::orc::JITCompileCallbackManager>>::operator* ( this=0x1b946cb8) at ../include/llvm/Support/MemAlloc.h:29 #4 llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&, std::function<std::unique_ptr<llvm::orc::IndirectStubsManager,std::default_delete<llvm::orc::IndirectStubsManager> > ()>)(this=0x1b946be0, TM=..., IndirectStubsMgrBuilder=...) at ../lib/ExecutionEngine/Orc/OrcCBindingsStack.h:242 #5 0x00007fff9df90940 in LLVMOrcCreateInstance (TM=0x1b933ae0) at /usr/include/c++/8/bits/move.h:182 #6 0x00007fffa0618f8c in llvm_session_initialize () at /home/andres/src/postgres/src/backend/jit/llvm/llvmjit.c:981 #7 0x00007fffa06179a8 in llvm_create_context (jitFlags=25) at /home/andres/src/postgres/src/backend/jit/llvm/llvmjit.c:219 #8 0x00007fffa0626cbc in llvm_compile_expr (state=0x1b8ef390) at /home/andres/src/postgres/src/backend/jit/llvm/llvmjit_expr.c:142 #9 0x0000000010a76fc8 in jit_compile_expr (state=0x1b8ef390) at /home/andres/src/postgres/src/backend/jit/jit.c:177 #10 0x0000000010404550 in ExecReadyExpr (state=0x1b8ef390) at /home/andres/src/postgres/src/backend/executor/execExpr.c:875 with this assertion message printed: /usr/include/c++/8/bits/unique_ptr.h:328: typename std::add_lvalue_reference<_Tp>::type std::unique_ptr<_Tp, Dp>::operator*()const [with Tp = llvm::orc::JITCompileCallbackManager; _Dp = std::default_delete<llvm::orc::JITCompileCallbackManager>;typename std::add_lvalue_reference<_Tp>::type = llvm::orc::JITCompileCallbackManager&]:Assertion 'get() != pointer()' failed. I wanted to use a debug build to investigate in more detail, but bonito is a small VM. Thus I built llvm 7 on a more powerful gcc cfarm machine, running on AlmaLinux 9.2. The problem doesn't reproduce there. Given the crash in some c++ standard library code, that the fc29 patches to llvm look harmless, that building/using llvm 7 on a newer distro does not show issues on PPC, it seems likely that this is a compiler / standard library issue. FC 29 is well out of support, so I don't think it makes sense to invest any further time in this. Personally, I don't think it's useful to have years old fedora in the buildfarm... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > FC 29 is well out of support, so I don't think it makes sense to invest any > further time in this. Personally, I don't think it's useful to have years old > fedora in the buildfarm... +1. It's good to test old LTS distros, but Fedora releases have a short shelf life by design. regards, tom lane
On Mon, Oct 23, 2023 at 10:47:24PM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > FC 29 is well out of support, so I don't think it makes sense to invest any > > further time in this. Personally, I don't think it's useful to have years old > > fedora in the buildfarm... > > +1. It's good to test old LTS distros, but Fedora releases have a > short shelf life by design. I'll start retiring those old Fedora ones I have. :) Regards, Mark