Thread: Do we work with LLVM 12 on s390x?
The Red Hat folk are seeing a problem with that combination: https://bugzilla.redhat.com/show_bug.cgi?id=1940964 which boils down to > Build fails with this error: > ERROR: failed to JIT module: Added modules have incompatible data layouts: E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64(module) vs E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64(jit) (By "build", I imagine the reporter means "regression tests") So I was wondering if we'd tested it yet. regards, tom lane
Hi, On 2021-03-19 14:03:21 -0400, Tom Lane wrote: > The Red Hat folk are seeing a problem with that combination: > > https://bugzilla.redhat.com/show_bug.cgi?id=1940964 > > which boils down to > > > Build fails with this error: > > ERROR: failed to JIT module: Added modules have incompatible data layouts: E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64(module) vs E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64(jit) > > (By "build", I imagine the reporter means "regression tests") > > So I was wondering if we'd tested it yet. Yes, I did test it not too long ago, after Christoph Berg reported Debian s390x failing with jit. Which made me learn a bunch of s390x assembler and discover a bug in our code that only rarely happend (iirc something about booleans that are not exactly 0 or 1 not testing true)... https://www.postgresql.org/message-id/20201015222924.yyms42qjloydfvar%40alap3.anarazel.de I think the error above comes from a "mismatch" between the clang used to compile bitcode, and the LLVM version linked to. Normally we're somewhat tolerant of differences between the two, but there was an ABI change at some point, leading to that error. IIRC I hit that, but it vanished as soon as I used a matching libllvm and clang. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I think the error above comes from a "mismatch" between the clang used > to compile bitcode, and the LLVM version linked to. Normally we're > somewhat tolerant of differences between the two, but there was an ABI > change at some point, leading to that error. IIRC I hit that, but it > vanished as soon as I used a matching libllvm and clang. Thanks, I passed that advice on. regards, tom lane
On 3/19/21 8:15 PM, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> I think the error above comes from a "mismatch" between the clang used >> to compile bitcode, and the LLVM version linked to. Normally we're >> somewhat tolerant of differences between the two, but there was an ABI >> change at some point, leading to that error. IIRC I hit that, but it >> vanished as soon as I used a matching libllvm and clang. > > Thanks, I passed that advice on. > > regards, tom lane Tom Stellard was so kind to look at this issue deeper with his LLVM skills and found PostgreSQL is not actually handling the LLVM perfectly. He's working on improving the patch, but sharing even the first attempt with upstream seems like a good idea: https://src.fedoraproject.org/rpms/postgresql/pull-request/29 Regards, Honza
On 4/21/21 6:40 AM, Honza Horak wrote: > On 3/19/21 8:15 PM, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >>> I think the error above comes from a "mismatch" between the clang used >>> to compile bitcode, and the LLVM version linked to. Normally we're >>> somewhat tolerant of differences between the two, but there was an ABI >>> change at some point, leading to that error. IIRC I hit that, but it >>> vanished as soon as I used a matching libllvm and clang. >> >> Thanks, I passed that advice on. >> >> regards, tom lane > > Tom Stellard was so kind to look at this issue deeper with his LLVM skills and found PostgreSQL is not actually handlingthe LLVM perfectly. He's working on improving the patch, but sharing even the first attempt with upstream seems likea good idea: > > https://src.fedoraproject.org/rpms/postgresql/pull-request/29 > I wrote a new patch based on the bug discussion[1]. It works around the issue specifically on s390x rather than disabling specific CPUs and features for all targets. The patch is attached. [1] https://www.postgresql.org/message-id/flat/16971-5d004d34742a3d35%40postgresql.org > Regards, > Honza >
Attachment
On 4/22/21 6:35 PM, Tom Stellard wrote: > On 4/21/21 6:40 AM, Honza Horak wrote: >> On 3/19/21 8:15 PM, Tom Lane wrote: >>> Andres Freund <andres@anarazel.de> writes: >>>> I think the error above comes from a "mismatch" between the clang used >>>> to compile bitcode, and the LLVM version linked to. Normally we're >>>> somewhat tolerant of differences between the two, but there was an ABI >>>> change at some point, leading to that error. IIRC I hit that, but it >>>> vanished as soon as I used a matching libllvm and clang. >>> >>> Thanks, I passed that advice on. >>> >>> regards, tom lane >> >> Tom Stellard was so kind to look at this issue deeper with his LLVM >> skills and found PostgreSQL is not actually handling the LLVM >> perfectly. He's working on improving the patch, but sharing even the >> first attempt with upstream seems like a good idea: >> >> https://src.fedoraproject.org/rpms/postgresql/pull-request/29 >> > > I wrote a new patch based on the bug discussion[1]. It works around > the issue specifically on s390x rather than disabling specific > CPUs and features for all targets. The patch is attached. > > > [1] > https://www.postgresql.org/message-id/flat/16971-5d004d34742a3d35%40postgresql.org Thanks, Tom, it looks good in koji build, so merging so far. We very much appreciate your help here. Cheers, Honza > >> Regards, >> Honza >> >
On 4/22/21 3:25 PM, Andres Freund wrote: > Hi, > > On 2021-04-22 09:35:48 -0700, Tom Stellard wrote: >> On 4/21/21 6:40 AM, Honza Horak wrote: >> I wrote a new patch based on the bug discussion[1]. It works around >> the issue specifically on s390x rather than disabling specific >> CPUs and features for all targets. The patch is attached. > > Cool, this is a pretty clear improvement. There's a few minor things I'd > change to fit it into PG - do you mind if I send that to the thread at > [1] for you to test before I push it? > Sure, no problem. > >> +/* >> + * For the systemz target, LLVM uses a different datalayout for z13 and newer >> + * CPUs than it does for older CPUs. This can cause a mismatch in datalayouts >> + * in the case where the llvm_types_module is compiled with a pre-z13 CPU >> + * and the JIT is running on z13 or newer. >> + * See computeDataLayout() function in >> + * llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp for information on the >> + * datalayout differences. >> + */ >> +static bool >> +needs_systemz_workaround(void) >> +{ >> + bool ret = false; >> + LLVMContextRef llvm_context; >> + LLVMTypeRef vec_type; >> + LLVMTargetDataRef llvm_layoutref; >> + if (strncmp(LLVMGetTargetName(llvm_targetref), "systemz", strlen("systemz"))) >> + { >> + return false; >> + } >> + >> + llvm_context = LLVMGetModuleContext(llvm_types_module); >> + vec_type = LLVMVectorType(LLVMIntTypeInContext(llvm_context, 32), 4); >> + llvm_layoutref = LLVMCreateTargetData(llvm_layout); >> + ret = (LLVMABIAlignmentOfType(llvm_layoutref, vec_type) == 16); >> + LLVMDisposeTargetData(llvm_layoutref); >> + return ret; >> +} > > I wonder if it'd be better to compare LLVMCopyStringRepOfTargetData() of > the llvm_types_module with the one of the JIT target machine, and only > specify -vector in that case? We currently support older LLVM versions > than the one that introduced the vector specific handling for systemz, > and I don't know what'd happen if we unnecessarily specified -vector. > The problem is that you have to pass the features to LLVMCreateTargetMachine in order to know what the data layout of the JIT target is going to be, so the only way to make this work, would be to create the TargetMachine with the default features, check it's datalayout, and then re-create the TargetMachine in order to apply the workaround. Maybe that's not so bad? The other question I had is should we #ifdef ARCH_S390x in needs_sytemz_workaround(), so we don't need to check the target name. -Tom > Greetings, > > Andres Freund >