Thread: BUG #16971: Incompatible datalayout errors with llvmjit

BUG #16971: Incompatible datalayout errors with llvmjit

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16971
Logged by:          Tom Stellard
Email address:      tstellar@redhat.com
PostgreSQL version: 13.2
Operating system:   Fedora
Description:

In our Fedora builds, we are getting errors[1] in the postgresql tests due
to incompatible datalayouts between the JIT engine and the LLVM modules
being compiled.  The problem is that the JIT engine is being created with
host specific CPU and features, while the datalayout for the compiled module
is being taken from llvmjit_types.bc which is compiled without any specified
CPU type or features.

One way to fix this would be to add -march=native to the %.bc rules in
src/Makefile.global.in.  However, this will only work when the build system
and the run system are the same.

I think to fix this correctly, the llvmjit_types.bc file will need to be
compiled when the JIT engine is initialized at runtime, so that it can use
the same datalayout as the JIT engine.

[1] https://kojipkgs.fedoraproject.org//work/tasks/2182/66082182/build.log


Re: BUG #16971: Incompatible datalayout errors with llvmjit

From
Andres Freund
Date:
Hi,

Thanks to tgl for pointing me this thread...

On 2021-04-19 18:29:52 +0000, PG Bug reporting form wrote:
> In our Fedora builds, we are getting errors[1] in the postgresql tests due
> to incompatible datalayouts between the JIT engine and the LLVM modules
> being compiled.  The problem is that the JIT engine is being created with
> host specific CPU and features, while the datalayout for the compiled module
> is being taken from llvmjit_types.bc which is compiled without any specified
> CPU type or features.

It's very odd that features would change the data layout - analogizing
with plain C code that'd mean that you cannot link a binary compiled
with something like -mavx2 against a library compiled without. To me
this smells like a bug somewhere lower level.

Reformatting the error yields:
ERROR:  failed to JIT module: Added modules have incompatible data layouts:
E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-        a:8:16-n32:64 (module) vs
E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)

The -v128:64 is about how to align vectors. Skimming the relevant LLVM
code I don't see why it'd be included in JIted code but not native code.

Just to be sure, I take
checking for llvm-config... /usr/bin/llvm-config
checking for clang... /usr/bin/clang
are for llvm & clang compiled from the same code?

I temporarily had access to s390x to debug an unrelated issue in the
past, and there I did hit this problem, but IIRC only because of clang
vs llvm version mismatches.

Unfortunately it is a bit hard to debug without access to a s390x
box... Can you provide that? If not I might ask the Debian folks for
access to one of their porter machines.

FWIW, the Debian s390x build seems to succeed at the moment:
https://buildd.debian.org/status/fetch.php?pkg=postgresql-13&arch=s390x&ver=13.2-1&stamp=1613044202&raw=0


> One way to fix this would be to add -march=native to the %.bc rules in
> src/Makefile.global.in.  However, this will only work when the build system
> and the run system are the same.

> I think to fix this correctly, the llvmjit_types.bc file will need to be
> compiled when the JIT engine is initialized at runtime, so that it can use
> the same datalayout as the JIT engine.

That'd require headers to be present, which I don't think we should
require... And more importantly, it'd not go very far, because we also
have lot of other .bc files that will have the layout embedded (for
inlining functions/operators into JITed code). Which'd then require all
the source code to be present and to be compiled into bitcode. Not an,
uh, satisfying option.



> [1] https://kojipkgs.fedoraproject.org//work/tasks/2182/66082182/build.log

Random thing I noticed while scrolling through the log:
> configure: WARNING: unrecognized options: --disable-dependency-tracking

PG requires dependency tracking to be explicitly enabled, and it's a
different flag name (--enable-depend).

Greetings,

Andres Freund



Re: BUG #16971: Incompatible datalayout errors with llvmjit

From
Tom Stellard
Date:
On 4/20/21 12:29 PM, Andres Freund wrote:
> Hi,
> 
> Thanks to tgl for pointing me this thread...
> 
> On 2021-04-19 18:29:52 +0000, PG Bug reporting form wrote:
>> In our Fedora builds, we are getting errors[1] in the postgresql tests due
>> to incompatible datalayouts between the JIT engine and the LLVM modules
>> being compiled.  The problem is that the JIT engine is being created with
>> host specific CPU and features, while the datalayout for the compiled module
>> is being taken from llvmjit_types.bc which is compiled without any specified
>> CPU type or features.
> 
> It's very odd that features would change the data layout - analogizing
> with plain C code that'd mean that you cannot link a binary compiled
> with something like -mavx2 against a library compiled without. To me
> this smells like a bug somewhere lower level.
> 

You are correct that is odd, and to be honest, I didn't think that LLVM
targets were allowed to change the datalayout based on the CPU type.
I'm checking with upstream LLVM to see if this allowed or not.  However,
this behavior is present in at least LLVM 11 and LLVM 12 (I haven't
checked earlier versions), so postgresql will have to deal with this
somehow.

> Reformatting the error yields:
> ERROR:  failed to JIT module: Added modules have incompatible data layouts:
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-        a:8:16-n32:64 (module) vs
> E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)
> 
> The -v128:64 is about how to align vectors. Skimming the relevant LLVM
> code I don't see why it'd be included in JIted code but not native code.
> 

The relevant code in LLVM is here:
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp#L88

> Just to be sure, I take
> checking for llvm-config... /usr/bin/llvm-config
> checking for clang... /usr/bin/clang
> are for llvm & clang compiled from the same code?
> 

Yes, they are same.

> I temporarily had access to s390x to debug an unrelated issue in the
> past, and there I did hit this problem, but IIRC only because of clang
> vs llvm version mismatches.
> 
> Unfortunately it is a bit hard to debug without access to a s390x
> box... Can you provide that? If not I might ask the Debian folks for
> access to one of their porter machines.
> 

You can demonstrate this issue on an x86_64 machine using clang:

# Equivalent to how llvmjit_types.bc is compiled:
[root@8bbe06c6258e ~]# echo "void foo(){}" | clang --target=s390x--  -x c -S -emit-llvm - -o - | grep datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64"

# Equivalent to how JIT'd code is compiled:
[root@8bbe06c6258e ~]# echo "void foo(){}" | clang --target=s390x-- -march=z13 -x c -S -emit-llvm - -o - | grep
datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64"


> FWIW, the Debian s390x build seems to succeed at the moment:
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-13&arch=s390x&ver=13.2-1&stamp=1613044202&raw=0
> 

The Fedora builders are z13 systems.  If Debian is using zEC12, then this issue
would not happen, since zEC12 does not use the vector ABI:

echo "void foo(){}" | clang --target=s390x-- -march=zEC12 -x c -S -emit-llvm - -o - | grep datalayout
target datalayout = "E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-a:8:16-n32:64"

However, if someone were to install Debian postgresql on a z13 system, they
would likely run into this bug.

> 
>> One way to fix this would be to add -march=native to the %.bc rules in
>> src/Makefile.global.in.  However, this will only work when the build system
>> and the run system are the same.
> 
>> I think to fix this correctly, the llvmjit_types.bc file will need to be
>> compiled when the JIT engine is initialized at runtime, so that it can use
>> the same datalayout as the JIT engine.
> 
> That'd require headers to be present, which I don't think we should
> require... And more importantly, it'd not go very far, because we also
> have lot of other .bc files that will have the layout embedded (for
> inlining functions/operators into JITed code). Which'd then require all
> the source code to be present and to be compiled into bitcode. Not an,
> uh, satisfying option.
> 

OK, if that's not an option, then there needs to be a way to match the CPU
type used to compile llvmjit_types.bc and the CPU types used for the JIT
(at least for s390x anyway).  I looked at a few other targets, and it
doesn't look like they have this same problem (but I could be wrong).

For Fedora, I may try to hard-code the JIT CPU type to zEC12 on s390x builds,
to fix this.  I don't know if this would work as a general solution.

> 
> 
>> [1] https://kojipkgs.fedoraproject.org//work/tasks/2182/66082182/build.log
> 
> Random thing I noticed while scrolling through the log:
>> configure: WARNING: unrecognized options: --disable-dependency-tracking
> 
> PG requires dependency tracking to be explicitly enabled, and it's a
> different flag name (--enable-depend).
> 

OK, thanks.  I will pass this along to the postgresql maintainers.

-Tom

> Greetings,
> 
> Andres Freund
> 




Re: BUG #16971: Incompatible datalayout errors with llvmjit

From
Andres Freund
Date:
Hi,

On 2021-04-20 14:42:28 -0700, Tom Stellard wrote:
> On 4/20/21 12:29 PM, Andres Freund wrote:
> > On 2021-04-19 18:29:52 +0000, PG Bug reporting form wrote:
> > > In our Fedora builds, we are getting errors[1] in the postgresql tests due
> > > to incompatible datalayouts between the JIT engine and the LLVM modules
> > > being compiled.  The problem is that the JIT engine is being created with
> > > host specific CPU and features, while the datalayout for the compiled module
> > > is being taken from llvmjit_types.bc which is compiled without any specified
> > > CPU type or features.
> >
> > It's very odd that features would change the data layout - analogizing
> > with plain C code that'd mean that you cannot link a binary compiled
> > with something like -mavx2 against a library compiled without. To me
> > this smells like a bug somewhere lower level.

> You are correct that is odd, and to be honest, I didn't think that LLVM
> targets were allowed to change the datalayout based on the CPU type.

That was my impression...


> > Reformatting the error yields:
> > ERROR:  failed to JIT module: Added modules have incompatible data layouts:
> > E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-        a:8:16-n32:64 (module) vs
> > E-m:e-i1:8:16-i8:8:16-i64:64-f128:64-v128:64-a:8:16-n32:64 (jit)
> >
> > The -v128:64 is about how to align vectors. Skimming the relevant LLVM
> > code I don't see why it'd be included in JIted code but not native code.

> The relevant code in LLVM is here:
> https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp#L88

Thanks for the pointer!


> I'm checking with upstream LLVM to see if this allowed or not.  However,
> this behavior is present in at least LLVM 11 and LLVM 12 (I haven't
> checked earlier versions), so postgresql will have to deal with this
> somehow.

Yea, seems we need to add a workaround for the issue, given how much
longer LLVM releases tend to be used than they are maintained. One
simple hack would be to add "-vector" to the list of features on s390x,
which afaict should avoid the issue for now?

In LLVM's main branch the code is this:

// Determine whether we use the vector ABI.
static bool UsesVectorABI(StringRef CPU, StringRef FS) {
  // We use the vector ABI whenever the vector facility is avaiable.
  // This is the case by default if CPU is z13 or later, and can be
  // overridden via "[+-]vector" feature string elements.
  bool VectorABI = true;
  bool SoftFloat = false;
  if (CPU.empty() || CPU == "generic" ||
      CPU == "z10" || CPU == "z196" || CPU == "zEC12" ||
      CPU == "arch8" || CPU == "arch9" || CPU == "arch10")
    VectorABI = false;

  SmallVector<StringRef, 3> Features;
  FS.split(Features, ',', -1, false /* KeepEmpty */);
  for (auto &Feature : Features) {
    if (Feature == "vector" || Feature == "+vector")
      VectorABI = true;
    if (Feature == "-vector")
      VectorABI = false;
    if (Feature == "soft-float" || Feature == "+soft-float")
      SoftFloat = true;
    if (Feature == "-soft-float")
      SoftFloat = false;
  }

  return VectorABI && !SoftFloat;
}

So appending -vector should be sufficient?


But we'd have to do so only after checking that there's a data layout
mismatch, because otherwise we'd just create a new problem if somebody
compiles with -march=native or such.


Greetings,

Andres Freund