Re: BUG #16696: Backend crash in llvmjit - Mailing list pgsql-bugs

From Dmitry Marakasov
Subject Re: BUG #16696: Backend crash in llvmjit
Date
Msg-id 20201104235054.GB30304@hades.panopticon
In response to Re: BUG #16696: Backend crash in llvmjit  (Dmitry Marakasov <amdmi3@amdmi3.ru>)
List pgsql-bugs
* Dmitry Marakasov (amdmi3@amdmi3.ru) wrote:

> > > > Environment details:
> > > > - FreeBSD 12.1 amd64
> > > > - PostgreSQL 13.0 (built from FreeBSD ports)
> > > > - llvm-10.0.1 (built from FreeBSD ports)
> > > 
> > > My bad, it's actually llvm-9.0.1. Multiple llvm versions are installed on
> > > the system, and PostgreSQL uses llvm9:
> > > 
> > > ldd /usr/local/lib/postgresql/llvmjit.so | grep LLVM
> > >     libLLVM-9.so => /usr/local/llvm90/lib/libLLVM-9.so (0x800e00000)
> > 
> > Could you try generating a backtrace after turning jit_debugging_support on?
> > That might give a bit more information.
> > 
> > I'll check once I'm home whether I can reproduce in my environment.
> 
> I did some digging. First of all, I've discovered that the problem
> goes away if LLVM bitcode optimization is disabled (by commenting out
> the llvm_optimize_module call).
> 
> I've dumped the opcode and tried compiling it back to match the gdb
> disassembly of the failing function. It didn't match perfectly,
> but this place looked similar:
> 
> # %bb.84:                               # %op.32.inputcall
>     movq    %rax, 5267(%r13)
>     movb    %bl, 5275(%r13)
>     movb    $0, 5263(%r13)
>     movzbl  (%rax), %esi
>     movl    __mb_sb_limit(%rip), %edi
>     movq    _ThreadRuneLocale@GOTTPOFF(%rip), %rcx
>     movq    %fs:0, %rdx
>     movq    (%rdx,%rcx), %rcx
>     cmpl    %esi, %edi
>     movq    %rax, -96(%rbp)         # 8-byte Spill
>     movl    %edi, -72(%rbp)         # 4-byte Spill
>     movq    %rcx, -64(%rbp)         # 8-byte Spill
>     jle     .LBB1_85
> 
> Here's my hypothesis:
> 
> The problem happens when the boolin() function is inlined by LLVM.
> That function calls isspace() internally, which on FreeBSD is
> locale-specific and involves caching some locale parameters in
> a thread-local variable defined as
> 
> extern _Thread_local const _RuneLocale *_ThreadRuneLocale;
> 
> The execution crashes on trying to access the named thread-local variable,
> probably because something related to TLS is not set up properly in/for
> LLVM.
> 
> I've confirmed this hypothesis by disabling the isspace() calls in boolin(),
> which also fixed the problem.
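
To make the hypothesis above concrete, here is a minimal C sketch of a
ctype-style lookup that goes through a thread-local pointer. Everything
named demo_* or DEMO_* is hypothetical and made up for illustration; it is
not FreeBSD's actual _RuneLocale layout or isspace() implementation, and the
helper is only loosely shaped like the whitespace skipping in boolin(). The
point is that once such a helper is inlined, the TLS load ends up in the
caller's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for FreeBSD's _RuneLocale; the real layout differs. */
typedef struct DemoRuneLocale
{
    unsigned long runetype[256];    /* per-character class bits */
} DemoRuneLocale;

#define DEMO_CTYPE_SPACE 0x1UL

/* Process-wide default table, used when no per-thread locale is set. */
static const DemoRuneLocale demo_default_locale = {
    .runetype = {
        [' ']  = DEMO_CTYPE_SPACE,
        ['\t'] = DEMO_CTYPE_SPACE,
        ['\n'] = DEMO_CTYPE_SPACE,
        ['\v'] = DEMO_CTYPE_SPACE,
        ['\f'] = DEMO_CTYPE_SPACE,
        ['\r'] = DEMO_CTYPE_SPACE,
    }
};

/* Per-thread locale pointer; every read of it is a TLS access. */
static _Thread_local const DemoRuneLocale *demo_thread_locale = NULL;

/* Classify a character via the per-thread table, falling back to the default. */
static inline int
demo_isspace(int c)
{
    const DemoRuneLocale *rl = demo_thread_locale;  /* TLS load happens here */

    if (rl == NULL)
        rl = &demo_default_locale;
    if (c < 0 || c >= 256)
        return 0;
    return (rl->runetype[c] & DEMO_CTYPE_SPACE) != 0;
}

/* Count leading whitespace; after inlining, this function touches TLS too. */
static size_t
demo_count_leading_space(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (demo_isspace((unsigned char) s[n]))
        n++;
    return n;
}
```

Calling demo_count_leading_space("  \ttrue") walks three whitespace
characters, each classified through the thread-local pointer. In a JIT
setting the code that performs that load is emitted at run time, so it
depends on the JIT resolving TLS references correctly.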

Long story short, I was able to mitigate the crash with the following patch:

--- disable-inlining-tls-using-functions.patch begins here ---
commit f703544edc406293e39b7a59a245e798d18f458e
Author: Dmitry Marakasov <amdmi3@amdmi3.ru>
Date:   Thu Nov 5 02:56:00 2020 +0300

    Do not inline functions accessing TLS in LLVM JIT

diff --git src/backend/jit/llvm/llvmjit_inline.cpp src/backend/jit/llvm/llvmjit_inline.cpp
index 2617a46..a063edb 100644
--- src/backend/jit/llvm/llvmjit_inline.cpp
+++ src/backend/jit/llvm/llvmjit_inline.cpp
@@ -608,6 +608,16 @@ function_inlinable(llvm::Function &F,
         if (rv->materialize())
             elog(FATAL, "failed to materialize metadata");
 
+        /*
+         * Don't inline functions with thread-local variables until
+         * related crashes are investigated (see BUG #16696)
+         */
+        if (rv->isThreadLocal()) {
+            ilog(DEBUG1, "cannot inline %s due to thread-local variable %s",
+                F.getName().data(), rv->getName().data());
+            return false;
+        }
+
         /*
          * Never want to inline externally visible vars, cheap enough to
          * reference.
--- disable-inlining-tls-using-functions.patch ends here ---

I don't know LLVM well enough to investigate this further, but my guess
is that something TLS-related is not initialized properly.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:              https://github.com/AMDmi3



