Re: Don't clean up LLVM state when exiting in a bad way - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Don't clean up LLVM state when exiting in a bad way |
Date | |
Msg-id | 0B6E6A15-4611-451B-936C-E8846CC8E847@anarazel.de |
In response to | Re: Don't clean up LLVM state when exiting in a bad way (Alexander Lakhin <exclusion@gmail.com>) |
Responses | Re: Don't clean up LLVM state when exiting in a bad way |
List | pgsql-hackers |
Hi,

On September 13, 2021 9:00:00 PM PDT, Alexander Lakhin <exclusion@gmail.com> wrote:
>Hello hackers,
>14.09.2021 04:32, Andres Freund wrote:
>> On 2021-09-07 14:44:39 -0500, Justin Pryzby wrote:
>>> On Tue, Sep 07, 2021 at 12:27:27PM -0700, Andres Freund wrote:
>>>> I think this is a tad too strong. We should continue to clean up on exit as
>>>> long as the error didn't happen while we're already inside llvm
>>>> code. Otherwise we loose some ability to find leaks. How about checking in the
>>>> error path whether fatal_new_handler_depth is > 0, and skipping cleanup in
>>>> that case? Because that's precisely when it should be unsafe to reenter
>>>> LLVM.
>> The more important reason is actually profiling information that needs to be
>> written out.
>>
>> I've now pushed a fix to all relevant branches. Thanks all!
>>
>I've encountered similar issue last week, but found this discussion only
>after the commit.
>I'm afraid that it's not completely gone yet. I've reproduced a similar
>crash (on edb4d95d) with
>echo "statement_timeout = 50
>jit_optimize_above_cost = 1
>jit_inline_above_cost = 1
>parallel_setup_cost=0
>parallel_tuple_cost=0
>" >/tmp/extra.config
>TEMP_CONFIG=/tmp/extra.config make check
>
>parallel group (11 tests): memoize explain hash_part partition_info
>reloptions tuplesort compression partition_aggregate indexing
>partition_prune partition_join
>     partition_join    ... FAILED (test process exited with exit code 2)  1815 ms
>     partition_prune   ... FAILED (test process exited with exit code 2)  1779 ms
>     reloptions        ... ok                                              146 ms
>
>I've extracted the crash-causing fragment from the partition_prune test
>to reproduce the segfault reliably (see the patch attached).
>The segfault stack is:
>Core was generated by `postgres: parallel worker for PID 12029 '.
>Program terminated with signal 11, Segmentation fault.
>#0  0x00007f045e0a88ca in notifyFreed (K=<optimized out>, Obj=..., this=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:485
>485             Listener->NotifyFreeingObject(Obj);
>(gdb) bt
>#0  0x00007f045e0a88ca in notifyFreed (K=<optimized out>, Obj=..., this=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:485
>#1  operator() (K=<optimized out>, Obj=..., __closure=<optimized out>)
>    at /usr/src/debug/llvm-7.0.1.src/lib/ExecutionEngine/Orc/OrcCBindingsStack.h:226
>#2  std::_Function_handler<void (unsigned long, llvm::object::ObjectFile const&),
>    llvm::OrcCBindingsStack::OrcCBindingsStack(llvm::TargetMachine&,
>    std::function<std::unique_ptr<llvm::orc::IndirectStubsManager,
>    std::default_delete<llvm::orc::IndirectStubsManager> > ()>)::{lambda(unsigned long,
>    llvm::object::ObjectFile const&)#3}>::_M_invoke(std::_Any_data const&, unsigned long,
>    llvm::object::ObjectFile const&) (__functor=..., __args#0=<optimized out>, __args#1=...)
>    at /usr/include/c++/4.8.2/functional:2071
>#3  0x00007f045e0aa578 in operator() (__args#1=..., __args#0=<optimized out>, this=<optimized out>)
>    at /usr/include/c++/4.8.2/functional:2471
>...
>
>The corresponding code in OrcCBindingsStack.h is:
>void notifyFreed(orc::VModuleKey K, const object::ObjectFile &Obj) {
>  for (auto &Listener : EventListeners)
>    Listener->NotifyFreeingObject(Obj);
>}
>So probably one of the EventListeners has become null. I see that
>without debugging and profiling enabled the only listener registration
>in the postgres code is LLVMOrcRegisterJITEventListener.
>
>With LLVM 9 on the same Centos 7 I don't get such segfault. Also it
>doesn't happen on different OSes with LLVM 7.

That sounds just like an llvm bug to me, rather than the usage issue addressed in this thread.

>I still have no
>explanation for that, but maybe there is difference between LLVM
>configure options, e.g. like this:
>https://stackoverflow.com/questions/47712670/segmentation-fault-in-llvm-pass-when-using-registerstandardpasses

Why is it not much more likely that bugs were fixed?

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
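To make the suggestion quoted upthread concrete: the idea is that PostgreSQL's exit-time LLVM cleanup should be skipped when the fatal error was raised while already executing inside LLVM, which is what the fatal_new_handler_depth counter tracks. The following is only a minimal sketch of that pattern under those assumptions, not the committed PostgreSQL code; apart from fatal_new_handler_depth, every name in it (enter_llvm, leave_llvm, release_jit_state, llvm_shutdown_sketch) is illustrative.

/* Sketch only -- illustrative names, not the committed code. */
static int fatal_new_handler_depth = 0;    /* incremented while calling into LLVM */

static void enter_llvm(void)  { fatal_new_handler_depth++; }
static void leave_llvm(void)  { fatal_new_handler_depth--; }

/* Stand-in for the real cleanup (dispose of the ORC stack, flush profiling data). */
static void
release_jit_state(void)
{
}

/*
 * Exit-path cleanup.  If the error that got us here was raised from inside
 * LLVM (depth > 0), reentering LLVM to free its state is unsafe, so skip the
 * cleanup and let process exit reclaim the memory.  Otherwise clean up
 * normally.
 */
static void
llvm_shutdown_sketch(void)
{
	if (fatal_new_handler_depth > 0)
		return;

	release_jit_state();
}

As noted upthread, keeping the cleanup on the ordinary error path matters both for leak detection and for writing out profiling information; the guard only skips it in the one case where reentering LLVM is known to be unsafe.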