Thread: seawasp failing, maybe in glibc allocator
Hi, Since seawasp's bleeding-edge clang moved to "20210226", it failed every run except 4, and a couple of days ago it moved to "20210508" and it's still broken. It's always like this: 2021-05-09 03:31:37.602 CEST [1678796:171] pg_regress/_int LOG: statement: RESET enable_seqscan; corrupted double-linked list ... which doesn't appear in our code, but matches this: https://github.com/bminor/glibc/blob/cedbf6d5f3f70ca911176de87d6e453eeab4b7a1/malloc/malloc.c#L1645 No reason to think it's our fault, but it'd be nice to see a backtrace. Is gdb installed, and are core files being dumped by that SIGABRT, and are they using the default name (/proc/sys/kernel/core_pattern = core), which the BF can find with the value it's using, namely 'core_file_glob' => 'core*'?
Hello Thomas, > Since seawasp's bleeding-edge clang moved to "20210226", it failed > every run except 4, and a couple of days ago it moved to "20210508" > and it's still broken. Indeed I have noticed that there is indeed an issue, but the investigation is not very high on my current too deep pg-unrelated todo list. > It's always like this: > > 2021-05-09 03:31:37.602 CEST [1678796:171] pg_regress/_int LOG: > statement: RESET enable_seqscan; > corrupted double-linked list > > ... which doesn't appear in our code, but matches this: > > https://github.com/bminor/glibc/blob/cedbf6d5f3f70ca911176de87d6e453eeab4b7a1/malloc/malloc.c#L1645 > No reason to think it's our fault, but it'd be nice to see a > backtrace. ISTM it looks like some kind of memory corruption. If I'd have to guess between glibc, clang and pg, not sure which one I'd chose between the two laters potential bug sources. > Is gdb installed, and are core files being dumped by that SIGABRT, and > are they using the default name (/proc/sys/kernel/core_pattern = core), > which the BF can find with the value it's using, namely 'core_file_glob' > => 'core*'? Nope: sh> cat /proc/sys/kernel/core_pattern |/usr/share/apport/apport %p %s %c %d %P %E -- Fabien.
On Mon, May 10, 2021 at 6:59 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > Is gdb installed, and are core files being dumped by that SIGABRT, and > > are they using the default name (/proc/sys/kernel/core_pattern = core), > > which the BF can find with the value it's using, namely 'core_file_glob' > > => 'core*'? > > Nope: > > sh> cat /proc/sys/kernel/core_pattern > |/usr/share/apport/apport %p %s %c %d %P %E If you don't care about Ubuntu "apport" on this system (something for sending crash/bug reports to developers with a GUI), you could uninstall it (otherwise it overwrites the core_pattern every time it restarts, no matter what you write in your sysctl.conf, apparently), and then sudo sysctl -w kernel.core_pattern=core to undo the setting immediately (or reboot). Then hopefully the build farm would succeed in dumping a backtrace into the log.
> If you don't care about Ubuntu "apport" on this system (something for > sending crash/bug reports to developers with a GUI), you could > uninstall it (otherwise it overwrites the core_pattern every time it > restarts, no matter what you write in your sysctl.conf, apparently), > and then sudo sysctl -w kernel.core_pattern=core to undo the setting > immediately (or reboot). Then hopefully the build farm would succeed > in dumping a backtrace into the log. I forced-removed apport (which meant removing xserver-xorg). Let's see whether the reports are better or whether I break something. -- Fabien.
On Mon, May 10, 2021 at 9:30 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > I forced-removed apport (which meant removing xserver-xorg). Let's see > whether the reports are better or whether I break something. And of course this time it succeeded :-) Just by the way, I noticed it takes ~40 minutes to compile. Is there a reason you don't install ccache and set eg CC="ccache /path/to/clang", CXX="ccache /path/to/clang++", CLANG="ccache /path/to/clang"? That doesn't seem to conflict with your goal of testing LLVM/Clang's main branch, because ccache checks the compiler's mtime as part of its cache invalidation strategy.
> And of course this time it succeeded :-) Hmmm. ISTM that failures are on and off every few attempts. > Just by the way, I noticed it takes ~40 minutes to compile. Is there > a reason you don't install ccache and set eg CC="ccache > /path/to/clang", CXX="ccache /path/to/clang++", CLANG="ccache > /path/to/clang"? That doesn't seem to conflict with your goal of > testing LLVM/Clang's main branch, because ccache checks the compiler's > mtime as part of its cache invalidation strategy. Yep. I remember that I disactivated it for some reason once, but I cannot remember why. I just reactivated it, will see what happens. -- Fabien.
On Mon, May 10, 2021 at 11:21 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > > And of course this time it succeeded :-) > > Hmmm. ISTM that failures are on and off every few attempts. OK we got the SIGABRT this time, but still no backtrace. If the kernel's core_pattern is "core", gdb is installed, then considering that the buildfarm core_file_glob is "core*" and the script version is recent (REL_12), then I'm out of ideas. ulimit -c XXX shouldn't be needed because the perl script does that with rlimit.
On 2021-05-11 12:16:44 +1200, Thomas Munro wrote: > OK we got the SIGABRT this time, but still no backtrace. If the > kernel's core_pattern is "core", gdb is installed, then considering > that the buildfarm core_file_glob is "core*" and the script version is > recent (REL_12), then I'm out of ideas. ulimit -c XXX shouldn't be > needed because the perl script does that with rlimit. Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show that.
> On 2021-05-11 12:16:44 +1200, Thomas Munro wrote: >> OK we got the SIGABRT this time, but still no backtrace. If the >> kernel's core_pattern is "core", gdb is installed, then considering >> that the buildfarm core_file_glob is "core*" and the script version is >> recent (REL_12), then I'm out of ideas. ulimit -c XXX shouldn't be >> needed because the perl script does that with rlimit. > > Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show > that. Possibly I have just added "ulimit -c unlimited" in the script, we should see the effect on next round. -- Fabien.
On 2021-05-11 10:22:02 +0200, Fabien COELHO wrote: > > > On 2021-05-11 12:16:44 +1200, Thomas Munro wrote: > > > OK we got the SIGABRT this time, but still no backtrace. If the > > > kernel's core_pattern is "core", gdb is installed, then considering > > > that the buildfarm core_file_glob is "core*" and the script version is > > > recent (REL_12), then I'm out of ideas. ulimit -c XXX shouldn't be > > > needed because the perl script does that with rlimit. > > > > Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show > > that. > > Possibly I have just added "ulimit -c unlimited" in the script, we should > see the effect on next round. If it's the hard limit that won't help, because the hard limit can only be increased by a privileged process.
Hello Andres,p >>> Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show >>> that. >> >> Possibly I have just added "ulimit -c unlimited" in the script, we should >> see the effect on next round. > > If it's the hard limit that won't help, because the hard limit can only > be increased by a privileged process. It seems to be the soft one, so ISTM that it may work. Before: sh> ulimit -c -S 0 sh> ulimit -c -H unlimited Then after: sh> ulimit -c unlimited I have: sh> ulimit -c -S unlimited sh> ulimit -c -H unlimited -- Fabien.
> Possibly I have just added "ulimit -c unlimited" in the script, we should see > the effect on next round. for def5b065 it ended on on the contrib ltree test: 2021-05-12 20:12:52.528 CEST [3042602:410] pg_regress/ltree LOG: disconnection: session time: 0:00:13.426 user=buildfarm database=contrib_regression_ltree host=[local] /home/fabien/pg/build-farm-12/buildroot/HEAD/pgsql.build/contrib/ltree/results/ltree.out 2021-05-12 20:12:52.523330311 +0200 @@ -7931,11 +7931,8 @@ (1 row) SELECT count(*) FROM _ltreetest WHERE t ~ '1.1.1.*' ; - count -------- - 19 -(1 row) - +ERROR: stack depth limit exceeded +HINT: Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stackdepth limit is adequate. SELECT count(*) FROM _ltreetest WHERE t ~ '*.1' ; count ------- Not a core dump, though. -- Fabien.
Hello Andres, It finally failed with a core on 8f72bba, in llvm_shutdown, AFAIKS in a free while doing malloc-related housekeeping. My guess is that there is an actual memory corruption somewhere. It is unobvious whether it is in bleeding-edge llvm or bleeding-edge postgres though. The issue is non-deterministically triggered in contrib checks, either in int or ltree, but not elsewhere. This suggests issues specific to these modules, or triggered by these modules. Hmmm… I've just launched a run with valgrind enabled. -- Fabien.
On Sat, May 15, 2021 at 6:41 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > The issue is non-deterministically triggered in contrib checks, either in > int or ltree, but not elsewhere. This suggests issues specific to these > modules, or triggered by these modules. Hmmm… Hmm, yeah. A couple of different ways that ltreetest fails without crashing: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-13%2001%3A17%3A15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-12%2017%3A17%3A15 Otherwise it's mostly free() blowing up, and one case of an assertion failure in llvm::StringMapImpl::RemoveKey, I guess before free() is reached: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-14%2009%3A17%3A15
>> The issue is non-deterministically triggered in contrib checks, either in >> int or ltree, but not elsewhere. This suggests issues specific to these >> modules, or triggered by these modules. Hmmm… > > Hmm, yeah. A couple of different ways that ltreetest fails without crashing: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-13%2001%3A17%3A15 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-12%2017%3A17%3A15 > > Otherwise it's mostly free() blowing up, and one case of an assertion > failure in llvm::StringMapImpl::RemoveKey, I guess before free() is > reached: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-14%2009%3A17%3A15 It seems that the upload of the valgrind run (many hours…) failed on "413 request entity too large", and everything seems to have been cleaned despite the "--keepall" I think I put when I started the run. Not sure about the best way to proceed. -- Fabien.
On Wed, May 19, 2021 at 5:02 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote: > It seems that the upload of the valgrind run (many hours…) failed on "413 > request entity too large", and everything seems to have been cleaned > despite the "--keepall" I think I put when I started the run. I installed Clang/LLVM version "1:13~++20210520071732+02f2d739e074-1~exp1~20210520052519.57" from https://apt.llvm.org/ on a Debian buster box, and I saw that contrib/ltree's test fail about half the time with a range of weird and wonderful outputs (wrong answers) similar to seawasp, but it never crashed. I ran it under valgrind and I managed to get: ==529250== Invalid read of size 8 ==529250== at 0x1475B6CA: void std::vector<llvm::orc::SymbolStringPtr, std::allocator<llvm::orc::SymbolStringPtr> >::_M_realloc_insert<llvm::orc::SymbolStringPtr const&>(__gnu_cxx::__normal_iterator<llvm::orc::SymbolStringPtr*, std::vector<llvm::orc::SymbolStringPtr, std::allocator<llvm::orc::SymbolStringPtr> > >, llvm::orc::SymbolStringPtr const&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x1474F246: llvm::orc::JITDylib::removeTracker(llvm::orc::ResourceTracker&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14741E0A: llvm::orc::ExecutionSession::removeResourceTracker(llvm::orc::ResourceTracker&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14747421: llvm::orc::JITDylib::clear() (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14751F5B: llvm::orc::ExecutionSession::endSession() (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14785207: llvm::orc::LLJIT::~LLJIT() (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x147A4C7D: LLVMOrcDisposeLLJIT (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x124D0094: llvm_shutdown (llvmjit.c:892) ==529250== by 0x57BC08: proc_exit_prepare (ipc.c:209) ==529250== by 0x57BC77: proc_exit (ipc.c:107) ==529250== by 0x5A545B: PostgresMain (postgres.c:4700) ==529250== by 0x517569: BackendRun (postmaster.c:4491) ==529250== by 0x517569: BackendStartup (postmaster.c:4213) ==529250== by 0x517569: ServerLoop (postmaster.c:1745) ==529250== Address 0x1a969088 is 1,416 bytes inside a block of size 3,072 free'd ==529250== at 0x4839EAB: operator delete(void*) (vg_replace_malloc.c:584) ==529250== by 0x141DFD8E: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141DD9C0: llvm::LazyValueInfo::releaseMemory() (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13294832: llvm::PMDataManager::freePass(llvm::Pass*, llvm::StringRef, llvm::PassDebuggingString) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13294774: llvm::PMDataManager::removeDeadPasses(llvm::Pass*, llvm::StringRef, llvm::PassDebuggingString) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13290A69: llvm::FPPassManager::runOnFunction(llvm::Function&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14137632: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13290FDE: llvm::legacy::PassManagerImpl::run(llvm::Module&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x131F8625: LLVMRunPassManager (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x124D0772: llvm_optimize_module (llvmjit.c:620) ==529250== by 0x124D0772: llvm_compile_module (llvmjit.c:671) ==529250== by 0x124D0772: llvm_get_function (llvmjit.c:291) ==529250== by 0x124DA3BC: ExecRunCompiledExpr (llvmjit_expr.c:2402) ==529250== by 0x41D15C: ExecEvalExprSwitchContext (executor.h:339) ==529250== by 0x41D15C: ExecQual (executor.h:408) ==529250== by 0x41D15C: ExecScan (execScan.c:227) ==529250== Block was alloc'd at ==529250== at 0x4838DEF: operator new(unsigned long) (vg_replace_malloc.c:342) ==529250== by 0x141E47D6: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141E43B7: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141E3202: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141E0501: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141DDD74: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x141DDC4C: llvm::LazyValueInfo::getConstant(llvm::Value*, llvm::Instruction*) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13D0EE4B: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13D11D42: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x1329098C: llvm::FPPassManager::runOnFunction(llvm::Function&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x14137632: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== by 0x13290FDE: llvm::legacy::PassManagerImpl::run(llvm::Module&) (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1) ==529250== Maybe they should rewrite LLVM in Rust.
Hi, On 2021-05-21 21:37:22 +1200, Thomas Munro wrote: > I installed Clang/LLVM version > "1:13~++20210520071732+02f2d739e074-1~exp1~20210520052519.57" from > https://apt.llvm.org/ on a Debian buster box, and I saw that > contrib/ltree's test fail about half the time with a range of weird > and wonderful outputs (wrong answers) similar to seawasp, but it never > crashed. I ran it under valgrind and I managed to get: Interesting. I tried this with a slightly older LLVM checkout (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an issue, even if I force everything to be jitted+inlined+optimized. The git hash in the package version indicates the commit is from 2021-05-20. Upgrading my local checkout to see whether I can repro the problem. If I can we at least have a not too large bisection window... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Interesting. I tried this with a slightly older LLVM checkout > (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an > issue, even if I force everything to be jitted+inlined+optimized. The > git hash in the package version indicates the commit is from > 2021-05-20. Upgrading my local checkout to see whether I can repro the > problem. If I can we at least have a not too large bisection window... We know that seawasp was okay as of configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git f22d3813850f9e87c5204df6844a93b8c5db7730) and not okay as of configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git 0e8f5e4a6864839d2292ec1ddfe48b6178c01f85) so that should correspond to a window of about a week, if I gather Fabien's update strategy correctly. regards, tom lane
Hi, On 2021-05-21 14:58:38 -0700, Andres Freund wrote: > Interesting. I tried this with a slightly older LLVM checkout > (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an > issue, even if I force everything to be jitted+inlined+optimized. The > git hash in the package version indicates the commit is from > 2021-05-20. Upgrading my local checkout to see whether I can repro the > problem. If I can we at least have a not too large bisection window... After resolving some PEBKAC issues I was able to reproduce the issue after a git pull for fresh llvm sources and rebuilding. Hope I can narrow it down without needing to bisect, there's faster building projects than LLVM... Greetings, Andres Freund
Hi, On 2021-05-21 18:18:54 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Interesting. I tried this with a slightly older LLVM checkout > > (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an > > issue, even if I force everything to be jitted+inlined+optimized. The > > git hash in the package version indicates the commit is from > > 2021-05-20. Upgrading my local checkout to see whether I can repro the > > problem. If I can we at least have a not too large bisection window... > > We know that seawasp was okay as of > > configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git f22d3813850f9e87c5204df6844a93b8c5db7730) > > and not okay as of > > configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git 0e8f5e4a6864839d2292ec1ddfe48b6178c01f85) > > so that should correspond to a window of about a week, if I gather > Fabien's update strategy correctly. I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671). Contacting the author and reading the change to see if I can spit the issue myself. Greetings, Andres Freund
Hi, On 2021-05-21 15:57:01 -0700, Andres Freund wrote: > I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671). > Contacting the author and reading the change to see if I can spit the > issue myself. Hrmpf. It's a silent API breakage. The author intended to email us about it, but apparently forgot. One now needs to increment a string-pool refcount. The reason that didn't trigger a reliable crash is that there's a path where the refcount of string-pool entries aren't asserted to be above before decrementing the refcount... And that there practically never are references to the pool entries after use. Continuing to discusss whether there's a better way to deal with this. Greetings, Andres Freund
> We know that seawasp was okay as of > > configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git f22d3813850f9e87c5204df6844a93b8c5db7730) > > and not okay as of > > configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git 0e8f5e4a6864839d2292ec1ddfe48b6178c01f85) > > so that should correspond to a window of about a week, if I gather > Fabien's update strategy correctly. Indeed, the full recompilation is triggered once a week, on Saturdays. If the build fails for some reason the previous version is kept and I may or may not have time to look at it and try to fix it for some time after that, or I may happen that I do not notice the issue for some time… -- Fabien.
On Sat, May 22, 2021 at 12:25 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-05-21 15:57:01 -0700, Andres Freund wrote: > > I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671). > > Contacting the author and reading the change to see if I can spit the > > issue myself. > > Hrmpf. It's a silent API breakage. The author intended to email us about > it, but apparently forgot. One now needs to increment a string-pool > refcount. The reason that didn't trigger a reliable crash is that > there's a path where the refcount of string-pool entries aren't asserted > to be above before decrementing the refcount... And that there > practically never are references to the pool entries after use. > > Continuing to discusss whether there's a better way to deal with this. Any news? FWIW this change appears to fix the problem for my system (LLVM 13 build from a couple of days ago). No more weird results, valgrind errors gone. I ran the leak checker to see if I now had the opposite problem, and although there are various leaks reported, I didn't see obvious intern pool related stacks. diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c index 71029a39a9..7b09e520f5 100644 --- a/src/backend/jit/llvm/llvmjit.c +++ b/src/backend/jit/llvm/llvmjit.c @@ -1116,6 +1116,11 @@ llvm_resolve_symbols(LLVMOrcDefinitionGeneratorRef GeneratorObj, void *Ctx, if (error != LLVMErrorSuccess) LLVMOrcDisposeMaterializationUnit(mu); +#if LLVM_VERSION_MAJOR > 12 + for (int i = 0; i < LookupSetSize; i++) + LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name); +#endif + pfree(symbols); return error;
On Sat, Jun 19, 2021 at 5:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: > if (error != LLVMErrorSuccess) > LLVMOrcDisposeMaterializationUnit(mu); > > +#if LLVM_VERSION_MAJOR > 12 > + for (int i = 0; i < LookupSetSize; i++) > + LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name); > +#endif (Though, erm, that code probably either needs to move a bit further up or become conditional, considering the error case immediately above it, not sure which...)
Thomas Munro <thomas.munro@gmail.com> writes: > On Sat, Jun 19, 2021 at 5:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> if (error != LLVMErrorSuccess) >> LLVMOrcDisposeMaterializationUnit(mu); >> >> +#if LLVM_VERSION_MAJOR > 12 >> + for (int i = 0; i < LookupSetSize; i++) >> + LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name); >> +#endif > (Though, erm, that code probably either needs to move a bit further up > or become conditional, considering the error case immediately above > it, not sure which...) Is a compile-time conditional really going to be reliable? See nearby arguments about compile-time vs run-time checks for libpq features. It's not clear to me how tightly LLVM binds its headers and running code. regards, tom lane
Hi, On 2021-06-19 17:07:51 +1200, Thomas Munro wrote: > On Sat, May 22, 2021 at 12:25 PM Andres Freund <andres@anarazel.de> wrote: > > On 2021-05-21 15:57:01 -0700, Andres Freund wrote: > > > I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671). > > > Contacting the author and reading the change to see if I can spit the > > > issue myself. > > > > Hrmpf. It's a silent API breakage. The author intended to email us about > > it, but apparently forgot. One now needs to increment a string-pool > > refcount. The reason that didn't trigger a reliable crash is that > > there's a path where the refcount of string-pool entries aren't asserted > > to be above before decrementing the refcount... And that there > > practically never are references to the pool entries after use. > > > > Continuing to discusss whether there's a better way to deal with this. > > Any news? Nothing really. I'd discussed it a bit with the relevant LLVM maintainer, but we didn't come to a real resolution. He apologized for not giving a proper heads up - he'd planned to send out an email, but somehow lost track of that. Given that any potential solution would be also end up being a versioned ifdef, I think adding something like what you suggest here is the least unreasonable solution. > FWIW this change appears to fix the problem for my system (LLVM 13 > build from a couple of days ago). No more weird results, valgrind > errors gone. I ran the leak checker to see if I now had the opposite > problem, and although there are various leaks reported, I didn't see > obvious intern pool related stacks. > > diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c > index 71029a39a9..7b09e520f5 100644 > --- a/src/backend/jit/llvm/llvmjit.c > +++ b/src/backend/jit/llvm/llvmjit.c > @@ -1116,6 +1116,11 @@ > llvm_resolve_symbols(LLVMOrcDefinitionGeneratorRef GeneratorObj, void > *Ctx, > if (error != LLVMErrorSuccess) > LLVMOrcDisposeMaterializationUnit(mu); > > +#if LLVM_VERSION_MAJOR > 12 > + for (int i = 0; i < LookupSetSize; i++) > + LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name); > +#endif > + > pfree(symbols); I think this should be part of the earlier loop? Once LLVMOrcAbsoluteSymbols() is called that owns the reference, so there doesn't seem to be a reason to increase the refcount only later? Greetings, Andres Freund
Hi, On 2021-06-19 10:12:03 -0400, Tom Lane wrote: > Is a compile-time conditional really going to be reliable? See nearby > arguments about compile-time vs run-time checks for libpq features. > It's not clear to me how tightly LLVM binds its headers and running > code. It should be fine (and if not we have plenty other places it'd be problematic). LLVM embeds the version between user of llvm and the library version in some symbol, so if there's a sufficient mismatch it'll cause link time issues. Of course that only works for major versions, but that shouldn't be an issue here. Greetings, Andres Freund
On Sun, Jun 20, 2021 at 10:59 PM Andres Freund <andres@anarazel.de> wrote: > I think this should be part of the earlier loop? Once > LLVMOrcAbsoluteSymbols() is called that owns the reference, so there > doesn't seem to be a reason to increase the refcount only later? Right, that makes sense. Here's a patch like that. Looking at their release schedule on https://llvm.org/, I see we have a gamble to make. They currently plan to cut RC1 at the end of July, and to release in late September (every second LLVM major release coincides approximately with a PG major release). Option 1: wait until we branch for 14, and then push this to master so that at least seawasp can get back to looking for new problems, and then back-patch only after they release (presumably in time for our November releases). If their API change sticks, PostgreSQL crashes and gives weird results with the initial release of LLVM 13 until our fix comes out. Option 2: get ahead of their release and get this into 14 + August back branch releases based on their current/RC behaviour. If they decide to revert the change before the final release, we'll leak symbol names because we hold an extra reference, until we can fix that. For the last round of changes[1], there was a similar when-to-act question, but that was a doesn't-compile-anymore API change, whereas this is a silent demons-might-fly-out-of-your-nose API change. [1] https://www.postgresql.org/message-id/flat/20201016011244.pmyvr3ee2gbzplq4%40alap3.anarazel.de
Attachment
Thomas Munro <thomas.munro@gmail.com> writes: > Looking at their release schedule on https://llvm.org/, I see we have > a gamble to make. They currently plan to cut RC1 at the end of July, > and to release in late September (every second LLVM major release > coincides approximately with a PG major release). Option 1: wait > until we branch for 14, and then push this to master so that at least > seawasp can get back to looking for new problems, and then back-patch > only after they release (presumably in time for our November > releases). If their API change sticks, PostgreSQL crashes and gives > weird results with the initial release of LLVM 13 until our fix comes > out. Option 2: get ahead of their release and get this into 14 + > August back branch releases based on their current/RC behaviour. If > they decide to revert the change before the final release, we'll leak > symbol names because we hold an extra reference, until we can fix > that. If that's an accurate characterization of the tradeoff, I have little difficulty in voting for #2. A crash is strictly worse than a memory leak. Besides which, I've heard little indication that they might revert. regards, tom lane
On Sun, Jun 20, 2021 at 11:01 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-06-19 10:12:03 -0400, Tom Lane wrote: > > Is a compile-time conditional really going to be reliable? See nearby > > arguments about compile-time vs run-time checks for libpq features. > > It's not clear to me how tightly LLVM binds its headers and running > > code. > > It should be fine (and if not we have plenty other places it'd be > problematic). LLVM embeds the version between user of llvm and the > library version in some symbol, so if there's a sufficient mismatch > it'll cause link time issues. Of course that only works for major > versions, but that shouldn't be an issue here. I looked into this a bit. On the usual Unixoid server OSes the first line of defence is that the major version is baked into the library name to support parallel installation of major versions, so our llvmjit.so is linked against eg libLLVM-13.so.1 (all controlled by llvm-config), and then there are further defences like versioned symbols, LLVM_13 etc on some platforms. Curiously, they skip this scheme for Macs (see their AddLLVM.cmake file) and Windows. So of course I wanted to try to see if I could make it break in the way Tom imagined, on a Mac. There, I use MacPorts, and it has separate packages for major versions, for example "llvm-12", much like other distros. The package maintainers put libLLVM.dylib (LLVM project's choice for this platform) into different paths under .../libexec/llvm-$VERSION/.... (package maintainer's choice), and there is a tool to select the current default (alarm bells ringing at this point). The first observation is that the Mach-O "compatibility version" is 1.0.0 on all the .dylibs, so yeah, that mechanism isn't going to save you, but ... it turns out to be a moot question for now because, to my surprise, we're statically linking LLVM into our llvmjit.so on that platform. That turns out to be because llvm-config --libs won't spit out dynamic link options if it can't find a library name with the version embedded in it. I see now that Brew's maintainers take it on themselves to create that symlink[1] (unlike MacPorts'), so ... erm, could be trouble there, I dunno because I don't want to install that, but if so, maybe they asked for it? I guess that none of this stuff really matters for real world non-hacker users, who are probably using an installer that ships its own copy of the thing. I expect it'll be the same on Windows when we eventually support LLVM there. /me closes Macintosh [1] https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/llvm.rb#L175
Hi, On 2021-06-20 19:56:56 -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Looking at their release schedule on https://llvm.org/, I see we have > > a gamble to make. They currently plan to cut RC1 at the end of July, > > and to release in late September (every second LLVM major release > > coincides approximately with a PG major release). Option 1: wait > > until we branch for 14, and then push this to master so that at least > > seawasp can get back to looking for new problems, and then back-patch > > only after they release (presumably in time for our November > > releases). If their API change sticks, PostgreSQL crashes and gives > > weird results with the initial release of LLVM 13 until our fix comes > > out. Option 2: get ahead of their release and get this into 14 + > > August back branch releases based on their current/RC behaviour. If > > they decide to revert the change before the final release, we'll leak > > symbol names because we hold an extra reference, until we can fix > > that. I think I'd vote for 2 or 2+ (backpatch immediately). > If that's an accurate characterization of the tradeoff, I have little > difficulty in voting for #2. A crash is strictly worse than a memory > leak. Besides which, I've heard little indication that they might > revert. We might be able to get them to revert and put in a different API, but I don't think it'd clearly be an improvement at this point. Greetings, Andres Freund
On Mon, Jun 21, 2021 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > If that's an accurate characterization of the tradeoff, I have little > difficulty in voting for #2. A crash is strictly worse than a memory > leak. Besides which, I've heard little indication that they might > revert. Agreed. On Mon, Jun 21, 2021 at 10:23 PM Andres Freund <andres@anarazel.de> wrote: > I think I'd vote for 2 or 2+ (backpatch immediately). Yeah, that makes sense. Done. Seawasp should turn green on its next run.
Hello Thomas, > Seawasp should turn green on its next run. Hopefully. It is not scheduled very soon because Tom complained about the induced noise in one buildfarm report, so I put the check to once a week. I changed it to start a run in a few minutes. I've rescheduled to once a day after that (previous schedule was a check every hour). -- Fabien.
>> Seawasp should turn green on its next run. It did! -- Fabien.