Thread: seawasp failing, maybe in glibc allocator

seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

09 May 2021, 09:23:42

Hi,

Since seawasp's bleeding-edge clang moved to "20210226", it failed
every run except 4, and a couple of days ago it moved to "20210508"
and it's still broken.  It's always like this:

2021-05-09 03:31:37.602 CEST [1678796:171] pg_regress/_int LOG:
statement: RESET enable_seqscan;
corrupted double-linked list

... which doesn't appear in our code, but matches this:

https://github.com/bminor/glibc/blob/cedbf6d5f3f70ca911176de87d6e453eeab4b7a1/malloc/malloc.c#L1645

No reason to think it's our fault, but it'd be nice to see a
backtrace.  Is gdb installed, and are core files being dumped by that
SIGABRT, and are they using the default name
(/proc/sys/kernel/core_pattern = core), which the BF can find with the
value it's using, namely 'core_file_glob' => 'core*'?

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

10 May 2021, 06:59:21

Hello Thomas,

> Since seawasp's bleeding-edge clang moved to "20210226", it failed
> every run except 4, and a couple of days ago it moved to "20210508"
> and it's still broken.

Indeed I have noticed that there is indeed an issue, but the investigation 
is not very high on my current too deep pg-unrelated todo list.

> It's always like this:
>
> 2021-05-09 03:31:37.602 CEST [1678796:171] pg_regress/_int LOG:
> statement: RESET enable_seqscan;
> corrupted double-linked list
>
> ... which doesn't appear in our code, but matches this:
>
> https://github.com/bminor/glibc/blob/cedbf6d5f3f70ca911176de87d6e453eeab4b7a1/malloc/malloc.c#L1645

> No reason to think it's our fault, but it'd be nice to see a
> backtrace.

ISTM it looks like some kind of memory corruption. If I'd have to guess 
between glibc, clang and pg, not sure which one I'd chose between the two 
laters potential bug sources.

> Is gdb installed, and are core files being dumped by that SIGABRT, and 
> are they using the default name (/proc/sys/kernel/core_pattern = core), 
> which the BF can find with the value it's using, namely 'core_file_glob' 
> => 'core*'?

Nope:

   sh> cat /proc/sys/kernel/core_pattern
   |/usr/share/apport/apport %p %s %c %d %P %E

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

10 May 2021, 07:27:29

On Mon, May 10, 2021 at 6:59 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > Is gdb installed, and are core files being dumped by that SIGABRT, and
> > are they using the default name (/proc/sys/kernel/core_pattern = core),
> > which the BF can find with the value it's using, namely 'core_file_glob'
> > => 'core*'?
>
> Nope:
>
>    sh> cat /proc/sys/kernel/core_pattern
>    |/usr/share/apport/apport %p %s %c %d %P %E

If you don't care about Ubuntu "apport" on this system (something for
sending crash/bug reports to developers with a GUI), you could
uninstall it (otherwise it overwrites the core_pattern every time it
restarts, no matter what you write in your sysctl.conf, apparently),
and then sudo sysctl -w kernel.core_pattern=core to undo the setting
immediately (or reboot).  Then hopefully the build farm would succeed
in dumping a backtrace into the log.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

10 May 2021, 09:30:31

> If you don't care about Ubuntu "apport" on this system (something for
> sending crash/bug reports to developers with a GUI), you could
> uninstall it (otherwise it overwrites the core_pattern every time it
> restarts, no matter what you write in your sysctl.conf, apparently),
> and then sudo sysctl -w kernel.core_pattern=core to undo the setting
> immediately (or reboot).  Then hopefully the build farm would succeed
> in dumping a backtrace into the log.

I forced-removed apport (which meant removing xserver-xorg). Let's see 
whether the reports are better or whether I break something.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

10 May 2021, 10:52:24

On Mon, May 10, 2021 at 9:30 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> I forced-removed apport (which meant removing xserver-xorg). Let's see
> whether the reports are better or whether I break something.

And of course this time it succeeded :-)

Just by the way, I noticed it takes ~40 minutes to compile.  Is there
a reason you don't install ccache and set eg CC="ccache
/path/to/clang", CXX="ccache /path/to/clang++", CLANG="ccache
/path/to/clang"?  That doesn't seem to conflict with your goal of
testing LLVM/Clang's main branch, because ccache checks the compiler's
mtime as part of its cache invalidation strategy.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

10 May 2021, 11:21:24

> And of course this time it succeeded :-)

Hmmm. ISTM that failures are on and off every few attempts.

> Just by the way, I noticed it takes ~40 minutes to compile.  Is there
> a reason you don't install ccache and set eg CC="ccache
> /path/to/clang", CXX="ccache /path/to/clang++", CLANG="ccache
> /path/to/clang"?  That doesn't seem to conflict with your goal of
> testing LLVM/Clang's main branch, because ccache checks the compiler's
> mtime as part of its cache invalidation strategy.

Yep.

I remember that I disactivated it for some reason once, but I cannot 
remember why. I just reactivated it, will see what happens.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

11 May 2021, 00:16:44

On Mon, May 10, 2021 at 11:21 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > And of course this time it succeeded :-)
>
> Hmmm. ISTM that failures are on and off every few attempts.

OK we got the SIGABRT this time, but still no backtrace.  If the
kernel's core_pattern is "core", gdb is installed, then considering
that the buildfarm core_file_glob is "core*" and the script version is
recent (REL_12), then I'm out of ideas.  ulimit -c XXX shouldn't be
needed because the perl script does that with rlimit.

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

11 May 2021, 01:08:48

On 2021-05-11 12:16:44 +1200, Thomas Munro wrote:
> OK we got the SIGABRT this time, but still no backtrace.  If the
> kernel's core_pattern is "core", gdb is installed, then considering
> that the buildfarm core_file_glob is "core*" and the script version is
> recent (REL_12), then I'm out of ideas.  ulimit -c XXX shouldn't be
> needed because the perl script does that with rlimit.

Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show
that.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

11 May 2021, 08:22:02

> On 2021-05-11 12:16:44 +1200, Thomas Munro wrote:
>> OK we got the SIGABRT this time, but still no backtrace.  If the
>> kernel's core_pattern is "core", gdb is installed, then considering
>> that the buildfarm core_file_glob is "core*" and the script version is
>> recent (REL_12), then I'm out of ideas.  ulimit -c XXX shouldn't be
>> needed because the perl script does that with rlimit.
>
> Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show
> that.

Possibly I have just added "ulimit -c unlimited" in the script, we should 
see the effect on next round.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

11 May 2021, 17:13:42

On 2021-05-11 10:22:02 +0200, Fabien COELHO wrote:
> 
> > On 2021-05-11 12:16:44 +1200, Thomas Munro wrote:
> > > OK we got the SIGABRT this time, but still no backtrace.  If the
> > > kernel's core_pattern is "core", gdb is installed, then considering
> > > that the buildfarm core_file_glob is "core*" and the script version is
> > > recent (REL_12), then I'm out of ideas.  ulimit -c XXX shouldn't be
> > > needed because the perl script does that with rlimit.
> > 
> > Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show
> > that.
> 
> Possibly I have just added "ulimit -c unlimited" in the script, we should
> see the effect on next round.

If it's the hard limit that won't help, because the hard limit can only
be increased by a privileged process.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

11 May 2021, 21:09:55

Hello Andres,p

>>> Unless perhaps the hard rlimit for -C is set? ulimit -c -H should show
>>> that.
>>
>> Possibly I have just added "ulimit -c unlimited" in the script, we should
>> see the effect on next round.
>
> If it's the hard limit that won't help, because the hard limit can only
> be increased by a privileged process.

It seems to be the soft one, so ISTM that it may work.

Before:

   sh> ulimit -c -S
   0
   sh> ulimit -c -H
   unlimited

Then after:

   sh> ulimit -c unlimited

I have:

   sh> ulimit -c -S
   unlimited
   sh> ulimit -c -H
   unlimited

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

12 May 2021, 18:30:56

> Possibly I have just added "ulimit -c unlimited" in the script, we should see 
> the effect on next round.

for def5b065 it ended on on the contrib ltree test:

  2021-05-12 20:12:52.528 CEST [3042602:410] pg_regress/ltree LOG: disconnection: session time: 0:00:13.426
user=buildfarm database=contrib_regression_ltree host=[local]
 

  /home/fabien/pg/build-farm-12/buildroot/HEAD/pgsql.build/contrib/ltree/results/ltree.out
  2021-05-12 20:12:52.523330311 +0200
  @@ -7931,11 +7931,8 @@
   (1 row)

   SELECT count(*) FROM _ltreetest WHERE t ~ '1.1.1.*' ;
  - count
  --------
  -    19
  -(1 row)
  -
  +ERROR:  stack depth limit exceeded
  +HINT:  Increase the configuration parameter "max_stack_depth" (currently  2048kB), after ensuring the platform's
stackdepth limit is adequate.
 
   SELECT count(*) FROM _ltreetest WHERE t ~ '*.1' ;
    count
   -------

Not a core dump, though.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

15 May 2021, 06:41:27

Hello Andres,

It finally failed with a core on 8f72bba, in llvm_shutdown, AFAIKS in a 
free while doing malloc-related housekeeping.

My guess is that there is an actual memory corruption somewhere. It is 
unobvious whether it is in bleeding-edge llvm or bleeding-edge postgres 
though.

The issue is non-deterministically triggered in contrib checks, either in 
int or ltree, but not elsewhere. This suggests issues specific to these 
modules, or triggered by these modules. Hmmm…

I've just launched a run with valgrind enabled.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

16 May 2021, 10:58:49

On Sat, May 15, 2021 at 6:41 PM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> The issue is non-deterministically triggered in contrib checks, either in
> int or ltree, but not elsewhere. This suggests issues specific to these
> modules, or triggered by these modules. Hmmm…

Hmm, yeah.  A couple of different ways that ltreetest fails without crashing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-13%2001%3A17%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-12%2017%3A17%3A15

Otherwise it's mostly free() blowing up, and one case of an assertion
failure in llvm::StringMapImpl::RemoveKey, I guess before free() is
reached:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-14%2009%3A17%3A15

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

18 May 2021, 17:02:20

>> The issue is non-deterministically triggered in contrib checks, either in
>> int or ltree, but not elsewhere. This suggests issues specific to these
>> modules, or triggered by these modules. Hmmm…
>
> Hmm, yeah.  A couple of different ways that ltreetest fails without crashing:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-13%2001%3A17%3A15
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-12%2017%3A17%3A15
>
> Otherwise it's mostly free() blowing up, and one case of an assertion
> failure in llvm::StringMapImpl::RemoveKey, I guess before free() is
> reached:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=seawasp&dt=2021-05-14%2009%3A17%3A15

It seems that the upload of the valgrind run (many hours…) failed on "413 
request entity too large", and everything seems to have been cleaned 
despite the "--keepall" I think I put when I started the run.

Not sure about the best way to proceed.

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

21 May 2021, 09:37:22

On Wed, May 19, 2021 at 5:02 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> It seems that the upload of the valgrind run (many hours…) failed on "413
> request entity too large", and everything seems to have been cleaned
> despite the "--keepall" I think I put when I started the run.

I installed Clang/LLVM version
"1:13~++20210520071732+02f2d739e074-1~exp1~20210520052519.57" from
https://apt.llvm.org/ on a Debian buster box, and I saw that
contrib/ltree's test fail about half the time with a range of weird
and wonderful outputs (wrong answers) similar to seawasp, but it never
crashed.  I ran it under valgrind and I managed to get:

==529250== Invalid read of size 8
==529250==    at 0x1475B6CA: void
std::vector<llvm::orc::SymbolStringPtr,
std::allocator<llvm::orc::SymbolStringPtr>
>::_M_realloc_insert<llvm::orc::SymbolStringPtr
const&>(__gnu_cxx::__normal_iterator<llvm::orc::SymbolStringPtr*,
std::vector<llvm::orc::SymbolStringPtr,
std::allocator<llvm::orc::SymbolStringPtr> > >,
llvm::orc::SymbolStringPtr const&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x1474F246:
llvm::orc::JITDylib::removeTracker(llvm::orc::ResourceTracker&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14741E0A:
llvm::orc::ExecutionSession::removeResourceTracker(llvm::orc::ResourceTracker&)
(in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14747421: llvm::orc::JITDylib::clear() (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14751F5B: llvm::orc::ExecutionSession::endSession()
(in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14785207: llvm::orc::LLJIT::~LLJIT() (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x147A4C7D: LLVMOrcDisposeLLJIT (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x124D0094: llvm_shutdown (llvmjit.c:892)
==529250==    by 0x57BC08: proc_exit_prepare (ipc.c:209)
==529250==    by 0x57BC77: proc_exit (ipc.c:107)
==529250==    by 0x5A545B: PostgresMain (postgres.c:4700)
==529250==    by 0x517569: BackendRun (postmaster.c:4491)
==529250==    by 0x517569: BackendStartup (postmaster.c:4213)
==529250==    by 0x517569: ServerLoop (postmaster.c:1745)
==529250==  Address 0x1a969088 is 1,416 bytes inside a block of size
3,072 free'd
==529250==    at 0x4839EAB: operator delete(void*) (vg_replace_malloc.c:584)
==529250==    by 0x141DFD8E: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141DD9C0: llvm::LazyValueInfo::releaseMemory() (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13294832:
llvm::PMDataManager::freePass(llvm::Pass*, llvm::StringRef,
llvm::PassDebuggingString) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13294774:
llvm::PMDataManager::removeDeadPasses(llvm::Pass*, llvm::StringRef,
llvm::PassDebuggingString) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13290A69:
llvm::FPPassManager::runOnFunction(llvm::Function&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14137632: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13290FDE:
llvm::legacy::PassManagerImpl::run(llvm::Module&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x131F8625: LLVMRunPassManager (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x124D0772: llvm_optimize_module (llvmjit.c:620)
==529250==    by 0x124D0772: llvm_compile_module (llvmjit.c:671)
==529250==    by 0x124D0772: llvm_get_function (llvmjit.c:291)
==529250==    by 0x124DA3BC: ExecRunCompiledExpr (llvmjit_expr.c:2402)
==529250==    by 0x41D15C: ExecEvalExprSwitchContext (executor.h:339)
==529250==    by 0x41D15C: ExecQual (executor.h:408)
==529250==    by 0x41D15C: ExecScan (execScan.c:227)
==529250==  Block was alloc'd at
==529250==    at 0x4838DEF: operator new(unsigned long)
(vg_replace_malloc.c:342)
==529250==    by 0x141E47D6: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141E43B7: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141E3202: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141E0501: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141DDD74: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x141DDC4C:
llvm::LazyValueInfo::getConstant(llvm::Value*, llvm::Instruction*) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13D0EE4B: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13D11D42: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x1329098C:
llvm::FPPassManager::runOnFunction(llvm::Function&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x14137632: ??? (in /usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==    by 0x13290FDE:
llvm::legacy::PassManagerImpl::run(llvm::Module&) (in
/usr/lib/x86_64-linux-gnu/libLLVM-13.so.1)
==529250==

Maybe they should rewrite LLVM in Rust.

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

21 May 2021, 21:58:38

Hi,

On 2021-05-21 21:37:22 +1200, Thomas Munro wrote:
> I installed Clang/LLVM version
> "1:13~++20210520071732+02f2d739e074-1~exp1~20210520052519.57" from
> https://apt.llvm.org/ on a Debian buster box, and I saw that
> contrib/ltree's test fail about half the time with a range of weird
> and wonderful outputs (wrong answers) similar to seawasp, but it never
> crashed.  I ran it under valgrind and I managed to get:

Interesting. I tried this with a slightly older LLVM checkout
(6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an
issue, even if I force everything to be jitted+inlined+optimized. The
git hash in the package version indicates the commit is from
2021-05-20. Upgrading my local checkout to see whether I can repro the
problem. If I can we at least have a not too large bisection window...

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Tom Lane

Date:

21 May 2021, 22:18:54

Andres Freund <andres@anarazel.de> writes:
> Interesting. I tried this with a slightly older LLVM checkout
> (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an
> issue, even if I force everything to be jitted+inlined+optimized. The
> git hash in the package version indicates the commit is from
> 2021-05-20. Upgrading my local checkout to see whether I can repro the
> problem. If I can we at least have a not too large bisection window...

We know that seawasp was okay as of

configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
f22d3813850f9e87c5204df6844a93b8c5db7730)

and not okay as of

configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
0e8f5e4a6864839d2292ec1ddfe48b6178c01f85)

so that should correspond to a window of about a week, if I gather
Fabien's update strategy correctly.

            regards, tom lane

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

21 May 2021, 22:26:19

Hi,

On 2021-05-21 14:58:38 -0700, Andres Freund wrote:
> Interesting. I tried this with a slightly older LLVM checkout
> (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an
> issue, even if I force everything to be jitted+inlined+optimized. The
> git hash in the package version indicates the commit is from
> 2021-05-20. Upgrading my local checkout to see whether I can repro the
> problem. If I can we at least have a not too large bisection window...

After resolving some PEBKAC issues I was able to reproduce the issue
after a git pull for fresh llvm sources and rebuilding.  Hope I can
narrow it down without needing to bisect, there's faster building
projects than LLVM...

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

21 May 2021, 22:57:01

Hi,

On 2021-05-21 18:18:54 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Interesting. I tried this with a slightly older LLVM checkout
> > (6f4f0afaa8ae), from 2021-04-20, contrib/ltree tests run without an
> > issue, even if I force everything to be jitted+inlined+optimized. The
> > git hash in the package version indicates the commit is from
> > 2021-05-20. Upgrading my local checkout to see whether I can repro the
> > problem. If I can we at least have a not too large bisection window...
> 
> We know that seawasp was okay as of
> 
> configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
f22d3813850f9e87c5204df6844a93b8c5db7730)
> 
> and not okay as of
> 
> configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
0e8f5e4a6864839d2292ec1ddfe48b6178c01f85)
> 
> so that should correspond to a window of about a week, if I gather
> Fabien's update strategy correctly.

I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671).
Contacting the author and reading the change to see if I can spit the
issue myself.

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

22 May 2021, 00:25:10

Hi,

On 2021-05-21 15:57:01 -0700, Andres Freund wrote:
> I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671).
> Contacting the author and reading the change to see if I can spit the
> issue myself.

Hrmpf. It's a silent API breakage. The author intended to email us about
it, but apparently forgot. One now needs to increment a string-pool
refcount. The reason that didn't trigger a reliable crash is that
there's a path where the refcount of string-pool entries aren't asserted
to be above before decrementing the refcount... And that there
practically never are references to the pool entries after use.

Continuing to discusss whether there's a better way to deal with this.

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

22 May 2021, 08:08:46

> We know that seawasp was okay as of
>
> configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
f22d3813850f9e87c5204df6844a93b8c5db7730)
>
> and not okay as of
>
> configure: using compiler=clang version 13.0.0 (https://github.com/llvm/llvm-project.git
0e8f5e4a6864839d2292ec1ddfe48b6178c01f85)
>
> so that should correspond to a window of about a week, if I gather
> Fabien's update strategy correctly.

Indeed, the full recompilation is triggered once a week, on Saturdays.

If the build fails for some reason the previous version is kept and I may 
or may not have time to look at it and try to fix it for some time after 
that, or I may happen that I do not notice the issue for some time…

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

19 June 2021, 05:07:51

On Sat, May 22, 2021 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-05-21 15:57:01 -0700, Andres Freund wrote:
> > I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671).
> > Contacting the author and reading the change to see if I can spit the
> > issue myself.
>
> Hrmpf. It's a silent API breakage. The author intended to email us about
> it, but apparently forgot. One now needs to increment a string-pool
> refcount. The reason that didn't trigger a reliable crash is that
> there's a path where the refcount of string-pool entries aren't asserted
> to be above before decrementing the refcount... And that there
> practically never are references to the pool entries after use.
>
> Continuing to discusss whether there's a better way to deal with this.

Any news?

FWIW this change appears to fix the problem for my system (LLVM 13
build from a couple of days ago).  No more weird results, valgrind
errors gone.  I ran the leak checker to see if I now had the opposite
problem, and although there are various leaks reported, I didn't see
obvious intern pool related stacks.

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 71029a39a9..7b09e520f5 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -1116,6 +1116,11 @@
llvm_resolve_symbols(LLVMOrcDefinitionGeneratorRef GeneratorObj, void
*Ctx,
        if (error != LLVMErrorSuccess)
                LLVMOrcDisposeMaterializationUnit(mu);

+#if LLVM_VERSION_MAJOR > 12
+       for (int i = 0; i < LookupSetSize; i++)
+               LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name);
+#endif
+
        pfree(symbols);

        return error;

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

19 June 2021, 09:37:43

On Sat, Jun 19, 2021 at 5:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>         if (error != LLVMErrorSuccess)
>                 LLVMOrcDisposeMaterializationUnit(mu);
>
> +#if LLVM_VERSION_MAJOR > 12
> +       for (int i = 0; i < LookupSetSize; i++)
> +               LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name);
> +#endif

(Though, erm, that code probably either needs to move a bit further up
or become conditional, considering the error case immediately above
it, not sure which...)

Re: seawasp failing, maybe in glibc allocator

From

Tom Lane

Date:

19 June 2021, 14:12:03

Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Jun 19, 2021 at 5:07 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> if (error != LLVMErrorSuccess)
>> LLVMOrcDisposeMaterializationUnit(mu);
>> 
>> +#if LLVM_VERSION_MAJOR > 12
>> +       for (int i = 0; i < LookupSetSize; i++)
>> +               LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name);
>> +#endif

> (Though, erm, that code probably either needs to move a bit further up
> or become conditional, considering the error case immediately above
> it, not sure which...)

Is a compile-time conditional really going to be reliable?  See nearby
arguments about compile-time vs run-time checks for libpq features.
It's not clear to me how tightly LLVM binds its headers and running
code.

            regards, tom lane

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

20 June 2021, 10:59:34

Hi,

On 2021-06-19 17:07:51 +1200, Thomas Munro wrote:
> On Sat, May 22, 2021 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-05-21 15:57:01 -0700, Andres Freund wrote:
> > > I found the LLVM commit to blame (c8fc5e3ba942057d6c4cdcd1faeae69a28e7b671).
> > > Contacting the author and reading the change to see if I can spit the
> > > issue myself.
> >
> > Hrmpf. It's a silent API breakage. The author intended to email us about
> > it, but apparently forgot. One now needs to increment a string-pool
> > refcount. The reason that didn't trigger a reliable crash is that
> > there's a path where the refcount of string-pool entries aren't asserted
> > to be above before decrementing the refcount... And that there
> > practically never are references to the pool entries after use.
> >
> > Continuing to discusss whether there's a better way to deal with this.
> 
> Any news?

Nothing really. I'd discussed it a bit with the relevant LLVM
maintainer, but we didn't come to a real resolution. He apologized for
not giving a proper heads up - he'd planned to send out an email, but
somehow lost track of that.

Given that any potential solution would be also end up being a versioned
ifdef, I think adding something like what you suggest here is the least
unreasonable solution.

> FWIW this change appears to fix the problem for my system (LLVM 13
> build from a couple of days ago).  No more weird results, valgrind
> errors gone.  I ran the leak checker to see if I now had the opposite
> problem, and although there are various leaks reported, I didn't see
> obvious intern pool related stacks.
> 
> diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
> index 71029a39a9..7b09e520f5 100644
> --- a/src/backend/jit/llvm/llvmjit.c
> +++ b/src/backend/jit/llvm/llvmjit.c
> @@ -1116,6 +1116,11 @@
> llvm_resolve_symbols(LLVMOrcDefinitionGeneratorRef GeneratorObj, void
> *Ctx,
>         if (error != LLVMErrorSuccess)
>                 LLVMOrcDisposeMaterializationUnit(mu);
> 
> +#if LLVM_VERSION_MAJOR > 12
> +       for (int i = 0; i < LookupSetSize; i++)
> +               LLVMOrcRetainSymbolStringPoolEntry(symbols[i].Name);
> +#endif
> +
>         pfree(symbols);

I think this should be part of the earlier loop? Once
LLVMOrcAbsoluteSymbols() is called that owns the reference, so there
doesn't seem to be a reason to increase the refcount only later?

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

20 June 2021, 11:01:07

Hi,

On 2021-06-19 10:12:03 -0400, Tom Lane wrote:
> Is a compile-time conditional really going to be reliable?  See nearby
> arguments about compile-time vs run-time checks for libpq features.
> It's not clear to me how tightly LLVM binds its headers and running
> code.

It should be fine (and if not we have plenty other places it'd be
problematic). LLVM embeds the version between user of llvm and the
library version in some symbol, so if there's a sufficient mismatch
it'll cause link time issues. Of course that only works for major
versions, but that shouldn't be an issue here.

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

20 June 2021, 23:45:12

On Sun, Jun 20, 2021 at 10:59 PM Andres Freund <andres@anarazel.de> wrote:
> I think this should be part of the earlier loop? Once
> LLVMOrcAbsoluteSymbols() is called that owns the reference, so there
> doesn't seem to be a reason to increase the refcount only later?

Right, that makes sense.  Here's a patch like that.

Looking at their release schedule on https://llvm.org/, I see we have
a gamble to make.  They currently plan to cut RC1 at the end of July,
and to release in late September (every second LLVM major release
coincides approximately with a PG major release).  Option 1: wait
until we branch for 14, and then push this to master so that at least
seawasp can get back to looking for new problems, and then back-patch
only after they release (presumably in time for our November
releases).  If their API change sticks, PostgreSQL crashes and gives
weird results with the initial release of LLVM 13 until our fix comes
out.  Option 2: get ahead of their release and get this into 14 +
August back branch releases based on their current/RC behaviour.  If
they decide to revert the change before the final release, we'll leak
symbol names because we hold an extra reference, until we can fix
that.

For the last round of changes[1], there was a similar when-to-act
question, but that was a doesn't-compile-anymore API change, whereas
this is a silent demons-might-fly-out-of-your-nose API change.

[1] https://www.postgresql.org/message-id/flat/20201016011244.pmyvr3ee2gbzplq4%40alap3.anarazel.de

Attachment

0001-Adjust-for-LLVM-13-API-change.patch

Re: seawasp failing, maybe in glibc allocator

From

Tom Lane

Date:

20 June 2021, 23:56:56

Thomas Munro <thomas.munro@gmail.com> writes:
> Looking at their release schedule on https://llvm.org/, I see we have
> a gamble to make.  They currently plan to cut RC1 at the end of July,
> and to release in late September (every second LLVM major release
> coincides approximately with a PG major release).  Option 1: wait
> until we branch for 14, and then push this to master so that at least
> seawasp can get back to looking for new problems, and then back-patch
> only after they release (presumably in time for our November
> releases).  If their API change sticks, PostgreSQL crashes and gives
> weird results with the initial release of LLVM 13 until our fix comes
> out.  Option 2: get ahead of their release and get this into 14 +
> August back branch releases based on their current/RC behaviour.  If
> they decide to revert the change before the final release, we'll leak
> symbol names because we hold an extra reference, until we can fix
> that.

If that's an accurate characterization of the tradeoff, I have little
difficulty in voting for #2.  A crash is strictly worse than a memory
leak.  Besides which, I've heard little indication that they might
revert.

            regards, tom lane

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

21 June 2021, 01:23:34

On Sun, Jun 20, 2021 at 11:01 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-06-19 10:12:03 -0400, Tom Lane wrote:
> > Is a compile-time conditional really going to be reliable?  See nearby
> > arguments about compile-time vs run-time checks for libpq features.
> > It's not clear to me how tightly LLVM binds its headers and running
> > code.
>
> It should be fine (and if not we have plenty other places it'd be
> problematic). LLVM embeds the version between user of llvm and the
> library version in some symbol, so if there's a sufficient mismatch
> it'll cause link time issues. Of course that only works for major
> versions, but that shouldn't be an issue here.

I looked into this a bit.  On the usual Unixoid server OSes the first
line of defence is that the major version is baked into the library
name to support parallel installation of major versions, so our
llvmjit.so is linked against eg libLLVM-13.so.1 (all controlled by
llvm-config), and then there are further defences like versioned
symbols, LLVM_13 etc on some platforms.

Curiously, they skip this scheme for Macs (see their AddLLVM.cmake
file) and Windows.  So of course I wanted to try to see if I could
make it break in the way Tom imagined, on a Mac.  There, I use
MacPorts, and it has separate packages for major versions, for example
"llvm-12", much like other distros.  The package maintainers put
libLLVM.dylib (LLVM project's choice for this platform) into different
paths under .../libexec/llvm-$VERSION/.... (package maintainer's
choice), and there is a tool to select the current default (alarm
bells ringing at this point).  The first observation is that the
Mach-O "compatibility version" is 1.0.0 on all the .dylibs, so yeah,
that mechanism isn't going to save you, but ... it turns out to be a
moot question for now because, to my surprise, we're statically
linking LLVM into our llvmjit.so on that platform.  That turns out to
be because llvm-config --libs won't spit out dynamic link options if
it can't find a library name with the version embedded in it.  I see
now that Brew's maintainers take it on themselves to create that
symlink[1] (unlike MacPorts'), so ... erm, could be trouble there, I
dunno because I don't want to install that, but if so, maybe they
asked for it?  I guess that none of this stuff really matters for real
world non-hacker users, who are probably using an installer that ships
its own copy of the thing.  I expect it'll be the same on Windows when
we eventually support LLVM there.  /me closes Macintosh

[1] https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/llvm.rb#L175

Re: seawasp failing, maybe in glibc allocator

From

Andres Freund

Date:

21 June 2021, 10:23:25

Hi,

On 2021-06-20 19:56:56 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Looking at their release schedule on https://llvm.org/, I see we have
> > a gamble to make.  They currently plan to cut RC1 at the end of July,
> > and to release in late September (every second LLVM major release
> > coincides approximately with a PG major release).  Option 1: wait
> > until we branch for 14, and then push this to master so that at least
> > seawasp can get back to looking for new problems, and then back-patch
> > only after they release (presumably in time for our November
> > releases).  If their API change sticks, PostgreSQL crashes and gives
> > weird results with the initial release of LLVM 13 until our fix comes
> > out.  Option 2: get ahead of their release and get this into 14 +
> > August back branch releases based on their current/RC behaviour.  If
> > they decide to revert the change before the final release, we'll leak
> > symbol names because we hold an extra reference, until we can fix
> > that.

I think I'd vote for 2 or 2+ (backpatch immediately).


> If that's an accurate characterization of the tradeoff, I have little
> difficulty in voting for #2.  A crash is strictly worse than a memory
> leak.  Besides which, I've heard little indication that they might
> revert.

We might be able to get them to revert and put in a different API, but I
don't think it'd clearly be an improvement at this point.

Greetings,

Andres Freund

Re: seawasp failing, maybe in glibc allocator

From

Thomas Munro

Date:

24 June 2021, 23:38:01

On Mon, Jun 21, 2021 at 11:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If that's an accurate characterization of the tradeoff, I have little
> difficulty in voting for #2.  A crash is strictly worse than a memory
> leak.  Besides which, I've heard little indication that they might
> revert.

Agreed.

On Mon, Jun 21, 2021 at 10:23 PM Andres Freund <andres@anarazel.de> wrote:
> I think I'd vote for 2 or 2+ (backpatch immediately).

Yeah, that makes sense.  Done.

Seawasp should turn green on its next run.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

26 June 2021, 06:09:25

Hello Thomas,

> Seawasp should turn green on its next run.

Hopefully.

It is not scheduled very soon because Tom complained about the induced 
noise in one buildfarm report, so I put the check to once a week.

I changed it to start a run in a few minutes. I've rescheduled to once a 
day after that (previous schedule was a check every hour).

-- 
Fabien.

Re: seawasp failing, maybe in glibc allocator

From

Fabien COELHO

Date:

26 June 2021, 07:15:48

>> Seawasp should turn green on its next run.

It did!

-- 
Fabien.