Re: Strange failure on mamba - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Strange failure on mamba
Date
Msg-id 20221130054225.3ydn5bxdrmel5ssu@awork3.anarazel.de
In response to Re: Strange failure on mamba  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Strange failure on mamba
List pgsql-hackers
Hi,

On 2022-11-29 20:44:34 -0500, Tom Lane wrote:
> Thanks to commit 51b5834cd I've now been able to capture some info
> from mamba's last couple of failures [1][2].  Sure enough, what is
> happening is that postmaster children are getting stuck in recursive
> rtld symbol resolution.  A couple of the stack traces I collected are
> 
> #0  0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
> #1  0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
> #2  0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
> #3  0x01e54ed0 in internal_load_library (
>     libname=libname@entry=0xfd74cc88 "/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so") at dfmgr.c:239
> #4  0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156
> #5  0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
> #6  0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161
> #7  0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310
> #8  0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
> #9  MaybeStartWalReceiver () at postmaster.c:5468
> #10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
> #11 <signal handler called>
> #12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
> #13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
> #14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
> #15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
> #16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
> #17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
> #18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
> Backtrace stopped: frame did not save the PC

Do you have any idea why the stack can't be unwound further here? Is it
possibly indicative of a corrupted stack? I guess we'd need to dig into the
NetBSD libc code :(
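
To make the suspected failure mode concrete: frames #12 to #18 look like lazy
PLT binding that was interrupted by the signal at frame #11, after which
dlopen() in frame #2 waits in _rtld_exclusive_enter(). The following is only a
minimal sketch of that re-entrancy pattern, not NetBSD's actual rtld code. The
flag resolver_lock_held is a hypothetical stand-in for rtld's internal lock,
and the handler merely records the re-entry where a real blocking lock would
wait forever.

```c
#include <signal.h>

/* Hypothetical stand-in for the dynamic linker's internal lock. */
static volatile sig_atomic_t resolver_lock_held = 0;
/* Set by the handler if it runs while the "lock" is held. */
static volatile sig_atomic_t reentry_detected = 0;

/*
 * Plays the role of sigusr1_handler: it wants the resolver lock (as dlopen()
 * would), but the interrupted code already holds it.
 */
static void
on_sigusr1(int signo)
{
    (void) signo;
    if (resolver_lock_held)
        reentry_detected = 1;   /* a real blocking lock would hang here */
}

/*
 * Simulates a signal arriving in the middle of lazy symbol binding.
 * Returns 1 if the handler re-entered while the lock was held, 0 if not,
 * and -1 if the handler could not be installed.
 */
int
simulate_interrupted_binding(void)
{
    reentry_detected = 0;
    if (signal(SIGUSR1, on_sigusr1) == SIG_ERR)
        return -1;

    resolver_lock_held = 1;     /* "inside _rtld_bind" */
    raise(SIGUSR1);             /* handler runs before raise() returns */
    resolver_lock_held = 0;

    signal(SIGUSR1, SIG_DFL);
    return reentry_detected;
}
```

With eager binding there is no lazy resolution left to interrupt, which is why
the binding mode matters for this failure.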


> which is pretty much just the same thing we were seeing before
> commit 8acd8f869 :->

What libraries is postgres linked against? I don't know whether -z now only
affects the "top-level" dependencies of postgres, or also the dependencies of
shared libraries that haven't been built with -z now.  The only dependencies
that I could see being relevant are libintl and openssl.

You could check whether anything changes when you set LD_BIND_NOW; that should
trigger "recursive" dependencies to be loaded eagerly as well.
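
When testing this, it may help to confirm that the variable actually reached
the process in question, since test harnesses sometimes adjust the
environment. A minimal sketch of such a check follows. The helper names are
hypothetical, and treating only a non-empty value as "set" is an assumption
about how the dynamic linker interprets the variable; check ld.elf_so(1) on the
target system.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Returns true if the given LD_BIND_NOW value would count as a request for
 * eager binding. Assumption: only a non-empty value counts.
 */
bool
bind_now_value_is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Reports whether LD_BIND_NOW is set in this process's environment. */
bool
bind_now_requested(void)
{
    return bind_now_value_is_set(getenv("LD_BIND_NOW"));
}
```

A diagnostic built on bind_now_requested() only shows what the process
inherited; it does not prove that the dynamic linker honored the setting.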

Greetings,

Andres Freund


