Re: Strange failure on mamba - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Strange failure on mamba
Date
Msg-id 1254382.1669772674@sss.pgh.pa.us
Whole thread Raw
In response to Re: Strange failure on mamba  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Strange failure on mamba
Re: Strange failure on mamba
Re: Strange failure on mamba
List pgsql-hackers
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> mamba has been showing intermittent failures in various replication
>>> tests since day one.

>> I wonder if it's a runtime variant of the other problem.  We do
>> load_file("libpqwalreceiver", false) before unblocking signals but
>> maybe don't resolve the symbols until calling them, or something like
>> that...

> Yeah, that or some other NetBSD bug could be the explanation, too.
> Without a stack trace it's hard to have any confidence about it,
> but I've been unable to reproduce the problem outside the buildfarm.

Thanks to commit 51b5834cd I've now been able to capture some info
from mamba's last couple of failures [1][2].  Sure enough, what is
happening is that postmaster children are getting stuck in recursive
rtld symbol resolution.  A couple of the stack traces I collected are

#0  0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2  0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
#3  0x01e54ed0 in internal_load_library (
    libname=libname@entry=0xfd74cc88
"/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so")
atdfmgr.c:239 
#4  0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156
#5  0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
#6  0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161
#7  0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310
#8  0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
#9  MaybeStartWalReceiver () at postmaster.c:5468
#10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
#11 <signal handler called>
#12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

#0  0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1  0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2  0xfdee4ba4 in _rtld_exit () from /usr/libexec/ld.elf_so
#3  0xfd54ea74 in __cxa_finalize () from /usr/lib/libc.so.12
#4  0xfd54e354 in exit () from /usr/lib/libc.so.12
#5  0x01c963c0 in proc_exit (code=code@entry=0) at ipc.c:152
#6  0x01c056e4 in AutoVacLauncherShutdown () at autovacuum.c:853
#7  0x01c071dc in AutoVacLauncherMain (argv=0x0, argc=0) at autovacuum.c:800
#8  0x01c07694 in StartAutoVacLauncher () at autovacuum.c:416
#9  0x01c11d3c in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3038
#10 <signal handler called>
#11 0xfdee6f64 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#12 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#13 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#14 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#15 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#16 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC

which is pretty much just the same thing we were seeing before
commit 8acd8f869 :-(

Now, we certainly cannot think that these are occurring early in
postmaster startup.  In the wake of 8acd8f869, we should expect
that there's no further need to call rtld_bind at all in the
postmaster, but seemingly that's not so.  It's very frustrating
that the backtrace stops where it does :-(.  It's also strange
that we're apparently running with signals enabled whereever
it is that rtld_bind is getting called from.  Could it be that
sigaction is failing to install the requested signal mask, so
that one postmaster signal handler is interrupting another?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-24%2021%3A45%3A29
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-29%2020%3A50%3A36



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Next
From: Bruce Momjian
Date:
Subject: Re: New docs chapter on Transaction Management and related changes