Re: Strange failure on mamba - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Strange failure on mamba |
Date | |
Msg-id | 1254382.1669772674@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Strange failure on mamba (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Strange failure on mamba
Re: Strange failure on mamba Re: Strange failure on mamba |
List | pgsql-hackers |
I wrote: > Thomas Munro <thomas.munro@gmail.com> writes: >> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> mamba has been showing intermittent failures in various replication >>> tests since day one. >> I wonder if it's a runtime variant of the other problem. We do >> load_file("libpqwalreceiver", false) before unblocking signals but >> maybe don't resolve the symbols until calling them, or something like >> that... > Yeah, that or some other NetBSD bug could be the explanation, too. > Without a stack trace it's hard to have any confidence about it, > but I've been unable to reproduce the problem outside the buildfarm. Thanks to commit 51b5834cd I've now been able to capture some info from mamba's last couple of failures [1][2]. Sure enough, what is happening is that postmaster children are getting stuck in recursive rtld symbol resolution. A couple of the stack traces I collected are #0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so #1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so #2 0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so #3 0x01e54ed0 in internal_load_library ( libname=libname@entry=0xfd74cc88 "/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so") atdfmgr.c:239 #4 0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized out>) at dfmgr.c:156 #5 0x01c5ba24 in WalReceiverMain () at walreceiver.c:292 #6 0x01c090f8 in AuxiliaryProcessMain (auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161 #7 0x01c10970 in StartChildProcess (type=WalReceiverProcess) at postmaster.c:5310 #8 0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475 #9 MaybeStartWalReceiver () at postmaster.c:5468 #10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131 #11 <signal handler called> #12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so #13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so #14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so #15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so #16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so #17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so #18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so Backtrace stopped: frame did not save the PC #0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so #1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so #2 0xfdee4ba4 in _rtld_exit () from /usr/libexec/ld.elf_so #3 0xfd54ea74 in __cxa_finalize () from /usr/lib/libc.so.12 #4 0xfd54e354 in exit () from /usr/lib/libc.so.12 #5 0x01c963c0 in proc_exit (code=code@entry=0) at ipc.c:152 #6 0x01c056e4 in AutoVacLauncherShutdown () at autovacuum.c:853 #7 0x01c071dc in AutoVacLauncherMain (argv=0x0, argc=0) at autovacuum.c:800 #8 0x01c07694 in StartAutoVacLauncher () at autovacuum.c:416 #9 0x01c11d3c in reaper (postgres_signal_arg=<optimized out>) at postmaster.c:3038 #10 <signal handler called> #11 0xfdee6f64 in _rtld_symlook_list () from /usr/libexec/ld.elf_so #12 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so #13 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so #14 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so #15 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so #16 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so Backtrace stopped: frame did not save the PC which is pretty much just the same thing we were seeing before commit 8acd8f869 :-( Now, we certainly cannot think that these are occurring early in postmaster startup. In the wake of 8acd8f869, we should expect that there's no further need to call rtld_bind at all in the postmaster, but seemingly that's not so. It's very frustrating that the backtrace stops where it does :-(. It's also strange that we're apparently running with signals enabled whereever it is that rtld_bind is getting called from. Could it be that sigaction is failing to install the requested signal mask, so that one postmaster signal handler is interrupting another? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-24%2021%3A45%3A29 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-29%2020%3A50%3A36
pgsql-hackers by date: