Re: Strange failure on mamba - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Strange failure on mamba |
Date | |
Msg-id | 1467663.1669851186@sss.pgh.pa.us |
In response to | Re: Strange failure on mamba (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Strange failure on mamba
|
List | pgsql-hackers |
Andres Freund <andres@anarazel.de> writes:
> On 2022-11-30 00:55:42 -0500, Tom Lane wrote:
>> Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
>> it should have an effect on NetBSD?

> I'm not at all sure it does, but I did see it listed in
> https://man.netbsd.org/ld.elf_so.1
>      LD_BIND_NOW      If defined immediate binding of Procedure Link Table
>                       (PLT) entries is performed instead of the default lazy
>                       method.

I checked the source code, and learned that (1) yes, rtld does
pay attention to this, and (2) the documentation lies: it has
to be not only defined, but nonempty, to get any effect.

Also, I dug into my stuck processes some more, and I have to
take back the claim that this is happening later than postmaster
startup.  All the stuck children are ones that either are launched
on request from the startup process, or are launched as soon as
we get the termination report for the startup process.  So it's
plausible that the problem is happening during the postmaster's
first select() wait.

I then got dirty with the assembly code, and found out that
where the stack trace stops is an attempt to resolve this call:

   0xfd6f7a48 <__select50+76>:  bl      0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>

which is inside libpthread.so and is trying to call something
in libc.so.  So we successfully got to the select() function
from PostmasterMain, but that has a non-prelinked call to
someplace else, and kaboom.

In short, looks like Andres' theory is right.  It means that 8acd8f869
didn't actually fix anything, though it reduced the probability of the
failure by reducing the number of vulnerable PLT-indirect calls.

I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
I've verified that that causes the call inside __select50
to get resolved before we reach main(), so I'm hopeful that
it will cure the issue.  But it'll probably be a few weeks
before we can be sure.

Don't have a good idea about a non-band-aid fix.
Perhaps we should revert 8acd8f869 altogether, but then what?
Even if somebody comes up with a rewrite to avoid doing
interesting stuff in the postmaster's signal handlers, we
surely wouldn't risk back-patching it.

It's possible that doing nothing is okay, at least in the
short term.  It's probably nigh impossible to hit this
issue on modern multi-CPU hardware.  Or perhaps we could revive
the idea of having postmaster.c do one dummy select() call
before it unblocks signals.

			regards, tom lane