Re: Strange failure on mamba - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Strange failure on mamba |
Date | |
Msg-id | 20221201001957.htscqgtd3fftnuf4@awork3.anarazel.de |
In response to | Re: Strange failure on mamba (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Strange failure on mamba |
List | pgsql-hackers |
Hi,

On 2022-11-30 18:33:06 -0500, Tom Lane wrote:
> Also, I dug into my stuck processes some more, and I have to take
> back the claim that this is happening later than postmaster startup.
> All the stuck children are ones that either are launched on request
> from the startup process, or are launched as soon as we get the
> termination report for the startup process. So it's plausible that
> the problem is happening during the postmaster's first select()
> wait. I then got dirty with the assembly code, and found out that
> where the stack trace stops is an attempt to resolve this call:
>
>    0xfd6f7a48 <__select50+76>: bl 0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>
>
> which is inside libpthread.so and is trying to call something in libc.so.
> So we successfully got to the select() function from PostmasterMain, but
> that has a non-prelinked call to someplace else, and kaboom.

This whole area just seems quite broken in netbsd :(. We're clearly doing
stuff in a signal handler that we really shouldn't, but not being able to call
any functions implemented in libc, even if they're async signal safe (as
e.g. select is) means signals are basically not usable. Afaict this basically
means that signals are *never* safe on netbsd, as long as there's a single
external function call in a signal handler.

> I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
> I've verified that that causes the call inside __select50
> to get resolved before we reach main(), so I'm hopeful that
> it will cure the issue. But it'll probably be a few weeks
> before we can be sure.
>
> Don't have a good idea about a non-band-aid fix.

It's also a band aid, but perhaps a bit more reliable: We could link
statically to libc and libpthread.

Another approach could be to iterate over the loaded shared libraries during
postmaster startup and force symbols to be resolved. IIRC there's functions
that'd allow that. But it seems like a lot of work to work around an OS bug.

> Perhaps we should revert 8acd8f869 altogether, but then what?

FWIW, I think we should consider using those flags everywhere for the backend -
they make copy-on-write more effective and decrease connection overhead a bit,
because otherwise each backend process does the same symbol resolutions again
and again, dirtying memory post-fork.

> Even if somebody comes up with a rewrite to avoid doing interesting stuff in
> the postmaster's signal handlers, we surely wouldn't risk back-patching it.

Would that actually fix anything, given netbsd's brokenness? If we used a
latch like mechanism, the signal handler would still use functions in
libc. So postmaster could deadlock, at least during the first execution of a
signal handler? So I think 8acd8f869 continues to be important...

Greetings,

Andres Freund