Re: Strange failure on mamba - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Strange failure on mamba
Date
Msg-id 1467663.1669851186@sss.pgh.pa.us
Whole thread Raw
In response to Re: Strange failure on mamba  (Andres Freund <andres@anarazel.de>)
Responses Re: Strange failure on mamba
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2022-11-30 00:55:42 -0500, Tom Lane wrote:
>> Googling LD_BIND_NOW suggests that that's a Linux thing; do you know that
>> it should have an effect on NetBSD?

> I'm not at all sure it does, but I did see it listed in
> https://man.netbsd.org/ld.elf_so.1
>      LD_BIND_NOW      If defined immediate binding of Procedure Link Table
>                       (PLT) entries is performed instead of the default lazy
>                       method.

I checked the source code, and learned that (1) yes, rtld does pay
attention to this, and (2) the documentation lies: it has to be not
only defined, but nonempty, to get any effect.

Also, I dug into my stuck processes some more, and I have to take
back the claim that this is happening later than postmaster startup.
All the stuck children are ones that either are launched on request
from the startup process, or are launched as soon as we get the
termination report for the startup process.  So it's plausible that
the problem is happening during the postmaster's first select()
wait.  I then got dirty with the assembly code, and found out that
where the stack trace stops is an attempt to resolve this call:

   0xfd6f7a48 <__select50+76>:  bl      0xfd700ed0 <0000803c.got2.plt_pic32._sys___select50>

which is inside libpthread.so and is trying to call something in libc.so.
So we successfully got to the select() function from PostmasterMain, but
that has a non-prelinked call to someplace else, and kaboom.

In short, looks like Andres' theory is right.  It means that 8acd8f869
didn't actually fix anything, though it reduced the probability of the
failure by reducing the number of vulnerable PLT-indirect calls.

I've adjusted mamba to set LD_BIND_NOW=1 in its environment.
I've verified that that causes the call inside __select50
to get resolved before we reach main(), so I'm hopeful that
it will cure the issue.  But it'll probably be a few weeks
before we can be sure.

Don't have a good idea about a non-band-aid fix.  Perhaps we
should revert 8acd8f869 altogether, but then what?  Even if
somebody comes up with a rewrite to avoid doing interesting
stuff in the postmaster's signal handlers, we surely wouldn't
risk back-patching it.

It's possible that doing nothing is okay, at least in the
short term.  It's probably nigh impossible to hit this
issue on modern multi-CPU hardware.  Or perhaps we could revive
the idea of having postmaster.c do one dummy select() call
before it unblocks signals.

            regards, tom lane



pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: Non-decimal integer literals
Next
From: Tom Lane
Date:
Subject: Re: Non-decimal integer literals