Thread: Postmaster self-deadlock due to PLT linkage resolution

Postmaster self-deadlock due to PLT linkage resolution

From
Tom Lane
Date:
Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
has failed repeatedly since I set it up.  I have now run the cause of
that to ground [1], and here's what's happening: if the postmaster
receives a signal just before it first waits at the select() in
ServerLoop, it can self-deadlock.  During the postmaster's first use of
select(), the dynamic loader needs to resolve the PLT branch table entry
that the core executable uses to reach select() in libc.so, and it locks
the loader's internal data structures while doing that.  If we enter
a signal handler while the lock is held, and the handler needs to do
anything that also requires the lock, the postmaster is frozen.

The probability of this happening seems remarkably small, since there's
only one narrow window per postmaster lifetime, and there's just not
that many potential signal causes active at that time either.
Nonetheless I have traces showing it happening (1) because we receive
SIGCHLD for startup process termination and (2) because we receive
SIGUSR1 from the startup process telling us to start walreceivers.
I guess that mamba's slow single-CPU hardware interacts with the
NetBSD scheduler in just the right way to make it more probable than
you'd think.  On typical modern machines, the postmaster would almost
certainly manage to wait before the startup process is able to signal
it.  Still, "almost certainly" is not "certainly".

The attached patch seems to fix the problem, by forcing resolution of
the PLT link before we unblock signals.  It depends on the assumption
that another select() call appearing within postmaster.c will share
the same PLT link, which seems pretty safe.

I'd originally intended to make this code "#ifdef __NetBSD__",
but on looking into the FreeBSD sources I find much the same locking
logic in their dynamic loader, and now I'm wondering if such behavior
isn't pretty standard.  The added calls should have negligible cost,
so it doesn't seem unreasonable to do them everywhere.

(Of course, a much better answer is to get out of the business of
doing nontrivial stuff in signal handlers.  But even if we get that
done soon, we'd surely not back-patch it.)

Thoughts?

            regards, tom lane

[1] https://gnats.netbsd.org/56979

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1664fcee2a..f9070da574 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1728,6 +1728,33 @@ ServerLoop(void)

     nSockets = initMasks(&readmask);

+    /*
+     * Issue a dummy select() call before we unblock signals.  The point of
+     * this is that select() may be reached via a PLT indirection, and that
+     * indirection will need to be resolved on first use, and on some
+     * platforms the dynamic loader's resolution code takes a lock.  If that
+     * lock is taken while signals are unblocked then we risk self-deadlock
+     * should a signal handler do anything that also needs the loader lock.
+     * This is a low-probability failure, since it can only happen if a signal
+     * arrives with just the right timing during the first pass through the
+     * server loop.  However, it has been seen repeatedly on slow single-CPU
+     * machines.
+     *
+     * Likewise, issue a dummy pg_usleep() to ensure we've resolved any PLT
+     * indirections needed in the PM_WAIT_DEAD_END path.  (The PG_SETMASK
+     * calls could also be at hazard, had we not issued one of those when
+     * blocking signals to begin with.)
+     */
+    {
+        struct timeval timeout0;
+
+        timeout0.tv_sec = 0;
+        timeout0.tv_usec = 0;
+        (void) select(0, NULL, NULL, NULL, &timeout0);
+        pg_usleep(0);
+    }
+
+    /* Main server loop begins here */
     for (;;)
     {
         fd_set        rmask;
@@ -1750,6 +1777,7 @@ ServerLoop(void)
         {
             PG_SETMASK(&UnBlockSig);

+            /* Add nothing else in this unblocked-signals section */
             pg_usleep(100000L); /* 100 msec seems reasonable */
             selres = 0;

@@ -1765,6 +1793,7 @@ ServerLoop(void)

             PG_SETMASK(&UnBlockSig);

+            /* Add nothing else in this unblocked-signals section */
             selres = select(nSockets, &rmask, NULL, NULL, &timeout);

             PG_SETMASK(&BlockSig);

Re: Postmaster self-deadlock due to PLT linkage resolution

From
Noah Misch
Date:
On Mon, Aug 29, 2022 at 03:43:55PM -0400, Tom Lane wrote:
> I'd originally intended to make this code "#ifdef __NetBSD__",
> but on looking into the FreeBSD sources I find much the same locking
> logic in their dynamic loader, and now I'm wondering if such behavior
> isn't pretty standard.

I doubt it's standard.  POSIX specifies select() to be async-signal-safe.
This NetBSD bug makes select() not be async-signal-safe.

> The added calls should have negligible cost,
> so it doesn't seem unreasonable to do them everywhere.

Agreed.  I would make the comment mention the NetBSD version that prompted
this, so we have a better chance of removing the workaround in a few decades.



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Thomas Munro
Date:
On Tue, Aug 30, 2022 at 7:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
> has failed repeatedly since I set it up.  I have now run the cause of
> that to ground [1], and here's what's happening: if the postmaster
> receives a signal just before it first waits at the select() in
> ServerLoop, it can self-deadlock.  During the postmaster's first use of
> select(), the dynamic loader needs to resolve the PLT branch table entry
> that the core executable uses to reach select() in libc.so, and it locks
> the loader's internal data structures while doing that.  If we enter
> a signal handler while the lock is held, and the handler needs to do
> anything that also requires the lock, the postmaster is frozen.

. o O ( pselect() wouldn't have this problem, but it's slightly too
new for the back branches that didn't yet require SUSv3... drat )

> I'd originally intended to make this code "#ifdef __NetBSD__",
> but on looking into the FreeBSD sources I find much the same locking
> logic in their dynamic loader, and now I'm wondering if such behavior
> isn't pretty standard.  The added calls should have negligible cost,
> so it doesn't seem unreasonable to do them everywhere.

FWIW I suspect FreeBSD can't break like this in a program linked with
libthr, because it has a scheme for deferring signals while the
runtime linker holds locks.  _rtld_bind calls _thr_rtld_rlock_acquire,
which uses the THR_CRITICAL_ENTER mechanism to cause thr_sighandler to
defer until release.  For a non-thread program, I'm not entirely sure,
but I don't think the fork() problem exists there.  (Could be wrong,
based on a quick look.)

> (Of course, a much better answer is to get out of the business of
> doing nontrivial stuff in signal handlers.  But even if we get that
> done soon, we'd surely not back-patch it.)

+1



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Robert Haas
Date:
On Tue, Aug 30, 2022 at 8:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> FWIW I suspect FreeBSD can't break like this in a program linked with
> libthr, because it has a scheme for deferring signals while the
> runtime linker holds locks.  _rtld_bind calls _thr_rtld_rlock_acquire,
> which uses the THR_CRITICAL_ENTER mechanism to cause thr_sighandler to
> defer until release.  For a non-thread program, I'm not entirely sure,
> but I don't think the fork() problem exists there.  (Could be wrong,
> based on a quick look.)

Well that seems a bit ironic, considering that Tom has worried in the
past that linking with threading libraries would break stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Thomas Munro
Date:
On Wed, Aug 31, 2022 at 12:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 30, 2022 at 8:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > FWIW I suspect FreeBSD can't break like this in a program linked with
> > libthr, because it has a scheme for deferring signals while the
> > runtime linker holds locks.  _rtld_bind calls _thr_rtld_rlock_acquire,
> > which uses the THR_CRITICAL_ENTER mechanism to cause thr_sighandler to
> > defer until release.  For a non-thread program, I'm not entirely sure,
> > but I don't think the fork() problem exists there.  (Could be wrong,
> > based on a quick look.)
>
> Well that seems a bit ironic, considering that Tom has worried in the
> past that linking with threading libraries would break stuff.

Hah.  To clarify, non-thread builds don't have that exact fork()
problem, but it turns out they do have a related state clobbering
problem elsewhere, which I've reported.



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Andres Freund
Date:
Hi,

On 2022-08-29 15:43:55 -0400, Tom Lane wrote:
> Buildfarm member mamba (NetBSD-current on prairiedog's former hardware)
> has failed repeatedly since I set it up.  I have now run the cause of
> that to ground [1], and here's what's happening: if the postmaster
> receives a signal just before it first waits at the select() in
> ServerLoop, it can self-deadlock.  During the postmaster's first use of
> select(), the dynamic loader needs to resolve the PLT branch table entry
> that the core executable uses to reach select() in libc.so, and it locks
> the loader's internal data structures while doing that.  If we enter
> a signal handler while the lock is held, and the handler needs to do
> anything that also requires the lock, the postmaster is frozen.

Ick.


> The attached patch seems to fix the problem, by forcing resolution of
> the PLT link before we unblock signals.  It depends on the assumption
> that another select() call appearing within postmaster.c will share
> the same PLT link, which seems pretty safe.

Hm, what stops the same problem from occuring with other functions?

Perhaps it'd be saner to default to building with -Wl,-z,now? That should fix
the problem too, right (and if we combine it with relro, it'd be a security
improvement to boot).

Greetings,

Andres Freund



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-29 15:43:55 -0400, Tom Lane wrote:
>> The attached patch seems to fix the problem, by forcing resolution of
>> the PLT link before we unblock signals.  It depends on the assumption
>> that another select() call appearing within postmaster.c will share
>> the same PLT link, which seems pretty safe.

> Hm, what stops the same problem from occuring with other functions?

These few lines are the only part of the postmaster that runs with
signals enabled and unblocked.

> Perhaps it'd be saner to default to building with -Wl,-z,now? That should fix
> the problem too, right (and if we combine it with relro, it'd be a security
> improvement to boot).

Hm.  Not sure if that works on NetBSD, but I'll check it out.

            regards, tom lane



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Andres Freund
Date:
Hi,

On 2022-08-30 13:24:39 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Perhaps it'd be saner to default to building with -Wl,-z,now? That should fix
> > the problem too, right (and if we combine it with relro, it'd be a security
> > improvement to boot).
> 
> Hm.  Not sure if that works on NetBSD, but I'll check it out.

FWIW, it's a decently (well over 10 years) old thing I think. And it's documented in
the netbsd ld manpage and their packaging guide (albeit indirectly, with their
tooling doing the work of specifying the flags):
https://www.netbsd.org/docs/pkgsrc/hardening.html#hardening.audit.relrofull

Greetings,

Andres Freund



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-30 13:24:39 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> Perhaps it'd be saner to default to building with -Wl,-z,now? That should fix
>>> the problem too, right (and if we combine it with relro, it'd be a security
>>> improvement to boot).

>> Hm.  Not sure if that works on NetBSD, but I'll check it out.

> FWIW, it's a decently (well over 10 years) old thing I think. And it's documented in
> the netbsd ld manpage and their packaging guide (albeit indirectly, with their
> tooling doing the work of specifying the flags):
> https://www.netbsd.org/docs/pkgsrc/hardening.html#hardening.audit.relrofull

It does appear that they use GNU ld, and I've just finished confirming
that each of those switches has the expected effects on my PPC box.
So yeah, this looks like a better answer.

Do we want to install this just for NetBSD, or more widely?
I think we'd better back-patch it for NetBSD, so I'm inclined
to be conservative about the change.

            regards, tom lane



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Andres Freund
Date:
Hi,

On 2022-08-30 14:07:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-08-30 13:24:39 -0400, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> Perhaps it'd be saner to default to building with -Wl,-z,now? That should fix
> >>> the problem too, right (and if we combine it with relro, it'd be a security
> >>> improvement to boot).
> 
> >> Hm.  Not sure if that works on NetBSD, but I'll check it out.
> 
> > FWIW, it's a decently (well over 10 years) old thing I think. And it's documented in
> > the netbsd ld manpage and their packaging guide (albeit indirectly, with their
> > tooling doing the work of specifying the flags):
> > https://www.netbsd.org/docs/pkgsrc/hardening.html#hardening.audit.relrofull
> 
> It does appear that they use GNU ld, and I've just finished confirming
> that each of those switches has the expected effects on my PPC box.
> So yeah, this looks like a better answer.

Cool.


> Do we want to install this just for NetBSD, or more widely?
> I think we'd better back-patch it for NetBSD, so I'm inclined
> to be conservative about the change.

It's likely a good idea to enable it everywhere applicable, but I agree that
we shouldn't unnecessarily do so in the backbranches. So I'd be inclined to
add it to the netbsd template for the backbranches.

For HEAD I can see putting it into all the applicable templates, adding an
AC_LINK_IFELSE() test, or just putting it into the meson stuff.

Greetings,

Andres Freund



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-30 14:07:41 -0400, Tom Lane wrote:
>> Do we want to install this just for NetBSD, or more widely?
>> I think we'd better back-patch it for NetBSD, so I'm inclined
>> to be conservative about the change.

> It's likely a good idea to enable it everywhere applicable, but I agree that
> we shouldn't unnecessarily do so in the backbranches. So I'd be inclined to
> add it to the netbsd template for the backbranches.

> For HEAD I can see putting it into all the applicable templates, adding an
> AC_LINK_IFELSE() test, or just putting it into the meson stuff.

For the moment I'll stick it into the netbsd template.  I'm not on
board with having the meson stuff generating different executables
than the Makefiles do, so if someone wants to propose applying
this widely, they'll need to fix both.  Seems like that is a good
thing to consider after the meson patches land.  We don't need
unnecessary churn in that area before that.

            regards, tom lane



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Andres Freund
Date:
Hi,

On 2022-08-30 14:32:26 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-08-30 14:07:41 -0400, Tom Lane wrote:
> >> Do we want to install this just for NetBSD, or more widely?
> >> I think we'd better back-patch it for NetBSD, so I'm inclined
> >> to be conservative about the change.
> 
> > It's likely a good idea to enable it everywhere applicable, but I agree that
> > we shouldn't unnecessarily do so in the backbranches. So I'd be inclined to
> > add it to the netbsd template for the backbranches.
> 
> > For HEAD I can see putting it into all the applicable templates, adding an
> > AC_LINK_IFELSE() test, or just putting it into the meson stuff.
> 
> For the moment I'll stick it into the netbsd template.

Cool.


> I'm not on board with having the meson stuff generating different
> executables than the Makefiles do, so if someone wants to propose applying
> this widely, they'll need to fix both.  Seems like that is a good thing to
> consider after the meson patches land.  We don't need unnecessary churn in
> that area before that.

Yea, I didn't like that idea either, hence listing it last...

Greetings,

Andres Freund



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Thomas Munro
Date:
On Wed, Aug 31, 2022 at 1:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Aug 31, 2022 at 12:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Aug 30, 2022 at 8:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > FWIW I suspect FreeBSD can't break like this in a program linked with
> > > libthr, because it has a scheme for deferring signals while the
> > > runtime linker holds locks.  _rtld_bind calls _thr_rtld_rlock_acquire,
> > > which uses the THR_CRITICAL_ENTER mechanism to cause thr_sighandler to
> > > defer until release.  For a non-thread program, I'm not entirely sure,
> > > but I don't think the fork() problem exists there.  (Could be wrong,
> > > based on a quick look.)
> >
> > Well that seems a bit ironic, considering that Tom has worried in the
> > past that linking with threading libraries would break stuff.
>
> Hah.  To clarify, non-thread builds don't have that exact fork()
> problem, but it turns out they do have a related state clobbering
> problem elsewhere, which I've reported.

For the record, reporting that resulted in a change for non-libthr rtld:

https://cgit.freebsd.org/src/commit/?id=a687683b997c5805ecd6d8278798b7ef00d9908f



Re: Postmaster self-deadlock due to PLT linkage resolution

From
Thomas Munro
Date:
After commit 7389aad6, I think commit 8acd8f86's linker changes (+
meson.build's equivalent) must now be redundant?