Reinitialize stack base after fork (for the benefit of rr)? - Mailing list pgsql-hackers

From Andres Freund
Subject Reinitialize stack base after fork (for the benefit of rr)?
Date
Msg-id 20200327182217.ubrrl32lyfhxfwk5@alap3.anarazel.de
Whole thread Raw
Responses Re: Reinitialize stack base after fork (for the benefit of rr)?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Reinitialize stack base after fork (for the benefit of rr)?  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi,

I've found rr [1] very useful to debug issues in postgres. The ability
to hit a bug, and then e.g. identify a pointer with problematic
contents, set a watchpoint on its contents, and reverse-continue is
extremely powerful.

Unfortunately, when running postgres, it currently occasionally triggers
spurious stack-too-deep errors. That turns out to be because it has to
use an alternative stack in some corner cases (IIUC when a signal
arrives while already in a signal handler). That corner case can
unfortunately be hit from within postmaster, and at least can lead to
sigusr1_handler() being called with an alternative stack set.

Unfortunately that means that processes that postmaster fork()s while
using that alternative stack will continue their live using that
alternative stack. Which then subsequently means that our stack depth
checks always trigger.

I've not seen this trigger for normal backends (which makes sense,
they're not started from a signal handler), but for bgworkers. In
particular parallel workers are prone to hit the issue.


I've locally fixed the issue by computing the stack base address anew
for postmaster children. Currently in InitPostmasterChild().

I'd like to get that change upstream. The rr hackers have fixed a number
of other issues that could be hit with postgres, but they couldn't see a
good way to address the potential for a different signal stack in this
edge case. And it doesn't seem crazy to me to compute the stack base
again in postmaster children: It's cheap enough and it's extremely
unlikely that postmaster uses up a crazy amount of stack.

I also don't find it too crazy to guard against forks in signal handlers
leading to a different stack base address. It's a pretty odd thing to
do.


Tom, while imo not a fix of the right magnitude here: Are you planning /
hoping to work again on your postmaster latch patch? I think it'd be
really good if we could restructure the postmaster code to do far far
less in signal handlers. And the postmaster latch patch seems like a big
step in that direction.  I think we mostly dropped it due to the release
schedule last time round?

Greetings,

Andres Freund

[1] https://github.com/mozilla/rr/



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: backup manifests
Next
From: Hamid Akhtar
Date:
Subject: Re: BUG #16040: PL/PGSQL RETURN QUERY statement never uses a parallel plan