Thread: common signal handler protection

common signal handler protection

From
Nathan Bossart
Date:
In commit 97550c0, I added a "MyProcPid == getpid()" check in the SIGTERM
handler for the startup process.  This ensures that processes forked by
system(3) (i.e., for restore_command) that have yet to install their own
signal handlers do not call proc_exit() upon receiving SIGTERM.  Without
this protection, both the startup process and the restore_command process
might try to remove themselves from the PGPROC shared array (among other
things), which can end badly.

Since then, I've been exploring a more general approach that would offer
protection against similar issues in the future.  We probably don't want
signal handlers in these grandchild processes to touch shared memory at
all.  The attached 0001 is an attempt at adding such protection for all
handlers installed via pqsignal().  In short, it stores the actual handler
functions in a separate array, and sigaction() is given a wrapper function
that performs the "MyProcPid == getpid()" check.  If that check fails, the
wrapper function installs the default signal handler and calls it.

Besides allowing us to revert commit 97550c0 (see attached 0003), this
wrapper handler could also restore errno, as shown in 0002.  Right now,
individual signal handlers must do this today as needed, but that seems
easy to miss and prone to going unnoticed for a long time.

I see two main downsides of this proposal:

* Overhead: The wrapper handler calls a function pointer and getpid(),
  which AFAICT is a real system call on most platforms.  That might not be
  a tremendous amount of overhead, but it's not zero, either.  I'm
  particularly worried about signal-heavy code like synchronous
  replication.  (Are there other areas that should be tested?)  If this is
  a concern, perhaps we could allow certain processes to opt out of this
  wrapper handler, provided we believe it is unlikely to fork or that the
  handler code is safe to run in grandchild processes.

* Race conditions: With these patches, pqsignal() becomes quite racy when
  used within signal handlers.  Specifically, you might get a bogus return
  value.  However, there are no in-tree callers of pqsignal() that look at
  the return value (and I don't see any reason they should), and it seems
  unlikely that pqsignal() will be used within signal handlers frequently,
  so this might not be a deal-breaker.  I did consider trying to convert
  pqsignal() into a void function, but IIUC that would require an SONAME
  bump.  For now, I've just documented the bogosity of the return values.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: common signal handler protection

From
Nathan Bossart
Date:
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
> +#ifdef NSIG
> +#define PG_NSIG (NSIG)
> +#else
> +#define PG_NSIG (64)            /* XXX: wild guess */
> +#endif

> +    Assert(signo < PG_NSIG);

cfbot seems unhappy with this on Windows.  IIUC we need to use
PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one
macro for all platforms.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Nathan Bossart
Date:
On Tue, Nov 21, 2023 at 04:40:06PM -0600, Nathan Bossart wrote:
> cfbot seems unhappy with this on Windows.  IIUC we need to use
> PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one
> macro for all platforms.

Here's an attempt at fixing the Windows build.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: common signal handler protection

From
Andres Freund
Date:
Hi,

On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote:
> Subject: [PATCH v2 1/3] Check that MyProcPid == getpid() in all signal
>  handlers.
> 
> In commit 97550c0711, we added a similar check to the SIGTERM
> handler for the startup process.  This commit adds this check to
> all signal handlers installed with pqsignal().  This is done by
> using a wrapper function that performs the check before calling the
> actual handler.
> 
> The hope is that this will offer more general protection against
> grandchildren processes inadvertently modifying shared memory due
> to inherited signal handlers.

It's a bit unclear here what grandchildren refers to - it's presumably in
reference to postmaster, but the preceding text doesn't even mention
postmaster. I'd probably just say "child processes of the current process.


> +
> +#ifdef PG_SIGNAL_COUNT            /* Windows */
> +#define PG_NSIG (PG_SIGNAL_COUNT)
> +#elif defined(NSIG)
> +#define PG_NSIG (NSIG)
> +#else
> +#define PG_NSIG (64)            /* XXX: wild guess */
> +#endif

Perhaps worth adding a static assert for at least a few common types of
signals being below that value? That way we'd see a potential issue without
needing to reach the code path.


> +/*
> + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
> + * as the handler for all signals.  This wrapper handler function checks that
> + * it is called within a process that the server knows about, and not a
> + * grandchild process forked by system(3), etc.

Similar comment to earlier - the grandchildren bit seems like a dangling
reference. And also too specific - I think we could encounter this in single
user mode as well?

Perhaps it could be reframed to "postgres processes, as determined by having
called InitProcessGlobals()"?


>This check ensures that such
> + * grandchildren processes do not modify shared memory, which could be
> + * detrimental.

"could be" seems a bit euphemistic :)


> From b77da9c54590a71eb9921d6f16bf4ffb0543e87e Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathan@postgresql.org>
> Date: Fri, 17 Nov 2023 14:00:12 -0600
> Subject: [PATCH v2 2/3] Centralize logic for restoring errno in signal
>  handlers.
> 
> Presently, we rely on each individual signal handler to save the
> initial value of errno and then restore it before returning if
> needed.  This is easily forgotten and, if missed, often goes
> undetected for a long time.

It's also just verbose :)


> From 5734e0cf5f00bbd74504b45934f68e1c2c1290bd Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathan@postgresql.org>
> Date: Fri, 17 Nov 2023 22:09:24 -0600
> Subject: [PATCH v2 3/3] Revert "Avoid calling proc_exit() in processes forked
>  by system()."
> 
> Thanks to commit XXXXXXXXXX, this check in the SIGTERM handler for
> the startup process is now obsolete and can be removed.  Instead of
> leaving around the elog(PANIC, ...) calls that are now unlikely to
> be triggered and the dead function write_stderr_signal_safe(), I've
> opted to just remove them for now.  Thanks to modern version
> control software, it will be trivial to dig those up if they are
> ever needed in the future.
> 
> This reverts commit 97550c0711972a9856b5db751539bbaf2f88884c.
> 
> Reviewed-by: ???
> Discussion: ???
> ---
>  src/backend/postmaster/startup.c | 17 +----------------
>  src/backend/storage/ipc/ipc.c    |  4 ----
>  src/backend/storage/lmgr/proc.c  |  8 --------
>  src/backend/utils/error/elog.c   | 28 ----------------------------
>  src/include/utils/elog.h         |  6 ------
>  5 files changed, 1 insertion(+), 62 deletions(-)
> 
> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
> index 83dbed86b9..f40acd20ff 100644
> --- a/src/backend/postmaster/startup.c
> +++ b/src/backend/postmaster/startup.c
> @@ -19,8 +19,6 @@
>   */
>  #include "postgres.h"
>  
> -#include <unistd.h>
> -
>  #include "access/xlog.h"
>  #include "access/xlogrecovery.h"
>  #include "access/xlogutils.h"
> @@ -113,20 +111,7 @@ static void
>  StartupProcShutdownHandler(SIGNAL_ARGS)
>  {
>      if (in_restore_command)
> -    {
> -        /*
> -         * If we are in a child process (e.g., forked by system() in
> -         * RestoreArchivedFile()), we don't want to call any exit callbacks.
> -         * The parent will take care of that.
> -         */
> -        if (MyProcPid == (int) getpid())
> -            proc_exit(1);
> -        else
> -        {
> -            write_stderr_signal_safe("StartupProcShutdownHandler() called in child process\n");
> -            _exit(1);
> -        }
> -    }
> +        proc_exit(1);
>      else
>          shutdown_requested = true;
>      WakeupRecovery();

Hm. I wonder if this indicates an issue.  Do the preceding changes perhaps
make it more likely that a child process of the startup process could hang
around for longer, because the signal we're delivering doesn't terminate child
processes, because we'd just reset the signal handler?  I think it's fine for
the startup process, because we ask the startup process to shut down with
SIGTERM, which defaults to exiting.

But we do have a few processes that we do ask to shut down with other
signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
walsender are asked to shut down using SIGUSR2 IIRC.  The only one where that
could be an issue is archiver, I guess?


> diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
> index 6591b5d6a8..1904d21795 100644
> --- a/src/backend/storage/ipc/ipc.c
> +++ b/src/backend/storage/ipc/ipc.c
> @@ -103,10 +103,6 @@ static int    on_proc_exit_index,
>  void
>  proc_exit(int code)
>  {
> -    /* not safe if forked by system(), etc. */
> -    if (MyProcPid != (int) getpid())
> -        elog(PANIC, "proc_exit() called in child process");
> -
>      /* Clean up everything that must be cleaned up */
>      proc_exit_prepare(code);

> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index e9e445bb21..5b663a2997 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -806,10 +806,6 @@ ProcKill(int code, Datum arg)
>  
>      Assert(MyProc != NULL);
>  
> -    /* not safe if forked by system(), etc. */
> -    if (MyProc->pid != (int) getpid())
> -        elog(PANIC, "ProcKill() called in child process");
> -
>      /* Make sure we're out of the sync rep lists */
>      SyncRepCleanupAtProcExit();
>  
> @@ -930,10 +926,6 @@ AuxiliaryProcKill(int code, Datum arg)
>  
>      Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
>  
> -    /* not safe if forked by system(), etc. */
> -    if (MyProc->pid != (int) getpid())
> -        elog(PANIC, "AuxiliaryProcKill() called in child process");
> -
>      auxproc = &AuxiliaryProcs[proctype];
>  
>      Assert(MyProc == auxproc);

I think we should leave these checks. It's awful to debug situations where a
proc gets reused and the cost of the checks is irrelevant. The checks don't
just protect against child processes, they also protect e.g. against calling
those functions twice (we IIRC had cases of that).

Greetings,

Andres Freund



Re: common signal handler protection

From
Nathan Bossart
Date:
Thanks for taking a look.

On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote:
> On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote:
>> +/*
>> + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
>> + * as the handler for all signals.  This wrapper handler function checks that
>> + * it is called within a process that the server knows about, and not a
>> + * grandchild process forked by system(3), etc.
> 
> Similar comment to earlier - the grandchildren bit seems like a dangling
> reference. And also too specific - I think we could encounter this in single
> user mode as well?
> 
> Perhaps it could be reframed to "postgres processes, as determined by having
> called InitProcessGlobals()"?

Eh, apparently my attempt at being clever didn't pan out.  I like your idea
of specifying InitProcessGlobals(), but I might also include "e.g., client
backends", too.

> Hm. I wonder if this indicates an issue.  Do the preceding changes perhaps
> make it more likely that a child process of the startup process could hang
> around for longer, because the signal we're delivering doesn't terminate child
> processes, because we'd just reset the signal handler?  I think it's fine for
> the startup process, because we ask the startup process to shut down with
> SIGTERM, which defaults to exiting.
> 
> But we do have a few processes that we do ask to shut down with other
> signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
> walsender are asked to shut down using SIGUSR2 IIRC.  The only one where that
> could be an issue is archiver, I guess?

This did cross my mind.  AFAICT most default handlers already trigger an
exit (including SIGUSR2), and for the ones that don't, we'd just end up in
the same situation as if the signal arrived a moment later after the child
process has installed its own handlers.  And postmaster doesn't send
certain signals (e.g., SIGHUP) to the whole process group, so we don't have
the opposite problem where things like reloading configuration files
terminates these child processes.

So, I didn't notice any potential issues.  Did you have anything else in
mind?

>> -    /* not safe if forked by system(), etc. */
>> -    if (MyProc->pid != (int) getpid())
>> -        elog(PANIC, "AuxiliaryProcKill() called in child process");
>> -
>>      auxproc = &AuxiliaryProcs[proctype];
>>  
>>      Assert(MyProc == auxproc);
> 
> I think we should leave these checks. It's awful to debug situations where a
> proc gets reused and the cost of the checks is irrelevant. The checks don't
> just protect against child processes, they also protect e.g. against calling
> those functions twice (we IIRC had cases of that).

Sure, that's no problem.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Andres Freund
Date:
Hi,

On 2023-11-28 15:39:55 -0600, Nathan Bossart wrote:
> From e4bea5353c2685457545b67396095e9b96156982 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathan@postgresql.org>
> Date: Tue, 28 Nov 2023 14:58:20 -0600
> Subject: [PATCH v3 1/3] Check that MyProcPid == getpid() in all signal
>  handlers.
> 
> In commit 97550c0711, we added a similar check to the SIGTERM
> handler for the startup process.  This commit adds this check to
> all signal handlers installed with pqsignal().  This is done by
> using a wrapper function that performs the check before calling the
> actual handler.
> 
> The hope is that this will offer more general protection against
> child processes of Postgres backends inadvertently modifying shared
> memory due to inherited signal handlers.  Another potential
> follow-up improvement is to use this wrapper handler function to
> restore errno instead of relying on each individual handler
> function to do so.
> 
> This commit makes the changes in commit 97550c0711 obsolete but
> leaves reverting it for a follow-up commit.

For a moment I was, wrongly, worried this would break signal handlers we
intentionally inherit from postmaster. It's fine though, because we block
signals in fork_process() until somewhere in InitPostmasterChild(), after
we've called InitProcessGlobals(). But perhaps that should be commented upon
somewhere?

Greetings,

Andres Freund



Re: common signal handler protection

From
Andres Freund
Date:
Hi,

On 2023-11-27 16:16:25 -0600, Nathan Bossart wrote:
> On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote:
> > Hm. I wonder if this indicates an issue.  Do the preceding changes perhaps
> > make it more likely that a child process of the startup process could hang
> > around for longer, because the signal we're delivering doesn't terminate child
> > processes, because we'd just reset the signal handler?  I think it's fine for
> > the startup process, because we ask the startup process to shut down with
> > SIGTERM, which defaults to exiting.
> > 
> > But we do have a few processes that we do ask to shut down with other
> > signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver,
> > walsender are asked to shut down using SIGUSR2 IIRC.  The only one where that
> > could be an issue is archiver, I guess?
> 
> This did cross my mind.  AFAICT most default handlers already trigger an
> exit (including SIGUSR2), and for the ones that don't, we'd just end up in
> the same situation as if the signal arrived a moment later after the child
> process has installed its own handlers.  And postmaster doesn't send
> certain signals (e.g., SIGHUP) to the whole process group, so we don't have
> the opposite problem where things like reloading configuration files
> terminates these child processes.
> 
> So, I didn't notice any potential issues.  Did you have anything else in
> mind?

No, I just was wondering about issues in this area. I couldn't immediately
think of a concrete scenario, so I thought I'd mention it here.

Greetings,

Andres Freund



Re: common signal handler protection

From
Nathan Bossart
Date:
On Tue, Nov 28, 2023 at 06:37:50PM -0800, Andres Freund wrote:
> For a moment I was, wrongly, worried this would break signal handlers we
> intentionally inherit from postmaster. It's fine though, because we block
> signals in fork_process() until somewhere in InitPostmasterChild(), after
> we've called InitProcessGlobals(). But perhaps that should be commented upon
> somewhere?

Good call.  I expanded on the MyProcPid assertion in wrapper_handler() a
bit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: common signal handler protection

From
Nathan Bossart
Date:
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
> * Overhead: The wrapper handler calls a function pointer and getpid(),
>   which AFAICT is a real system call on most platforms.  That might not be
>   a tremendous amount of overhead, but it's not zero, either.  I'm
>   particularly worried about signal-heavy code like synchronous
>   replication.  (Are there other areas that should be tested?)  If this is
>   a concern, perhaps we could allow certain processes to opt out of this
>   wrapper handler, provided we believe it is unlikely to fork or that the
>   handler code is safe to run in grandchild processes.

I finally spent some time trying to measure this overhead.  Specifically, I
sent many, many SIGUSR2 signals to postmaster, which just uses
dummy_handler(), i.e., does nothing.  I was just barely able to get
wrapper_handler() to show up in the first page of 'perf top' in this
extreme case, which leads me to think that the overhead might not be a
problem.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Andres Freund
Date:
Hi,

On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote:
> On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
> > * Overhead: The wrapper handler calls a function pointer and getpid(),
> >   which AFAICT is a real system call on most platforms.  That might not be
> >   a tremendous amount of overhead, but it's not zero, either.  I'm
> >   particularly worried about signal-heavy code like synchronous
> >   replication.  (Are there other areas that should be tested?)  If this is
> >   a concern, perhaps we could allow certain processes to opt out of this
> >   wrapper handler, provided we believe it is unlikely to fork or that the
> >   handler code is safe to run in grandchild processes.
> 
> I finally spent some time trying to measure this overhead.  Specifically, I
> sent many, many SIGUSR2 signals to postmaster, which just uses
> dummy_handler(), i.e., does nothing.  I was just barely able to get
> wrapper_handler() to show up in the first page of 'perf top' in this
> extreme case, which leads me to think that the overhead might not be a
> problem.

That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one
of the cheapest system calls (IIRC only beat by close() of an invalid FD on
recent-ish linux).  If it were to become an issue, we'd much better spend our
time reducing the millions of signals/sec that'd have to involve.

Greetings,

Andres Freund



Re: common signal handler protection

From
Nathan Bossart
Date:
On Tue, Feb 06, 2024 at 06:48:53PM -0800, Andres Freund wrote:
> On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote:
>> I finally spent some time trying to measure this overhead.  Specifically, I
>> sent many, many SIGUSR2 signals to postmaster, which just uses
>> dummy_handler(), i.e., does nothing.  I was just barely able to get
>> wrapper_handler() to show up in the first page of 'perf top' in this
>> extreme case, which leads me to think that the overhead might not be a
>> problem.
> 
> That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one
> of the cheapest system calls (IIRC only beat by close() of an invalid FD on
> recent-ish linux).  If it were to become an issue, we'd much better spend our
> time reducing the millions of signals/sec that'd have to involve.

Indeed.

I'd like to get this committed (to HEAD only) in the next few weeks.  TBH
I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
is called within a signal handler), but I also think it is unlikely that
they cause any issues in practice.  Please do let me know if you have any
concerns about this.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Nathan Bossart
Date:
Sorry for the noise.

On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote:
> I'd like to get this committed (to HEAD only) in the next few weeks.  TBH
> I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
> is called within a signal handler), but I also think it is unlikely that
> they cause any issues in practice.  Please do let me know if you have any
> concerns about this.

Perhaps we should add a file global bool that is only set during
wrapper_handler().  Then we could Assert() or elog(ERROR, ...) if
pqsignal() is called with it set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Andres Freund
Date:
Hi,

On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote:
> On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote:
> > I'd like to get this committed (to HEAD only) in the next few weeks.  TBH
> > I'm not wild about the weird caveats (e.g., race conditions when pqsignal()
> > is called within a signal handler), but I also think it is unlikely that
> > they cause any issues in practice.  Please do let me know if you have any
> > concerns about this.

I don't.


> Perhaps we should add a file global bool that is only set during
> wrapper_handler().  Then we could Assert() or elog(ERROR, ...) if
> pqsignal() is called with it set.

In older branches that might have been harder (due to forking from a signal
handler and non-fatal errors thrown from signal handlers), but these days I
think that should work.

FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of
a signal handler :)


If it were just for the purpose of avoiding the issue you brought up it might
not quite be worth it - but there are a lot of things we want to forbid in a
signal handler. Memory allocations, acquiring locks, throwing non-panic
errors, etc. That's one of the main reasons I like a common wrapper signal
handler.


Which reminded me of https://postgr.es/m/87msstvper.fsf%40163.com - the set of
things we want to forbid are similar. I'm not sure there's really room to
harmonize things, but I thought I'd raise it.

Perhaps we should make the state a bitmap and have a single
  AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER)

Greetings,

Andres Freund



Re: common signal handler protection

From
Nathan Bossart
Date:
On Wed, Feb 07, 2024 at 10:40:50AM -0800, Andres Freund wrote:
> On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote:
>> Perhaps we should add a file global bool that is only set during
>> wrapper_handler().  Then we could Assert() or elog(ERROR, ...) if
>> pqsignal() is called with it set.
> 
> In older branches that might have been harder (due to forking from a signal
> handler and non-fatal errors thrown from signal handlers), but these days I
> think that should work.
> 
> FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of
> a signal handler :)

*facepalm*  Yes.

> If it were just for the purpose of avoiding the issue you brought up it might
> not quite be worth it - but there are a lot of things we want to forbid in a
> signal handler. Memory allocations, acquiring locks, throwing non-panic
> errors, etc. That's one of the main reasons I like a common wrapper signal
> handler.
> 
> Which reminded me of https://postgr.es/m/87msstvper.fsf%40163.com - the set of
> things we want to forbid are similar. I'm not sure there's really room to
> harmonize things, but I thought I'd raise it.
> 
> Perhaps we should make the state a bitmap and have a single
>   AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER)

Seems worth a try.  I'll go ahead and proceed with these patches and leave
this improvement for another thread.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: common signal handler protection

From
Noah Misch
Date:
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote:
> rebased for cfbot

I took a look over each of these.  +1 for all.  Thank you.



Re: common signal handler protection

From
Nathan Bossart
Date:
On Wed, Feb 14, 2024 at 11:55:43AM -0800, Noah Misch wrote:
> I took a look over each of these.  +1 for all.  Thank you.

Committed.  Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com