Thread: common signal handler protection
In commit 97550c0, I added a "MyProcPid == getpid()" check in the SIGTERM handler for the startup process. This ensures that processes forked by system(3) (i.e., for restore_command) that have yet to install their own signal handlers do not call proc_exit() upon receiving SIGTERM. Without this protection, both the startup process and the restore_command process might try to remove themselves from the PGPROC shared array (among other things), which can end badly. Since then, I've been exploring a more general approach that would offer protection against similar issues in the future. We probably don't want signal handlers in these grandchild processes to touch shared memory at all. The attached 0001 is an attempt at adding such protection for all handlers installed via pqsignal(). In short, it stores the actual handler functions in a separate array, and sigaction() is given a wrapper function that performs the "MyProcPid == getpid()" check. If that check fails, the wrapper function installs the default signal handler and calls it. Besides allowing us to revert commit 97550c0 (see attached 0003), this wrapper handler could also restore errno, as shown in 0002. Right now, individual signal handlers must do this today as needed, but that seems easy to miss and prone to going unnoticed for a long time. I see two main downsides of this proposal: * Overhead: The wrapper handler calls a function pointer and getpid(), which AFAICT is a real system call on most platforms. That might not be a tremendous amount of overhead, but it's not zero, either. I'm particularly worried about signal-heavy code like synchronous replication. (Are there other areas that should be tested?) If this is a concern, perhaps we could allow certain processes to opt out of this wrapper handler, provided we believe it is unlikely to fork or that the handler code is safe to run in grandchild processes. * Race conditions: With these patches, pqsignal() becomes quite racy when used within signal handlers. Specifically, you might get a bogus return value. However, there are no in-tree callers of pqsignal() that look at the return value (and I don't see any reason they should), and it seems unlikely that pqsignal() will be used within signal handlers frequently, so this might not be a deal-breaker. I did consider trying to convert pqsignal() into a void function, but IIUC that would require an SONAME bump. For now, I've just documented the bogosity of the return values. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote: > +#ifdef NSIG > +#define PG_NSIG (NSIG) > +#else > +#define PG_NSIG (64) /* XXX: wild guess */ > +#endif > + Assert(signo < PG_NSIG); cfbot seems unhappy with this on Windows. IIUC we need to use PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one macro for all platforms. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Tue, Nov 21, 2023 at 04:40:06PM -0600, Nathan Bossart wrote: > cfbot seems unhappy with this on Windows. IIUC we need to use > PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one > macro for all platforms. Here's an attempt at fixing the Windows build. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote: > Subject: [PATCH v2 1/3] Check that MyProcPid == getpid() in all signal > handlers. > > In commit 97550c0711, we added a similar check to the SIGTERM > handler for the startup process. This commit adds this check to > all signal handlers installed with pqsignal(). This is done by > using a wrapper function that performs the check before calling the > actual handler. > > The hope is that this will offer more general protection against > grandchildren processes inadvertently modifying shared memory due > to inherited signal handlers. It's a bit unclear here what grandchildren refers to - it's presumably in reference to postmaster, but the preceding text doesn't even mention postmaster. I'd probably just say "child processes of the current process. > + > +#ifdef PG_SIGNAL_COUNT /* Windows */ > +#define PG_NSIG (PG_SIGNAL_COUNT) > +#elif defined(NSIG) > +#define PG_NSIG (NSIG) > +#else > +#define PG_NSIG (64) /* XXX: wild guess */ > +#endif Perhaps worth adding a static assert for at least a few common types of signals being below that value? That way we'd see a potential issue without needing to reach the code path. > +/* > + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function > + * as the handler for all signals. This wrapper handler function checks that > + * it is called within a process that the server knows about, and not a > + * grandchild process forked by system(3), etc. Similar comment to earlier - the grandchildren bit seems like a dangling reference. And also too specific - I think we could encounter this in single user mode as well? Perhaps it could be reframed to "postgres processes, as determined by having called InitProcessGlobals()"? >This check ensures that such > + * grandchildren processes do not modify shared memory, which could be > + * detrimental. "could be" seems a bit euphemistic :) > From b77da9c54590a71eb9921d6f16bf4ffb0543e87e Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <nathan@postgresql.org> > Date: Fri, 17 Nov 2023 14:00:12 -0600 > Subject: [PATCH v2 2/3] Centralize logic for restoring errno in signal > handlers. > > Presently, we rely on each individual signal handler to save the > initial value of errno and then restore it before returning if > needed. This is easily forgotten and, if missed, often goes > undetected for a long time. It's also just verbose :) > From 5734e0cf5f00bbd74504b45934f68e1c2c1290bd Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <nathan@postgresql.org> > Date: Fri, 17 Nov 2023 22:09:24 -0600 > Subject: [PATCH v2 3/3] Revert "Avoid calling proc_exit() in processes forked > by system()." > > Thanks to commit XXXXXXXXXX, this check in the SIGTERM handler for > the startup process is now obsolete and can be removed. Instead of > leaving around the elog(PANIC, ...) calls that are now unlikely to > be triggered and the dead function write_stderr_signal_safe(), I've > opted to just remove them for now. Thanks to modern version > control software, it will be trivial to dig those up if they are > ever needed in the future. > > This reverts commit 97550c0711972a9856b5db751539bbaf2f88884c. > > Reviewed-by: ??? > Discussion: ??? > --- > src/backend/postmaster/startup.c | 17 +---------------- > src/backend/storage/ipc/ipc.c | 4 ---- > src/backend/storage/lmgr/proc.c | 8 -------- > src/backend/utils/error/elog.c | 28 ---------------------------- > src/include/utils/elog.h | 6 ------ > 5 files changed, 1 insertion(+), 62 deletions(-) > > diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c > index 83dbed86b9..f40acd20ff 100644 > --- a/src/backend/postmaster/startup.c > +++ b/src/backend/postmaster/startup.c > @@ -19,8 +19,6 @@ > */ > #include "postgres.h" > > -#include <unistd.h> > - > #include "access/xlog.h" > #include "access/xlogrecovery.h" > #include "access/xlogutils.h" > @@ -113,20 +111,7 @@ static void > StartupProcShutdownHandler(SIGNAL_ARGS) > { > if (in_restore_command) > - { > - /* > - * If we are in a child process (e.g., forked by system() in > - * RestoreArchivedFile()), we don't want to call any exit callbacks. > - * The parent will take care of that. > - */ > - if (MyProcPid == (int) getpid()) > - proc_exit(1); > - else > - { > - write_stderr_signal_safe("StartupProcShutdownHandler() called in child process\n"); > - _exit(1); > - } > - } > + proc_exit(1); > else > shutdown_requested = true; > WakeupRecovery(); Hm. I wonder if this indicates an issue. Do the preceding changes perhaps make it more likely that a child process of the startup process could hang around for longer, because the signal we're delivering doesn't terminate child processes, because we'd just reset the signal handler? I think it's fine for the startup process, because we ask the startup process to shut down with SIGTERM, which defaults to exiting. But we do have a few processes that we do ask to shut down with other signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver, walsender are asked to shut down using SIGUSR2 IIRC. The only one where that could be an issue is archiver, I guess? > diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c > index 6591b5d6a8..1904d21795 100644 > --- a/src/backend/storage/ipc/ipc.c > +++ b/src/backend/storage/ipc/ipc.c > @@ -103,10 +103,6 @@ static int on_proc_exit_index, > void > proc_exit(int code) > { > - /* not safe if forked by system(), etc. */ > - if (MyProcPid != (int) getpid()) > - elog(PANIC, "proc_exit() called in child process"); > - > /* Clean up everything that must be cleaned up */ > proc_exit_prepare(code); > diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c > index e9e445bb21..5b663a2997 100644 > --- a/src/backend/storage/lmgr/proc.c > +++ b/src/backend/storage/lmgr/proc.c > @@ -806,10 +806,6 @@ ProcKill(int code, Datum arg) > > Assert(MyProc != NULL); > > - /* not safe if forked by system(), etc. */ > - if (MyProc->pid != (int) getpid()) > - elog(PANIC, "ProcKill() called in child process"); > - > /* Make sure we're out of the sync rep lists */ > SyncRepCleanupAtProcExit(); > > @@ -930,10 +926,6 @@ AuxiliaryProcKill(int code, Datum arg) > > Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS); > > - /* not safe if forked by system(), etc. */ > - if (MyProc->pid != (int) getpid()) > - elog(PANIC, "AuxiliaryProcKill() called in child process"); > - > auxproc = &AuxiliaryProcs[proctype]; > > Assert(MyProc == auxproc); I think we should leave these checks. It's awful to debug situations where a proc gets reused and the cost of the checks is irrelevant. The checks don't just protect against child processes, they also protect e.g. against calling those functions twice (we IIRC had cases of that). Greetings, Andres Freund
Thanks for taking a look. On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote: > On 2023-11-22 15:59:44 -0600, Nathan Bossart wrote: >> +/* >> + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function >> + * as the handler for all signals. This wrapper handler function checks that >> + * it is called within a process that the server knows about, and not a >> + * grandchild process forked by system(3), etc. > > Similar comment to earlier - the grandchildren bit seems like a dangling > reference. And also too specific - I think we could encounter this in single > user mode as well? > > Perhaps it could be reframed to "postgres processes, as determined by having > called InitProcessGlobals()"? Eh, apparently my attempt at being clever didn't pan out. I like your idea of specifying InitProcessGlobals(), but I might also include "e.g., client backends", too. > Hm. I wonder if this indicates an issue. Do the preceding changes perhaps > make it more likely that a child process of the startup process could hang > around for longer, because the signal we're delivering doesn't terminate child > processes, because we'd just reset the signal handler? I think it's fine for > the startup process, because we ask the startup process to shut down with > SIGTERM, which defaults to exiting. > > But we do have a few processes that we do ask to shut down with other > signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver, > walsender are asked to shut down using SIGUSR2 IIRC. The only one where that > could be an issue is archiver, I guess? This did cross my mind. AFAICT most default handlers already trigger an exit (including SIGUSR2), and for the ones that don't, we'd just end up in the same situation as if the signal arrived a moment later after the child process has installed its own handlers. And postmaster doesn't send certain signals (e.g., SIGHUP) to the whole process group, so we don't have the opposite problem where things like reloading configuration files terminates these child processes. So, I didn't notice any potential issues. Did you have anything else in mind? >> - /* not safe if forked by system(), etc. */ >> - if (MyProc->pid != (int) getpid()) >> - elog(PANIC, "AuxiliaryProcKill() called in child process"); >> - >> auxproc = &AuxiliaryProcs[proctype]; >> >> Assert(MyProc == auxproc); > > I think we should leave these checks. It's awful to debug situations where a > proc gets reused and the cost of the checks is irrelevant. The checks don't > just protect against child processes, they also protect e.g. against calling > those functions twice (we IIRC had cases of that). Sure, that's no problem. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Here is a new patch set with feedback addressed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On 2023-11-28 15:39:55 -0600, Nathan Bossart wrote: > From e4bea5353c2685457545b67396095e9b96156982 Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <nathan@postgresql.org> > Date: Tue, 28 Nov 2023 14:58:20 -0600 > Subject: [PATCH v3 1/3] Check that MyProcPid == getpid() in all signal > handlers. > > In commit 97550c0711, we added a similar check to the SIGTERM > handler for the startup process. This commit adds this check to > all signal handlers installed with pqsignal(). This is done by > using a wrapper function that performs the check before calling the > actual handler. > > The hope is that this will offer more general protection against > child processes of Postgres backends inadvertently modifying shared > memory due to inherited signal handlers. Another potential > follow-up improvement is to use this wrapper handler function to > restore errno instead of relying on each individual handler > function to do so. > > This commit makes the changes in commit 97550c0711 obsolete but > leaves reverting it for a follow-up commit. For a moment I was, wrongly, worried this would break signal handlers we intentionally inherit from postmaster. It's fine though, because we block signals in fork_process() until somewhere in InitPostmasterChild(), after we've called InitProcessGlobals(). But perhaps that should be commented upon somewhere? Greetings, Andres Freund
Hi, On 2023-11-27 16:16:25 -0600, Nathan Bossart wrote: > On Wed, Nov 22, 2023 at 02:59:45PM -0800, Andres Freund wrote: > > Hm. I wonder if this indicates an issue. Do the preceding changes perhaps > > make it more likely that a child process of the startup process could hang > > around for longer, because the signal we're delivering doesn't terminate child > > processes, because we'd just reset the signal handler? I think it's fine for > > the startup process, because we ask the startup process to shut down with > > SIGTERM, which defaults to exiting. > > > > But we do have a few processes that we do ask to shut down with other > > signals, wich do not trigger an exit by default, e.g. Checkpointer, archiver, > > walsender are asked to shut down using SIGUSR2 IIRC. The only one where that > > could be an issue is archiver, I guess? > > This did cross my mind. AFAICT most default handlers already trigger an > exit (including SIGUSR2), and for the ones that don't, we'd just end up in > the same situation as if the signal arrived a moment later after the child > process has installed its own handlers. And postmaster doesn't send > certain signals (e.g., SIGHUP) to the whole process group, so we don't have > the opposite problem where things like reloading configuration files > terminates these child processes. > > So, I didn't notice any potential issues. Did you have anything else in > mind? No, I just was wondering about issues in this area. I couldn't immediately think of a concrete scenario, so I thought I'd mention it here. Greetings, Andres Freund
On Tue, Nov 28, 2023 at 06:37:50PM -0800, Andres Freund wrote: > For a moment I was, wrongly, worried this would break signal handlers we > intentionally inherit from postmaster. It's fine though, because we block > signals in fork_process() until somewhere in InitPostmasterChild(), after > we've called InitProcessGlobals(). But perhaps that should be commented upon > somewhere? Good call. I expanded on the MyProcPid assertion in wrapper_handler() a bit. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
rebased for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote: > * Overhead: The wrapper handler calls a function pointer and getpid(), > which AFAICT is a real system call on most platforms. That might not be > a tremendous amount of overhead, but it's not zero, either. I'm > particularly worried about signal-heavy code like synchronous > replication. (Are there other areas that should be tested?) If this is > a concern, perhaps we could allow certain processes to opt out of this > wrapper handler, provided we believe it is unlikely to fork or that the > handler code is safe to run in grandchild processes. I finally spent some time trying to measure this overhead. Specifically, I sent many, many SIGUSR2 signals to postmaster, which just uses dummy_handler(), i.e., does nothing. I was just barely able to get wrapper_handler() to show up in the first page of 'perf top' in this extreme case, which leads me to think that the overhead might not be a problem. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote: > On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote: > > * Overhead: The wrapper handler calls a function pointer and getpid(), > > which AFAICT is a real system call on most platforms. That might not be > > a tremendous amount of overhead, but it's not zero, either. I'm > > particularly worried about signal-heavy code like synchronous > > replication. (Are there other areas that should be tested?) If this is > > a concern, perhaps we could allow certain processes to opt out of this > > wrapper handler, provided we believe it is unlikely to fork or that the > > handler code is safe to run in grandchild processes. > > I finally spent some time trying to measure this overhead. Specifically, I > sent many, many SIGUSR2 signals to postmaster, which just uses > dummy_handler(), i.e., does nothing. I was just barely able to get > wrapper_handler() to show up in the first page of 'perf top' in this > extreme case, which leads me to think that the overhead might not be a > problem. That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one of the cheapest system calls (IIRC only beat by close() of an invalid FD on recent-ish linux). If it were to become an issue, we'd much better spend our time reducing the millions of signals/sec that'd have to involve. Greetings, Andres Freund
On Tue, Feb 06, 2024 at 06:48:53PM -0800, Andres Freund wrote: > On 2024-02-06 20:39:41 -0600, Nathan Bossart wrote: >> I finally spent some time trying to measure this overhead. Specifically, I >> sent many, many SIGUSR2 signals to postmaster, which just uses >> dummy_handler(), i.e., does nothing. I was just barely able to get >> wrapper_handler() to show up in the first page of 'perf top' in this >> extreme case, which leads me to think that the overhead might not be a >> problem. > > That's what I'd expect. Signal delivery is fairly heavyweight, getpid() is one > of the cheapest system calls (IIRC only beat by close() of an invalid FD on > recent-ish linux). If it were to become an issue, we'd much better spend our > time reducing the millions of signals/sec that'd have to involve. Indeed. I'd like to get this committed (to HEAD only) in the next few weeks. TBH I'm not wild about the weird caveats (e.g., race conditions when pqsignal() is called within a signal handler), but I also think it is unlikely that they cause any issues in practice. Please do let me know if you have any concerns about this. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Sorry for the noise. On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote: > I'd like to get this committed (to HEAD only) in the next few weeks. TBH > I'm not wild about the weird caveats (e.g., race conditions when pqsignal() > is called within a signal handler), but I also think it is unlikely that > they cause any issues in practice. Please do let me know if you have any > concerns about this. Perhaps we should add a file global bool that is only set during wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if pqsignal() is called with it set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote: > On Wed, Feb 07, 2024 at 11:06:50AM -0600, Nathan Bossart wrote: > > I'd like to get this committed (to HEAD only) in the next few weeks. TBH > > I'm not wild about the weird caveats (e.g., race conditions when pqsignal() > > is called within a signal handler), but I also think it is unlikely that > > they cause any issues in practice. Please do let me know if you have any > > concerns about this. I don't. > Perhaps we should add a file global bool that is only set during > wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if > pqsignal() is called with it set. In older branches that might have been harder (due to forking from a signal handler and non-fatal errors thrown from signal handlers), but these days I think that should work. FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of a signal handler :) If it were just for the purpose of avoiding the issue you brought up it might not quite be worth it - but there are a lot of things we want to forbid in a signal handler. Memory allocations, acquiring locks, throwing non-panic errors, etc. That's one of the main reasons I like a common wrapper signal handler. Which reminded me of https://postgr.es/m/87msstvper.fsf%40163.com - the set of things we want to forbid are similar. I'm not sure there's really room to harmonize things, but I thought I'd raise it. Perhaps we should make the state a bitmap and have a single AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER) Greetings, Andres Freund
On Wed, Feb 07, 2024 at 10:40:50AM -0800, Andres Freund wrote: > On 2024-02-07 11:15:54 -0600, Nathan Bossart wrote: >> Perhaps we should add a file global bool that is only set during >> wrapper_handler(). Then we could Assert() or elog(ERROR, ...) if >> pqsignal() is called with it set. > > In older branches that might have been harder (due to forking from a signal > handler and non-fatal errors thrown from signal handlers), but these days I > think that should work. > > FWIW, I don't think elog(ERROR) would be appropriate, that'd be jumping out of > a signal handler :) *facepalm* Yes. > If it were just for the purpose of avoiding the issue you brought up it might > not quite be worth it - but there are a lot of things we want to forbid in a > signal handler. Memory allocations, acquiring locks, throwing non-panic > errors, etc. That's one of the main reasons I like a common wrapper signal > handler. > > Which reminded me of https://postgr.es/m/87msstvper.fsf%40163.com - the set of > things we want to forbid are similar. I'm not sure there's really room to > harmonize things, but I thought I'd raise it. > > Perhaps we should make the state a bitmap and have a single > AssertNotInState(HOLDING_SPINLOCK | IN_SIGNALHANDLER) Seems worth a try. I'll go ahead and proceed with these patches and leave this improvement for another thread. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Mon, Dec 18, 2023 at 11:32:47AM -0600, Nathan Bossart wrote: > rebased for cfbot I took a look over each of these. +1 for all. Thank you.
On Wed, Feb 14, 2024 at 11:55:43AM -0800, Noah Misch wrote: > I took a look over each of these. +1 for all. Thank you. Committed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com